Guideline-Optimized Large Language Models for Colonoscopy and Pathology Interpretation: A Three-Model Comparative Study Revealing Distinct Performance Behaviors
Poster Abstract

Aims

Large language models (LLMs) are increasingly used to support colonoscopy report interpretation, but their diagnostic reliability and behavior before and after guideline optimization have not been systematically evaluated. This study compared the accuracy and completeness of ChatGPT, DeepSeek and Copilot in interpreting colonoscopy and corresponding anatomopathology reports, assessed the effect of structured evidence-based prompting, examined cost-related practical considerations and identified performance domains in which any model surpassed ChatGPT.

Methods

Ninety-one colonoscopy reports, combined with their available anatomopathology results, were processed by each LLM under two conditions: before and after receiving standardized guideline-based instructions. Each model was asked to interpret the full clinical sequence, integrating both endoscopic findings and pathology outcomes. The dataset was derived from real-world procedures performed at Hôtel-Dieu de France, Beirut, Lebanon, ensuring clinically authentic case representation. Two binary outcomes were recorded: accuracy, defined as correct interpretation of the report and its pathology correlation, and completeness, defined as inclusion of all essential reporting elements. Paired analyses used the McNemar test; comparisons were conducted within each model (pre- versus post-guideline) and between models at each timepoint.
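For readers unfamiliar with the paired design, the within-model pre- versus post-guideline comparison can be sketched as an exact McNemar test on the discordant pairs. The implementation and the example counts below are illustrative assumptions, not the study's actual data or code.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on the discordant pairs of a
    paired binary outcome (e.g. accuracy pre- vs post-guideline).

    b: reports correct pre-guideline but incorrect post-guideline
    c: reports incorrect pre-guideline but correct post-guideline
    Concordant pairs do not enter the test statistic.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of change
    k = min(b, c)
    # Two-sided exact binomial p-value under H0: P(improve) = 0.5
    p = 2 * sum(comb(n, i) * 0.5**n for i in range(k + 1))
    return min(p, 1.0)

# Hypothetical discordant counts on a scale consistent with 91 reports:
# 2 reports degraded and 35 improved after guideline prompting.
print(f"p = {mcnemar_exact(2, 35):.2e}")
```

With such an asymmetric split of discordant pairs, the exact p-value falls well below 0.001, which is the pattern the study reports for the largest pre/post improvements.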

Results

Across all three LLMs, the most frequent pre-guideline error was incorrect assignment of surveillance intervals, reflecting insufficient integration of evidence-based follow-up rules before structured prompting. ChatGPT demonstrated the largest improvement after guideline optimization, with accuracy increasing from 56.0% to 92.3% (p < 0.001) and completeness from 79.1% to 92.3% (p < 0.001). DeepSeek showed a moderate increase in accuracy (53.8% to 70.3%, p < 0.01) but a significant decline in completeness (46.1% to 29.7%, p < 0.001), suggesting a restrictive behavior in which structured prompting leads the model to remove rather than expand essential descriptors. Copilot improved in both domains, with accuracy rising from 57.1% to 86.8% and completeness from 57.1% to 81.3% (p < 0.001).

Before guidelines, accuracy did not differ significantly between models. Completeness, however, differed markedly, with Copilot achieving the highest pre-guideline completeness at 95.6%, outperforming both ChatGPT and DeepSeek. This was the only performance domain in which ChatGPT was surpassed. After guideline prompting, ChatGPT achieved the highest overall performance. Its accuracy significantly exceeded DeepSeek's and was comparable to Copilot's, while its completeness remained significantly higher than DeepSeek's and borderline higher than Copilot's.

Practical considerations included cost: ChatGPT was the only paid model, whereas DeepSeek and Copilot were freely accessible. Although ChatGPT delivered the best overall post-guideline performance, cost differences may influence accessibility, equity and feasibility in low-resource settings. Copilot, as a free tool demonstrating strong accuracy and completeness after optimization, represents a scalable alternative when paid models are not feasible.

Conclusions

Guideline prompting markedly enhanced LLM performance in the interpretation of colonoscopy and anatomopathology findings. ChatGPT showed the most consistent and robust improvements and was significantly superior to DeepSeek in all post-guideline comparisons. Copilot outperformed ChatGPT in pre-guideline completeness and remained a strong free alternative after optimization. DeepSeek's decline in completeness highlights functional limitations and a restrictive response to structured guidance. Overall, ChatGPT was the most reliable model, while Copilot offers a promising cost-accessible option for endoscopic reporting support. As LLMs become increasingly accessible to laypeople seeking guidance after their colonoscopy, our findings highlight the importance of directing patients toward models that prioritize accuracy and completeness, ensuring that self-education remains aligned with evidence-based recommendations.