From Detection to Decision: Impact of Guideline-Guided Reasoning on the Performance of AI Models in Upper GI Endoscopy
Poster Abstract

Aims

Artificial intelligence (AI) is increasingly used in upper gastrointestinal endoscopy (UGIE), particularly for the early detection of esophageal and gastric neoplasia. However, its ability to integrate clinical reasoning and generate management recommendations has not been formally evaluated. Recent work suggests that next-generation AI should move beyond visual detection toward contextual interpretation, supported by hybrid architectures that combine computer vision with large language models (LLMs) aligned with evidence-based guidelines. This study aimed to evaluate the capacity of ChatGPT and DeepSeek to interpret UGIE reports and issue follow-up or treatment recommendations before and after the structured integration of international guideline frameworks.

Methods

A comparative analysis was conducted at Hôtel-Dieu de France in Beirut using ninety-five UGIE reports and corresponding pathology results. Both ChatGPT 5.0 and DeepSeek interpreted each report under two independent conditions: (1) without guidelines, and (2) after structured integration of AGA, ACG, ASGE, and ESGE recommendations. Interpretation accuracy and completeness were assessed using a binary scoring system (1 = correct, 0 = incorrect). Statistical analysis was performed in R using McNemar’s test with continuity correction, with significance set at p<0.05.
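The paired comparison described above (the same 95 reports scored correct/incorrect under two conditions) can be illustrated with a minimal sketch of McNemar's test with continuity correction. This is an illustrative stdlib-Python reimplementation, not the study's actual R code, and the discordant-pair counts in the usage example are hypothetical:

```python
import math

def mcnemar_cc(b, c):
    """McNemar's test with continuity correction for paired binary outcomes.

    b: pairs scored correct under condition 1 but incorrect under condition 2
    c: pairs scored incorrect under condition 1 but correct under condition 2
    Returns (chi-square statistic, two-sided p-value).
    """
    if b + c == 0:
        # No discordant pairs: no evidence of any difference.
        return 0.0, 1.0
    # Continuity-corrected statistic: (|b - c| - 1)^2 / (b + c)
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical example: 2 reports correct only before guideline integration,
# 20 correct only after.
stat, p = mcnemar_cc(2, 20)
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

Only the discordant pairs (b and c) enter the statistic; pairs scored the same under both conditions carry no information about the paired difference, which is why McNemar's test, rather than an unpaired chi-square test, is the appropriate choice for this design.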

Results

Before guideline integration, both models demonstrated identical accuracy (57.9%), while completeness reached 41.1% for ChatGPT and 46.3% for DeepSeek (p<0.05). Recurrent errors involved inappropriate surveillance recommendations, particularly for sporadic non-dysplastic fundic gland polyps and focal antral intestinal metaplasia, potentially exposing patients to unnecessary procedures. After guideline integration, performance improved significantly for both models (p<0.001): ChatGPT reached 97.9% accuracy and 95.8% completeness, while DeepSeek achieved 81.1% accuracy and 72.6% completeness. ChatGPT remained significantly superior across all parameters (p<0.001), and most initial systematic errors were corrected after guideline incorporation.

Conclusions

Without guideline guidance, AI models show major limitations and tend to over-recommend surveillance, failing to account for procedure-related risks. The initially similar performance of DeepSeek (free) and ChatGPT 5.0 (paid) indicates that commercial access alone does not ensure reliability without validated guideline integration, illustrating the bias that may arise when patients interpret their own reports using unguided AI. Integrating evidence-based recommendations dramatically improves accuracy and consistency, with ChatGPT showing clear post-guidance superiority. This approach may support clinicians in standardizing decisions and enhancing safety and quality throughout the endoscopic care pathway.