Achalasia and Esophageal Diverticula: Comparative Performance of ChatGPT-5 and Gemini in Therapeutic Decision-Making
Poster Abstract

Aims

Conversational artificial intelligence (AI) models are increasingly used as clinical decision-support tools. This study aimed to compare ChatGPT (GPT-5) and Gemini (Google) in their ability to provide safe, justified, and guideline-compliant therapeutic strategies for esophageal achalasia and diverticula, with particular emphasis on the indication for peroral endoscopic myotomy (POEM).

Methods

Twenty-two simulated clinical cases were developed: 18 achalasia cases (types I, II, III, atypical, post-radiotherapy, post-dilation, or post-POEM) and 4 esophageal diverticula cases (epiphrenic and Zenker, symptomatic and asymptomatic). Cases included complex clinical scenarios such as cardiac or neurological comorbidities, megaesophagus, pre-existing reflux, or post-treatment recurrence.

Each case was independently submitted to both AI models using identical phrasing. The primary question was:

“Is POEM indicated for this patient? Justify your answer.”

Responses were evaluated by a clinical expert using five criteria: therapeutic relevance, clinical justification, safety, clarity, and guideline adherence. The first three criteria were scored 0–2 and the last two 0–1, for a maximum score of 8 points. Qualitative analyses assessed overall trends, response structure, adherence to international recommendations, and patient safety.

Results

ChatGPT achieved a higher overall score (96%) than Gemini (89.8%). Both models reached identical performance for therapeutic relevance (90.9%) and clarity (100%). ChatGPT outperformed Gemini in safety (100% vs. 95.5%) and explicit guideline citation (86% vs. 27%).

Integrated examples revealed consistent differences in approach:

- In an elderly patient with megaesophagus and cardiac comorbidity, ChatGPT recommended Botox or pneumatic dilation prior to POEM, demonstrating a risk-adapted, guideline-aligned strategy. In contrast, Gemini suggested POEM directly, acknowledging its efficacy but without referencing guideline-based precautions for fragile patients.

- In a post-fundoplication achalasia recurrence, ChatGPT advised caution due to altered anatomy and emphasized alternative treatments, whereas Gemini proposed POEM as a technically feasible option, illustrating its more procedural and pragmatic orientation.

Qualitative trends emerged clearly across cases.

ChatGPT consistently adhered to ACG/ASGE/WGO guidelines, provided structured bullet-point responses, and favored less invasive or staged approaches in high-risk scenarios. Gemini delivered more detailed, educational explanations of pathophysiology and procedural mechanics but was less concise and occasionally favored aggressive interventions even in complex situations. ChatGPT justified decisions through explicit links between physiology, expected outcomes, and risks, whereas Gemini relied more on practical reasoning and technical considerations.

Conclusions

Both AI models produced clinically relevant and safe recommendations. However, ChatGPT demonstrated superior guideline adherence, structured clarity, and risk-sensitive therapeutic reasoning. Gemini offered richer pedagogical explanations and a pragmatic technical viewpoint. Their complementary strengths suggest potential value in combined use for clinical decision-support and medical training in disorders such as achalasia and esophageal diverticula.