Large language model-driven analysis and report generation of endoscopy videos - a pilot study
Poster Abstract

Aims

Multimodal large language models (MLLMs) can automatically analyze clinical video, but evidence from full esophagogastroduodenoscopy (EGD) examinations and the impact of on-screen computer-aided detection/diagnosis (CAD) overlays on MLLM behavior remain unclear. We tested whether an MLLM can produce clinically adequate EGD reports and whether a CAD overlay changes its performance.

Methods

We analyzed five complete EGD videos with Gemini 2.5 Pro in paired versions: 1) the clean video and 2) the same video with a CAD overlay. Five blinded endoscopists rated report adequacy in three domains (completeness, visualization, and lesion characteristics). MLLM accuracy for landmarks/lesions was further assessed by two blinded expert endoscopists using a time-window rule (a model detection counted as correct if it occurred within ±2 seconds of the expert-annotated timestamp).
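As an illustration of how such a rule can be applied, the sketch below counts model detections that fall within ±2 seconds of an expert-annotated timestamp. The function and variable names (match_detections, TOLERANCE_S) are illustrative assumptions rather than part of the study protocol, and matching each annotation at most once is one plausible reading of the rule.

```python
from typing import List

# Tolerance of the time-window rule described above (seconds); illustrative constant.
TOLERANCE_S = 2.0

def match_detections(model_times: List[float], expert_times: List[float],
                     tolerance: float = TOLERANCE_S) -> int:
    """Count model detections falling within +/- tolerance seconds of an
    expert-annotated timestamp; each annotation is matched at most once."""
    unmatched = sorted(expert_times)
    correct = 0
    for t in sorted(model_times):
        for i, e in enumerate(unmatched):
            if abs(t - e) <= tolerance:
                correct += 1
                del unmatched[i]  # prevent double-counting a single annotation
                break
    return correct

# Example: the first two detections fall within the +/- 2 s window, the third does not.
print(match_detections([10.5, 42.0, 90.0], [11.9, 43.0, 120.0]))  # -> 2
```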

Results

In this retrospective pilot study, five archived diagnostic EGD procedures from five patients were available as full-length videos. Across five raters, completeness was judged adequate in 56.0% of ratings (14/25) with the clean video versus 48.0% (12/25) with the overlay video (p=0.500). Visualization ratings were identical (36.0% [9/25] for both; p=1.000), as were lesion characteristics ratings (16.0% [4/25] for both; p=1.000). For landmark agreement, MLLM performance with the clean video versus the overlay video was: accuracy 0.55 [95% CI 0.43–0.67] vs 0.33 [0.23–0.46], p=0.029; sensitivity 0.53 [0.40–0.66] vs 0.35 [0.24–0.49], p=0.122; specificity 0.67 [0.35–0.88] vs 0.22 [0.06–0.55], p=0.125.
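For reference, the reported accuracy, sensitivity, and specificity follow the standard definitions over true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) derived from the time-window matching; the underlying per-landmark counts are not given in this abstract.

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
```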

Conclusions

In its current form, Gemini 2.5 Pro cannot report upper endoscopy findings at a level adequate for clinical use, and substantial task-specific optimization and validation are required before deployment.