Comparative Evaluation of Multimodal Large Language Models for Computer-Assisted Endoscopic Assessment in Ulcerative Colitis
Poster Abstract

Aims

To systematically compare the diagnostic accuracy of five contemporary multimodal large language models (MLLMs: Gemini-2.5-Pro, Grok-4, GPT-4o, GPT-5, and Qwen-VL-Max) in evaluating the Mayo Endoscopic Score (MES) for ulcerative colitis (UC), and to explore their consistency and performance across various intestinal segments and MES categories.

Methods

A total of 402 authentic endoscopic images from patients with UC were collected, covering the entire colon from the ileocecal region to the rectum. Three experienced inflammatory bowel disease (IBD) experts independently reviewed and graded these images; the 283 images on which all three experts reached consensus were retained, and these consensus grades served as the reference standard. In the study's second stage, the images were randomly presented to the MLLMs and to two senior IBD physicians without specifying the intestinal segment, and then re-presented to the MLLMs with segmental information before grading. Model and physician performance were compared, and stratified analyses were conducted by intestinal segment and MES grade.

Results

The diagnostic accuracies (Acc) of the two IBD physicians were 81.6% and 78.4%, respectively, with strong inter-observer agreement (κ = 0.692). Among the MLLMs, GPT-5 achieved the highest overall performance (F1: GPT-5 0.720 > GPT-4o 0.602 > Gemini-2.5-Pro 0.480 > Grok-4 0.415 > Qwen-VL-Max 0.338), and its diagnostic accuracy was comparable to that of the human physicians (GPT-5 Acc 71.7% vs. Senior Physician 2 Acc 78.4%, P = 0.068). The other models exhibited significantly lower diagnostic performance than the experienced IBD physicians (all P < 0.001). When additional segmental information was provided, the sigmoid colon was the most accurately assessed region (mean F1 across models 0.682), whereas the rectum and ileocecal region remained the most challenging (0.447 and 0.493, respectively). The provision of segmental information significantly enhanced the performance of the lower-performing models. Moreover, both the models and the human physicians showed the lowest accuracy at MES = 1 (physicians mean Acc ≈ 60.3%, models mean Acc ≈ 39.4%), indicating that the mild-activity grade remains the most challenging to classify owing to its inherent subjectivity.
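The agreement and performance metrics reported above (Cohen's κ for inter-observer agreement, macro-averaged F1 across the four MES grades) can be reproduced from paired grade lists. The following is a minimal illustrative sketch in pure Python; the label sequences are invented toy data, not study data, and the function names are our own.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters grading the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                 # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)

def macro_f1(true, pred, labels=(0, 1, 2, 3)):
    """Unweighted mean of per-class F1 over MES grades 0-3."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(labels)

# Illustrative toy labels only (not the study's data)
ref  = [0, 1, 1, 2, 3, 2, 1, 0]   # consensus reference grades
pred = [0, 1, 2, 2, 3, 2, 0, 0]   # hypothetical model grades
print(round(cohen_kappa(ref, pred), 3))  # → 0.667
print(round(macro_f1(ref, pred), 3))     # → 0.775
```

Macro-averaging weights each MES grade equally, so a model cannot mask poor performance on the rare mild-activity grade (MES = 1) behind good performance on the more common grades.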

Conclusions

GPT-5 demonstrated diagnostic performance comparable to that of senior IBD physicians in MES grading, whereas the other MLLMs require further optimization. Among intestinal segments, the rectum and ileocecal region, and among severity grades, mild activity (MES = 1), posed common challenges for both physicians and models. Future efforts should focus on targeted training for these challenging segments and grades, and on integrating additional clinical multimodal data to advance the clinical implementation of intelligent endoscopic assessment.