Assessing Performance of Open-Source Vision-Language Models for Colorectal Polyp Sizing: A Comparison with Endoscopist Visual Estimation

E. Cristea

N. Mashayekhi

C. Gefflot

M. Kandlikar-bloch

V. Michal

C. Hassan

D. Rex

A. Barkun

R. Djinbachian

N. Shahidi

S. Grover

R. Battat

D. Von renteln

Poster Abstract

Aims

Accurate colorectal polyp measurement is essential because size guides surveillance intervals, resection strategy, and colorectal cancer risk assessment. Emerging vision-language models (VLMs) can estimate polyp size from endoscopic images, but their performance relative to expert visual assessment is unclear. We compared the accuracy and bias of endoscopist visual assessment with those of four open-source VLMs for polyp sizing.

Methods

At the Centre Hospitalier de l’Université de Montréal (CHUM), patients were prospectively enrolled in an endoscopic video databank with consent for AI use. For each resected polyp, endoscopists provided a visual size estimate; still images were then submitted to four VLMs. The reference standard was microscopic size obtained from freshly resected specimens. The VLMs generated size estimates from prompt-driven input with few-shot learning (FSL). The primary outcome was categorical accuracy across three clinically relevant size classes: diminutive (≤5 mm), small (5.001–9.999 mm), and large (≥10 mm). Secondary outcomes were continuous size bias, overestimation frequency (categorical and continuous), the intraclass correlation coefficient (ICC) between VLM predictions, and Light’s κ for categorical agreement between VLMs. Logistic regression models (for accuracy and overestimation frequencies) and linear models (for bias) were fitted with sizing method as a fixed effect, using generalized estimating equations (GEEs) with an exchangeable correlation structure to account for within-polyp clustering. Tukey correction was applied for pairwise comparisons.
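As an illustration of how the outcome definitions above could be computed for a single sizing method, the following is a minimal Python sketch, not the authors' analysis code. The function names `size_class` and `sizing_outcomes` are hypothetical. It assumes bias is defined as estimate minus reference (so positive values mean overestimation) and treats any size strictly between 5 and 10 mm as "small". It returns raw proportions and a raw mean only; the GEE-based confidence intervals and comparisons reported in this abstract are not reproduced.

```python
# Ordered size classes, used to decide whether an estimate falls in a
# larger class than the reference (categorical overestimation).
_ORDER = {"diminutive": 0, "small": 1, "large": 2}


def size_class(size_mm: float) -> str:
    """Map a size in mm to diminutive (<=5), small (>5 and <10), or large (>=10)."""
    if size_mm <= 5:
        return "diminutive"
    if size_mm < 10:
        return "small"
    return "large"


def sizing_outcomes(estimates_mm, reference_mm):
    """Summarise one sizing method against the reference standard.

    Assumption: bias = estimate - reference, so positive bias means the
    method overestimates size on average.
    """
    pairs = list(zip(estimates_mm, reference_mm))
    n = len(pairs)
    est_cls = [size_class(e) for e, _ in pairs]
    ref_cls = [size_class(r) for _, r in pairs]
    return {
        # Share of polyps placed in the same class as the reference.
        "accuracy": sum(e == r for e, r in zip(est_cls, ref_cls)) / n,
        # Mean signed error in mm.
        "bias_mm": sum(e - r for e, r in pairs) / n,
        # Share of polyps placed in a larger class than the reference.
        "categorical_overestimation": sum(
            _ORDER[e] > _ORDER[r] for e, r in zip(est_cls, ref_cls)
        ) / n,
        # Share of polyps with an estimate above the reference size.
        "overestimation": sum(e > r for e, r in pairs) / n,
    }
```

Calling `sizing_outcomes(estimates, reference)` once per sizing method (visual assessment and each VLM) would give the four unadjusted quantities that correspond to the columns of Table 1.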

Results

We analyzed 132 polyps (78.0% diminutive, 21.2% small, 0.8% large). Visual assessment had higher categorical accuracy (81.7%; 95% CI, 74.1–87.4) than all VLMs, significantly so for 3 of 4 models (e.g., Qwen3-VL 67.0% [58.4–74.5]; difference visual–Qwen3-VL 14.7 percentage points [1.4 to 28.1]; p=0.022). Gemma3 showed the lowest accuracy (26.5% [19.7–34.7]). Three VLMs had a mean positive bias >0.5 mm (Qwen3-VL 0.95 mm [0.53–1.38]; Llama4 1.35 mm [0.89–1.81]; Gemma3 2.71 mm [2.37–3.04]). Mistral-Small3.2 showed a numerically smaller mean bias than visual assessment (difference visual–Mistral-Small3.2 0.39 mm [-0.43 to 1.21], a confidence interval that includes zero) and a lower overestimation frequency (43.9% [35.7–52.5] vs visual 62.0% [53.4–69.9]; difference -18.1 percentage points [-33.3 to -2.8]; p=0.011). Overall, VLMs tended to overestimate continuous sizes more than visual assessment. Agreement between VLMs was poor (ICC 0.212 [0.122–0.313]; κ=0.106); excluding Gemma3 did not materially improve agreement.
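For readers unfamiliar with the agreement statistic reported above, Light's κ is commonly computed as the mean of Cohen's κ over all pairs of raters. The sketch below shows that calculation in plain Python, under the assumption that each VLM is treated as one rater assigning a size class to each polyp. The function names are hypothetical, and the software used for the abstract's analysis is not stated in the source.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    # Observed agreement: share of items given the same label by both raters.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_exp = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_exp == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)


def light_kappa(ratings):
    """Light's kappa: mean Cohen's kappa over all rater pairs.

    `ratings` maps a rater name to that rater's list of labels,
    with all lists in the same item order.
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

With four VLMs there are six rater pairs, so a low average can arise either from uniformly weak pairwise agreement or from one outlying model; the abstract notes that excluding Gemma3 did not materially change the result.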

Sizing method	Accuracy (%, 95% CI)	Bias (mm, 95% CI)	Categorical size overestimation (%, 95% CI)	Size overestimation (%, 95% CI)
Visual	81.7% (74.1,87.4)	0.58 (0.32,0.84)	12.3% (7.7,19.1)	62.0% (53.4,69.9)
VLM: Qwen3-VL	67.0% (58.4,74.5)	0.95 (0.53,1.38)	21.5% (15.3,29.4)	62.3% (53.7,70.2)
VLM: Llama4	68.9% (60.5,76.3)	1.35 (0.89,1.81)	18.2% (12.5,25.7)	73.5% (65.3,80.3)
VLM: Gemma3	26.5% (19.7,34.7)	2.71 (2.37,3.04)	72.0% (63.7,79.0)	86.4% (79.4,91.2)
VLM: Mistral-Small3.2	67.4% (59.0,74.9)	0.19 (-0.37,0.76)	16.7% (11.2,24.0)	43.9% (35.7,52.5)

Table 1. Primary and secondary outcome point estimates with 95% CIs, by sizing method. Estimates come from logistic models (accuracy and overestimation frequencies) and a linear model (bias), fitted with GEEs to account for within-polyp correlation.

Conclusions

In this study, standard visual assessment outperformed current VLMs for clinically relevant size categories, and VLMs frequently overestimated size. Likely contributors include the absence of calibrated reference scales, fisheye lens distortion, and variability in endoscopic distance and angulation. Polyp sizing remains a complex task. VLMs should not currently be used for polyp sizing and require rigorous size-aware training and calibration before clinical integration.
