Aims
Accurate colorectal polyp measurement is essential because size guides surveillance intervals, resection strategy, and colorectal cancer risk assessment. Emerging Vision-Language Models (VLMs) can estimate polyp size from endoscopic images, but their performance relative to expert visual assessment is unclear. We compared the accuracy and bias of visual assessment versus several VLMs for polyp sizing.
Methods
At the Centre Hospitalier de l’Université de Montréal (CHUM), patients were prospectively enrolled in an endoscopic video databank with consent for AI use. For each resected polyp, the endoscopist provided a visual size estimate, and still images of the same polyp were then submitted to four VLMs. The reference standard was microscopic size obtained from freshly resected specimens. Each VLM generated its estimate from a text prompt using few-shot learning (FSL). The primary outcome was categorical accuracy in three clinically relevant classes: diminutive (≤5 mm), small (5.001–9.999 mm), and large (≥10 mm). Secondary outcomes included continuous size bias, overestimation frequency (categorical and continuous), the intraclass correlation coefficient (ICC) between VLM predictions, and Light’s κ for categorical agreement between VLMs. Logistic regression models (for accuracy and overestimation frequencies) and linear models (for bias), each with sizing method as a fixed effect, were fitted using generalized estimating equations (GEEs) with an exchangeable correlation structure to account for within-polyp clustering. Pairwise comparisons were Tukey-corrected.
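The per-method outcome definitions above can be illustrated with a minimal sketch. It is not the study's analysis code: the function names are hypothetical, the class boundaries are assumed to be ≤5 mm, >5 to <10 mm, and ≥10 mm, bias is assumed to mean estimate minus reference, and the GEE modelling and confidence intervals are not reproduced.

```python
from typing import Dict, Sequence


def size_class(size_mm: float) -> int:
    """Map a size in mm to an ordinal class: 0 diminutive, 1 small, 2 large.

    Assumption: boundaries are <=5 mm, >5 to <10 mm, and >=10 mm.
    """
    if size_mm <= 5.0:
        return 0
    if size_mm < 10.0:
        return 1
    return 2


def summarize(estimates: Sequence[float], references: Sequence[float]) -> Dict[str, float]:
    """Raw (unmodelled) outcome summaries for one sizing method."""
    n = len(references)
    if n == 0 or n != len(estimates):
        raise ValueError("estimates and references must be non-empty and equal length")

    est_cls = [size_class(e) for e in estimates]
    ref_cls = [size_class(r) for r in references]

    return {
        # Share of polyps placed in the same size class as the reference.
        "accuracy": sum(e == r for e, r in zip(est_cls, ref_cls)) / n,
        # Share of polyps placed in a larger size class than the reference.
        "categorical_overestimation": sum(e > r for e, r in zip(est_cls, ref_cls)) / n,
        # Share of polyps whose estimate exceeds the reference size in mm.
        "continuous_overestimation": sum(e > r for e, r in zip(estimates, references)) / n,
        # Mean signed error in mm (assumed: estimate minus reference).
        "bias_mm": sum(e - r for e, r in zip(estimates, references)) / n,
    }
```

For example, `summarize([4.0, 6.0, 12.0, 5.0], [3.0, 5.0, 9.0, 6.0])` uses made-up sizes, not study data, and returns the four summaries for that toy input.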
Results
We analyzed 132 polyps (78.0% diminutive, 21.2% small, 0.8% large). Visual assessment had higher categorical accuracy (81.7%; 95% CI, 74.1–87.4) than all VLMs, significantly so for 3 of 4 models (e.g., Qwen3-VL 67.0% [58.4–74.5]; difference visual–Qwen3-VL 14.7% [1.4 to 28.1]; p=0.022). Gemma3 showed the lowest accuracy (26.5% [19.7–34.7]). Three VLMs had a mean positive bias >0.5 mm (Qwen3-VL 0.95 mm [0.53–1.38]; Llama4 1.35 mm [0.89–1.81]; Gemma3 2.71 mm [2.37–3.04]). Mistral-Small3.2 showed a smaller mean bias than visual assessment (difference visual–Mistral-Small3.2 0.39 mm [-0.43 to 1.21]) and a lower overestimation frequency (43.9% [35.7–52.5] vs visual 62.0% [53.4–69.9]; difference -18.1% [-33.3 to -2.8]; p=0.011). Overall, VLMs tended to overestimate continuous sizes more than visual assessment. Agreement between VLMs was poor (ICC 0.212 [0.122–0.313]; κ=0.106); excluding Gemma3 did not materially improve agreement.
| Sizing method | Accuracy (%, 95% CI) | Bias (mm, 95% CI) | Categorical size over-estimation (%, 95% CI) | Size over-estimation (%, 95% CI) |
| --- | --- | --- | --- | --- |
| Visual | 81.7% (74.1, 87.4) | 0.58 (0.32, 0.84) | 12.3% (7.7, 19.1) | 62.0% (53.4, 69.9) |
| VLM: Qwen3-VL | 67.0% (58.4, 74.5) | 0.95 (0.53, 1.38) | 21.5% (15.3, 29.4) | 62.3% (53.7, 70.2) |
| VLM: Llama4 | 68.9% (60.5, 76.3) | 1.35 (0.89, 1.81) | 18.2% (12.5, 25.7) | 73.5% (65.3, 80.3) |
| VLM: Gemma3 | 26.5% (19.7, 34.7) | 2.71 (2.37, 3.04) | 72.0% (63.7, 79.0) | 86.4% (79.4, 91.2) |
| VLM: Mistral-Small3.2 | 67.4% (59.0, 74.9) | 0.19 (-0.37, 0.76) | 16.7% (11.2, 24.0) | 43.9% (35.7, 52.5) |
Table 1. Primary and secondary outcome point estimates (95% CI). Results arise from logistic and linear (bias) models with GEEs that account for intra-polyp correlations.
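Light's κ, the between-VLM categorical agreement statistic reported in Results, is the mean of Cohen's κ over all pairs of raters. The sketch below shows that definition in plain Python. It is not the study's code: the function names are hypothetical and the confidence interval is not computed.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("rating lists must be non-empty and equal length")

    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if p_expected == 1.0:
        # Both raters used a single identical label: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


def light_kappa(ratings: Dict[str, Sequence[Hashable]]) -> float:
    """Light's kappa: mean pairwise Cohen's kappa across all raters."""
    if len(ratings) < 2:
        raise ValueError("at least two raters are required")
    pairs = list(combinations(ratings.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

In this setting each rater would be one VLM and each label the size class it assigned to a polyp, so four models yield six pairwise κ values that are averaged.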
Conclusions
In this study, standard visual assessment outperformed current VLMs for clinically relevant size categories, and VLMs frequently overestimated size. Likely contributors include the absence of calibrated reference scales, fisheye lens distortion, and variability in endoscopic distance and angulation. Polyp sizing remains a complex task. VLMs should not be used for polyp sizing and require rigorous size-aware training and calibration before clinical integration.