Generative AI models for quality assessment in upper gastrointestinal lymphadenectomy: a video-based analysis
Poster Abstract

Aims

D2 lymphadenectomy is the standard of care for locally advanced gastric cancer, but it is subject to significant technical variability and lacks standardized quality assessment. Artificial Intelligence (AI), particularly Multimodal Large Language Models (MLLMs), offers a potential solution for objective surgical evaluation. We aimed to evaluate the performance of state-of-the-art MLLMs in assessing the quality and visibility of D2 lymphadenectomy.

Methods

In this comparative study, 20 videos of minimally invasive gastrectomy/esophagectomy (10 dissection phase, 10 clean field) were analyzed. Three MLLMs (GPT-5, Gemini 2.5 Pro, Gemini 3 Pro Preview) were tested across fifteen configurations varying in frame count (10, 30, 50) and temperature (0, 1). The primary outcomes, “Quality of Dissection” and “Anatomical Visibility,” were scored on a 4-point Likert scale and compared against a reference standard of three expert surgeons. Performance was evaluated on a dichotomized outcome (scores 1-3 vs 4) using AUC, sensitivity, and specificity, and with pointwise ordinal metrics, to capture both clinical relevance and granular accuracy.
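
The dichotomized evaluation can be illustrated with a minimal sketch. It assumes that a score of 4 is treated as the positive class and that AUC is computed on the dichotomized model output; the abstract does not state either choice, and the scores below are made-up illustrative values, not study data. All function names are hypothetical.

```python
def dichotomize(scores, threshold=4):
    """Map 4-point Likert scores to 1 (score >= threshold) or 0 (scores 1-3).

    Assumption: score 4 is the positive class.
    """
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity_specificity(reference, predicted):
    """Sensitivity and specificity of binary predictions against a binary reference."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def auc(reference, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as 0.5 (Mann-Whitney formulation).
    """
    pos = [s for r, s in zip(reference, scores) if r == 1]
    neg = [s for r, s in zip(reference, scores) if r == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative Likert scores only (not data from the study).
expert_scores = [4, 3, 4, 2, 4, 1]
model_scores = [4, 4, 3, 2, 4, 3]

reference = dichotomize(expert_scores)
predicted = dichotomize(model_scores)

sens, spec = sensitivity_specificity(reference, predicted)
auc_binary = auc(reference, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc_binary:.3f}")
```

When the prediction itself is binary, this AUC reduces to the mean of sensitivity and specificity, which is why a single operating point is enough to compute it in the sketch.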

Results

Twenty videos in two categories (10 clean field, 10 dissection phase; mean duration 100.4 ± 30.8 seconds) were analyzed. The best configuration was Gemini 2.5 Pro (50 frames, T=1), both for dichotomous quality assessment (sensitivity 40.9% [27.7-55.6%], specificity 77.8% [61.9-88.3%], AUC 0.593 [0.492-0.692]) and for visibility (AUC 0.663 [0.553-0.763]). Performance was stable across phases, with moderate ordinal accuracy (quality: exact agreement 36.2%, MAE 1.075; visibility: exact agreement 55.0%, MAE 0.838). Interobserver agreement was substantial for quality (ICC 0.736 [0.65-0.81]) and moderate for visibility (ICC 0.659 [0.55-0.75]).
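
For readers unfamiliar with the ordinal metrics, the following minimal sketch shows how exact agreement and mean absolute error (MAE) can be computed on 4-point Likert scores. The function names and the score lists are hypothetical and purely illustrative; they are not the study's code or data.

```python
def exact_agreement(reference, predicted):
    """Fraction of cases where the model score equals the reference score."""
    return sum(1 for r, p in zip(reference, predicted) if r == p) / len(reference)


def mean_absolute_error(reference, predicted):
    """Average absolute distance, in Likert points, between model and reference."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


# Illustrative Likert scores only (not data from the study).
reference_scores = [4, 3, 4, 2, 4, 1]
model_ratings = [4, 4, 3, 2, 4, 3]

exact = exact_agreement(reference_scores, model_ratings)
mae = mean_absolute_error(reference_scores, model_ratings)
print(f"exact agreement={exact:.1%} MAE={mae:.3f}")
```

Read this way, an MAE near 1 means the model is off by about one Likert point per rating on average.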

Conclusions

Generative AI models demonstrate promising capability in the automated assessment of surgical field quality of D2 lymphadenectomy. Gemini 2.5 Pro offered the best balance of accuracy and efficiency. Future efforts should focus on visibility-based training on multicenter datasets to develop standardized, objective quality control tools.