Aims
Endoscopic reporting has traditionally required physicians to manually document findings, procedural maneuvers, and recommendations—a time-consuming task that contributes to physician burnout. Recent advances in ambient clinical documentation have reduced this burden, enabling greater focus on high-value, skill-based activities. Emerging large language models (LLMs) with vision-language capabilities offer the potential to further streamline workflows and enhance diagnostic accuracy through automated image interpretation. This study evaluated the performance of a state-of-the-art vision-language LLM for endoscopic image classification and report generation.
Methods
We fine-tuned Google’s MedGemma-4B vision-language LLM using the HyperKvasir dataset, comprising 10,662 labeled gastrointestinal (GI) endoscopic images across 23 categories representing both anatomical landmarks and pathological findings from upper and lower GI examinations. Images were randomly divided into training/validation (80%; 7,686/854) and testing (20%; 2,122) sets. The model underwent supervised fine-tuning, and performance metrics were assessed before and after training.
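The abstract does not report the exact training recipe; the minimal sketch below illustrates one plausible supervised fine-tuning setup, assuming the public google/medgemma-4b-it checkpoint, the Hugging Face transformers and peft libraries, and an illustrative LoRA configuration (the adapter rank, target modules, and prompt wording are assumptions, not the authors' settings).

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

MODEL_ID = "google/medgemma-4b-it"  # assumed public checkpoint name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach low-rank adapters to the attention projections (illustrative
# hyperparameters; the abstract does not report the actual recipe).
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)

def build_example(image, label):
    """Format one labeled endoscopic image as a supervised chat turn."""
    messages = [
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Classify this endoscopic image."}]},
        {"role": "assistant", "content": [{"type": "text", "text": label}]},
    ]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, return_dict=True, return_tensors="pt")
    inputs["labels"] = inputs["input_ids"].clone()  # standard causal-LM target
    return inputs
```

Each labeled image is cast as a short instruction-following exchange whose assistant turn is the class label, so the standard next-token objective directly supervises the classification output.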
Results
Fine-tuning yielded large performance gains across the full test set (Figure 1). Across all test samples, overall accuracy increased from 0.13 at baseline to 0.86 after fine-tuning, with macro-averaged F1 rising from 0.07 to 0.54; discrimination approached ceiling performance, with the area under the receiver operating characteristic curve (AUROC) improving from 0.96 to 0.99. For labels with ≥100 training images, performance was uniformly strong, with both accuracy and macro-F1 near 0.97 and AUROC ≈0.996. In contrast, classes with <100 training images showed greater variability (Figure 2). Visually distinctive but infrequent findings, such as impacted stool and ulcerative colitis with Mayo endoscopic score 3, were nonetheless classified accurately despite limited sample sizes.
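The reported metrics correspond to standard multiclass computations; the sketch below shows one way to reproduce them with scikit-learn, assuming integer class identifiers per image and a per-class probability matrix from the model (the function name and argument layout are illustrative).

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """y_true, y_pred: integer class ids per image;
    y_score: array of shape (n_samples, 23) with per-class probabilities."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # One-vs-rest, macro-averaged multiclass AUROC.
        "auroc": roc_auc_score(y_true, y_score,
                               multi_class="ovr", average="macro"),
    }
```

Macro averaging weights all 23 classes equally, which explains why macro-F1 (0.54) lags overall accuracy (0.86) when rare classes are misclassified.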
Conclusions
Supervised fine-tuning of the LLM substantially enhanced endoscopic image classification, achieving near-ceiling discrimination for well-represented classes but more variable performance for underrepresented labels. These results support immediate utility for focused tasks such as Boston Bowel Preparation Scale grade assignment, anatomic landmark recognition, and polyp identification, with further gains anticipated as additional training data become available. Future work should incorporate unstructured textual inputs and free-text label prediction to better mirror clinical documentation. The framework’s scalability, adaptability to heterogeneous inputs, and output flexibility position it for clinically relevant applications (e.g., automated endoscopy reporting) and for academic use in large-scale curation of image repositories. Extension to multimodal data streams may enable predictive analytics (e.g., pathology forecasting), offering a promising avenue for translational research.