Aims
Multimodal large language models (MLLMs) have shown potential in assessing the Mayo Endoscopic Score (MES) for ulcerative colitis (UC). However, they primarily focus on single tasks and lack the capability for integrated analysis. To address these limitations, we developed a multitask system based on the Swin Transformer V2 architecture that provides more accurate MES grading, colonic segment localization, and annotation of inflamed intestinal mucosa areas.
Methods
MES classification and colonic segment localization models were trained on the LIMUC dataset (11276 images) and the Real-Colon dataset (622 images), respectively, with ten-fold cross-validation split by patient ID. An external validation set (402 images) from Wuxi People’s Hospital was independently annotated by three experienced experts; images whose MES scores reached unanimous consensus (283 images) formed the reference standard. The ImageNet-pretrained Swin Transformer V2 was trained with AdamW, a cosine annealing learning rate schedule, and standard data augmentation, using weighted cross-entropy loss for MES classification and cross-entropy loss for segment localization. Four contemporary MLLMs (GPT-4o, Gemini-2.5 Pro, Grok-4, Qwen-VL-Max) were included for comparison. Annotation of inflamed intestinal mucosa areas was demonstrated in a real-time OpenCV-based prototype that performs frame-by-frame prediction on colonoscopy video, and was evaluated by comparing occlusion-based annotation maps with the annotations of three experienced experts on the external dataset.
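The patient-level split, the unanimous-consensus filter, and the class weighting mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: the function names, the record layout, the round-robin fold assignment, and the inverse-frequency weighting scheme are all hypothetical, since the text only states that folds were based on patient ID and that the cross-entropy loss was weighted.

```python
from collections import Counter


def patient_level_folds(patient_ids, n_folds=10):
    """Assign each image to a fold so that all images of one patient share a fold.

    Assumption: patients are dealt round-robin in sorted order.
    """
    unique_patients = sorted(set(patient_ids))
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(unique_patients)}
    return [fold_of_patient[pid] for pid in patient_ids]


def unanimous_consensus(annotations):
    """Keep only images on which every expert gave the same MES grade.

    `annotations` maps an image name to the list of grades from the experts.
    Returns a dict of image name -> agreed grade.
    """
    return {
        image: grades[0]
        for image, grades in annotations.items()
        if len(set(grades)) == 1
    }


def inverse_frequency_weights(labels):
    """Per-class weights n / (k * count), one common choice for weighted cross-entropy.

    Assumption: the actual weighting scheme is not given in the text.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Toy usage with made-up identifiers and grades.
folds = patient_level_folds(["p1", "p1", "p2", "p3", "p3", "p3"], n_folds=3)
reference = unanimous_consensus({"a.png": [2, 2, 2], "b.png": [1, 2, 1]})
weights = inverse_frequency_weights([0, 0, 0, 1])
```

Grouping by patient keeps near-duplicate frames from the same examination out of both the training and validation folds, which would otherwise inflate the cross-validated accuracy.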
Results
On the public datasets, the Swin Transformer V2 model achieved an accuracy of 0.767 and an F1 score of 0.692, whereas the best accuracy among the four MLLMs was 0.443 (GPT-4o). The model retained good performance on the external dataset (accuracy = 0.654; F1 = 0.604), where the best accuracy among the four MLLMs was 0.594 (GPT-4o). For segment localization on the external dataset, the Swin Transformer V2 model obtained an accuracy of 0.526 and an F1 score of 0.508, outperforming the best MLLM, GPT-4o (accuracy = 0.220; F1 = 0.264). The model-annotated inflamed intestinal mucosa areas were clearly consistent with the expert-annotated areas.
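For readers who want to reproduce the two reported metrics, the sketch below computes accuracy and a macro-averaged F1 score from label lists. The averaging mode is an assumption: the text reports "F1" without saying whether it is macro, micro, or weighted. The function names and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging mode)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with made-up grades, not study data.
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

With imbalanced grade distributions, macro F1 can sit well below accuracy because rare grades count as much as common ones, which is one plausible reading of the gap between the two figures above.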
Conclusions
The Swin Transformer V2-based multitask system outperformed general-purpose MLLMs in MES classification, segment localization, and inflamed intestinal mucosa area annotation. This transparent model demonstrates strong potential to enhance UC endoscopic assessment and support clinicians across experience levels.