Aims
Capsule endoscopy reading is time-consuming and subject to inter-reader variability. Although artificial intelligence–assisted reading has improved diagnostic accuracy, its performance often fails to generalize across different clinical settings. This study aimed to develop a foundation model that improves both accuracy and generalizability in capsule endoscopy reading.
Methods
The Attention-based Token Mixing (AToM) model was developed by integrating two validated foundation models, BiomedCLIP and GastroNet. The model incorporates an Efficient Token Mixer and a Squeeze-and-Excitation Convolutional Neck to enhance feature integration and overall performance. It was evaluated on three datasets: two public datasets (Kvasir and Kvasir-Capsule) and one real-world dataset (DUMC). The primary analysis was conducted under a patient-level splitting protocol.
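The abstract does not specify implementation details. The sketch below is one plausible reading of the described design, assuming PyTorch, precomputed token features from the two backbones, and hypothetical layer choices; the EfficientTokenMixer, SEConvNeck, and AToM classes, and all dimensions, are illustrative assumptions rather than the authors' exact architecture.

```python
# Illustrative sketch only: layer choices and dimensions are assumptions,
# not the published AToM design.
import torch
import torch.nn as nn

class EfficientTokenMixer(nn.Module):
    """Mixes tokens across the two feature streams with lightweight attention."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.norm(tokens)
        mixed, _ = self.attn(x, x, x)   # token-to-token mixing
        return tokens + mixed           # residual connection

class SEConvNeck(nn.Module):
    """Squeeze-and-Excitation convolutional neck for channel recalibration."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),                        # squeeze over tokens
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1),  # excitation
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        return x * self.se(x)           # channel-wise reweighting

class AToM(nn.Module):
    """Fuses token features from two backbones (e.g. BiomedCLIP and GastroNet
    embeddings, assumed precomputed) and classifies the capsule frame."""
    def __init__(self, dim_a: int, dim_b: int, dim: int, num_classes: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim)  # project each stream to a shared dim
        self.proj_b = nn.Linear(dim_b, dim)
        self.mixer = EfficientTokenMixer(dim)
        self.neck = SEConvNeck(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_*: (batch, tokens, dim_*) token sequences from each backbone
        tokens = torch.cat([self.proj_a(feats_a), self.proj_b(feats_b)], dim=1)
        tokens = self.mixer(tokens)
        x = self.neck(tokens.transpose(1, 2)).transpose(1, 2)  # (B, C, T) for conv
        return self.head(x.mean(dim=1))  # pool over tokens, then classify

# Example usage with hypothetical shapes:
# model = AToM(dim_a=512, dim_b=768, dim=256, num_classes=2)
# logits = model(torch.randn(8, 197, 512), torch.randn(8, 197, 768))
```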
Results
AToM consistently outperformed conventional deep learning models, transformers, and recent foundation models across all evaluation metrics. On the Kvasir-Capsule dataset, AToM achieved an F1-score of 87.4%, an accuracy of 88.7%, and an MCC of 0.616, outperforming the latest foundation model (DINOv2) by 1.7% in F1-score, 1.6% in accuracy, and 0.08 in MCC (p < 0.05 for all). Notably, under image-level splitting, AToM reached an accuracy of 98.6% and an MCC of 0.987, again the highest performance among all models.
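For context on the reported MCC values: the Matthews correlation coefficient summarizes all four confusion-matrix cells into a single score in [-1, 1] and is robust to class imbalance. Its standard binary form is shown below; Kvasir-Capsule classification is multi-class, where the standard multiclass generalization of MCC applies.

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```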
Conclusions
The AToM model demonstrated the highest accuracy and strong generalization across different datasets compared with conventional models. Although overall accuracy decreased under the stricter patient-level split compared with the image-level split, AToM consistently maintained the best performance. These findings suggest that a foundation model could be feasibly implemented in real-world clinical settings for capsule endoscopy reading.