Aims
Video capsule endoscopy (VCE) produces tens of thousands of frames per examination, making manual annotation a major limitation for artificial intelligence (AI) development. Existing AI systems rely on frame-level labels, which are costly, time-consuming, and constrain dataset size. We aimed to evaluate whether video-level supervision using a multiple-instance learning (MIL) approach can reliably detect small-bowel bleeding in VCE. We hypothesised that assigning only a single pathology label per video is sufficient to achieve clinically meaningful performance and may allow the creation of substantially larger training datasets compared with frame-level annotation.
Methods
We collected capsule endoscopy recordings from two centres (University Hospital Dresden and Diakonissen Hospital Dresden) acquired between 2011 and 2025. All available videos with clearly interpretable findings were included; truncated or incomplete studies were excluded. A medical professional assigned binary video-level labels ("bleeding present" or "bleeding absent"). Videos were de-identified, overlaid text was removed, frames were extracted, and recordings were restricted to the small-bowel segment. For model development, all frames were encoded with a pretrained vision transformer to obtain 768-dimensional embeddings. These were aggregated into video-level bags and processed by a customised attention-based MIL classifier. Five-fold cross-validation was used throughout training, with 20% of each fold serving as an internal validation split; class imbalance was handled through a weighted loss. Internal validation was performed across PillCam and Olympus systems. External validation was conducted on an independent clinical dataset recorded with a new capsule system (NaviCam) as well as a dataset of the same capsule type obtained from another institution.
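The aggregation step described above can be sketched as follows. This is a minimal NumPy illustration of attention-based MIL pooling over a bag of 768-dimensional frame embeddings, not the study's actual implementation: the layer sizes, random weights, and function names here are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, n_frames = 768, 128, 200

# Placeholder (untrained) weights for the attention scorer and classifier.
W1 = rng.normal(0, 0.02, (dim, hidden))
w2 = rng.normal(0, 0.02, (hidden, 1))
wc = rng.normal(0, 0.02, (dim,))

def attention_mil(bag):
    """bag: (n_frames, dim) frame embeddings for one video."""
    scores = np.tanh(bag @ W1) @ w2          # per-frame attention logits, (n_frames, 1)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                          # softmax over frames: weights sum to 1
    z = (a * bag).sum(axis=0)                # attention-weighted video embedding, (dim,)
    logit = z @ wc                           # single video-level logit
    prob = 1.0 / (1.0 + np.exp(-logit))      # probability of "bleeding present"
    return prob, a

bag = rng.normal(size=(n_frames, dim))       # one video = one bag of frame embeddings
prob, attn = attention_mil(bag)
```

The attention weights `attn` indicate which frames drove the video-level prediction, which is what makes this pooling preferable to plain averaging for sparse findings such as bleeding.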
Results
A total of 595 videos across multiple capsule systems were available for training and internal validation, with additional datasets reserved for external testing. After preprocessing, all videos were successfully converted into de-identified small-bowel segments.
The attention-based MIL model achieved an AUROC of 0.85 and an accuracy of 0.80 in internal validation. Accuracy denotes the proportion of correctly classified videos for the binary task "bleeding" versus "no bleeding". No threshold tuning has been performed yet; accuracy was calculated by binarising predicted probabilities at a threshold of 0.5. External validation showed lower performance, with AUROC values of 0.69 for NaviCam videos and 0.75 for the Diakonissen dataset. These results should be considered preliminary, as both external datasets are currently undergoing optimisation, including improved preprocessing and re-evaluation of video-level labels.
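The accuracy computation described above reduces to binarising each video-level probability at 0.5 and comparing against the labels; a minimal sketch (with made-up illustrative numbers, not study data):

```python
import numpy as np

def accuracy_at_threshold(probs, labels, thr=0.5):
    """Fraction of videos whose thresholded prediction matches the label."""
    preds = (np.asarray(probs) >= thr).astype(int)
    return (preds == np.asarray(labels)).mean()

# Hypothetical predicted probabilities and ground-truth video labels.
probs = [0.9, 0.2, 0.6, 0.4, 0.8]
labels = [1, 0, 1, 1, 0]
acc = accuracy_at_threshold(probs, labels)  # 3/5 correct at thr=0.5 -> 0.6
```

Because no threshold tuning was performed, the reported accuracies depend on this default 0.5 cut-off; AUROC is threshold-independent and so is unaffected.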
Table 1. Model performance for bleeding detection in capsule endoscopy. Internal validation uses five-fold cross-validation; external validation comprises the NaviCam dataset (different capsule type, same hospital) and the Diakonissen dataset (external hospital, same capsule type).
|          | Internal validation (5-fold CV) |        |        |        |        |      | External validation |      |       |
|----------|--------|--------|--------|--------|--------|------|---------|------|-------|
|          | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | Avg. | NaviCam | Diak | Total |
| AUROC    | 0.91   | 0.82   | 0.85   | 0.86   | 0.79   | 0.85 | 0.69    | 0.75 | 0.75  |
| Accuracy | 0.84   | 0.84   | 0.78   | 0.77   | 0.80   | 0.80 | 0.70    | 0.83 | 0.80  |
Conclusions
Video-level supervision combined with multiple-instance learning enables accurate bleeding detection in capsule endoscopy without frame-level annotation. This approach substantially reduces annotation workload, facilitates the creation of larger multicentre training cohorts, and may improve the scalability of future VCE AI systems. While the external validation results indicate room for optimisation, they also demonstrate the feasibility of generalising video-level MIL models across institutions and capsule types. MIL-based video-level labelling represents a promising pathway toward clinically deployable, resource-efficient AI tools in small-bowel diagnostics and may be extended to additional VCE pathologies.