Bowel preparation quality is a key determinant of colonoscopy effectiveness, influencing mucosal assessment and lesion detection. The Boston Bowel Preparation Scale (BBPS) is a widely used scoring system, yet its categorical and subjective nature limits granularity and consistency. This study aims to develop and validate a deep learning model for automated, objective, and continuous assessment of bowel cleanliness using real-time image analysis.
A retrospective dataset of colonoscopy videos from nine international centres (n=1128; 2020–2022; white light imaging) was used to train a DeepLabV3+ model¹ for seven-class scene segmentation (stool, water, clean mucosa, instruments, borders, shadows, glare). For validation, the trained model was applied to an independent single-centre dataset of colonoscopy videos (n=428; 2023–2024; white light and narrow band imaging). The model processed these videos by extracting one frame per second and generating a segmentation output for each extracted frame. From these outputs, we derived AI cleanliness metrics: frame-wise fractions of stool, water, and clean mucosa, and a novel Normalised Stool Index (NSI), defined as the proportion of the total mucosa labelled as stool, which is less biased by artefacts than the stool fraction. All metrics were computed only from instrument-free withdrawal frames, matching BBPS scoring conventions, and averaged per procedure. Associations between total BBPS scores from two raters (R1, R2; quadratic-weighted κ=0.22; inter-rater ρ=0.221; weak inter-rater reliability) and mean stool, water, and clean mucosa fractions, as well as NSI, were assessed (Spearman’s ρ with false-discovery-rate correction). Clinical relevance was examined by correlating mean NSI, water fraction, and ΔNSI (withdrawal minus insertion) with lesion count and minimum lesion size, and by assessing the ability of the AI cleanliness metrics to predict the presence of at least one adenoma (AUC and logistic regression). Expected behaviour was assessed by examining NSI changes between scope insertion and withdrawal and around automatically detected rinsing events (paired Wilcoxon signed-rank test).
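The derivation of the frame-wise metrics and their per-procedure averages can be illustrated with a minimal Python sketch. It is not the study's implementation and rests on several assumptions that the abstract does not confirm: the integer label encoding is hypothetical, "total mucosa" is taken to mean stool plus clean-mucosa pixels (so NSI = stool / (stool + clean mucosa)), a frame counts as instrument-free when it contains no instrument pixels, and the caller is assumed to pass withdrawal-phase frames only.

```python
from collections import Counter
from statistics import mean

# Hypothetical label encoding: the seven classes are named in the text,
# but their integer indices are an assumption made for this sketch.
STOOL, WATER, CLEAN_MUCOSA, INSTRUMENT, BORDER, SHADOW, GLARE = range(7)

METRIC_KEYS = ("stool_fraction", "water_fraction", "clean_fraction", "nsi")


def frame_metrics(mask):
    """Compute cleanliness metrics for one mask (rows of integer labels)."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    if total == 0:
        return None  # empty frame, nothing to measure
    stool, clean = counts[STOOL], counts[CLEAN_MUCOSA]
    # Assumption: "total mucosa" = stool + clean mucosa pixels, so glare,
    # shadows, and borders do not dilute the index.
    mucosa = stool + clean
    return {
        "stool_fraction": stool / total,
        "water_fraction": counts[WATER] / total,
        "clean_fraction": clean / total,
        "nsi": stool / mucosa if mucosa else None,
        "has_instrument": counts[INSTRUMENT] > 0,
    }


def procedure_means(masks):
    """Average each metric over the instrument-free frames of one procedure."""
    frames = [m for m in map(frame_metrics, masks)
              if m is not None and not m["has_instrument"]]
    result = {}
    for key in METRIC_KEYS:
        values = [f[key] for f in frames if f[key] is not None]
        result[key] = mean(values) if values else None
    return result
```

Under these assumptions, a frame in which half the pixels are glare, a quarter stool, and a quarter clean mucosa has a stool fraction of 0.25 but an NSI of 0.5, which illustrates why normalising by visible mucosa is less sensitive to artefacts than the raw stool fraction.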
The DeepLabV3+ model achieved strong segmentation performance (sensitivity 92.7%, specificity 98.1%, mean Dice 93.8%), enabling reliable extraction of the AI cleanliness metrics. Withdrawal NSI showed the strongest association with BBPS, with significant correlations for both raters (R1: ρ=–0.21; R2: ρ=–0.12; both p<0.02). Stool fraction showed a similar pattern (R1: ρ=–0.19; R2: ρ=–0.12; both p<0.02). Water fraction correlated positively with BBPS for R2 only (ρ=0.17; p<0.001), and clean mucosal fraction showed no significant association. The cleanliness metrics showed no significant correlations with lesion count or minimum lesion size (all |ρ|<0.12; p>0.10) and did not discriminate adenoma presence (AUC≈0.5). Logistic regression yielded non-significant odds ratios (OR) for all tested metrics (ORs 0.92–1.12; p>0.2). NSI decreased significantly after rinsing events (mean ΔNSI=–0.77%; p<0.001) and improved between insertion and withdrawal (mean ΔNSI=–1.59%; p<0.05).
The segmentation model performed strongly across all classes, enabling reliable extraction of the AI cleanliness metrics. NSI had the strongest association with BBPS; stool and water fractions showed weaker trends. These modest correlations are expected given BBPS’s limited inter-rater agreement and its selective, segment-based scoring, which does not account for the effects of rinsing. Clinical outcome analysis was constrained by the cohort’s high adequacy rate (98%) and the use of procedure-level averages, which cannot reflect visibility at specific lesion sites. The AI cleanliness metrics behaved as expected, with NSI improving from insertion to withdrawal and decreasing in response to rinsing.
Together, these findings show that the novel AI system provides a continuous and detailed quantification of bowel cleanliness, capturing real-time changes and offering scene context that may support automated reporting and enhance downstream tools such as polyp detection. Further work in procedures with greater variability in preparation quality is needed, but the present results establish a strong foundation for more objective, reproducible, and clinically meaningful cleanliness assessment.