Central reading reveals high inter-rater variability in Boston Bowel Preparation Scale scoring: results from a prospective randomised trial
Poster Abstract

Aims

The Boston Bowel Preparation Scale (BBPS) is a widely used, validated tool to assess bowel cleanliness, but its reliability is limited by inter-observer variability among endoscopists. Central reading may reduce this variability and improve scoring consistency in clinical trials.

Methods

We prospectively included 494 patients scheduled for colonoscopy at UZ Leuven (Leuven, Belgium) to assess the impact of a low-fiber dinner the evening before colonoscopy. Participants were randomized (1:1) to either a standard regimen or a more lenient regimen allowing a low-fiber dinner up to 2 hours before the start of the bowel preparation procedure. In this analysis, we compared the assessment of adequate (≥6) and optimal (8-9) BBPS scores by the endoscopist and by an independent central reader who was not involved in the procedure. BBPS scoring performance was compared using Fisher's exact test; inter-observer agreement was calculated using weighted Cohen's κ statistics.
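For illustration, the weighted Cohen's κ used here for inter-observer agreement can be computed from the two raters' score lists. The sketch below is a minimal, self-contained implementation with linear disagreement weights; the function name, example scores, and category range are illustrative, not taken from the study data.

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    With linear weights, disagreement of |i - j| categories is penalized
    proportionally; "quadratic" squares that penalty. Returns 1.0 for
    perfect agreement, 0.0 for chance-level agreement.
    """
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions: obs[i][j] = P(rater A gives i, B gives j)
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1.0 / n

    # Marginal proportions for each rater
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    # Disagreement weight between category indices i and j
    def w(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected


# Hypothetical example: two raters scoring total BBPS (0-9) on five cases
endoscopist = [9, 8, 9, 7, 6]
central     = [8, 8, 6, 7, 5]
kappa = weighted_kappa(endoscopist, central, categories=list(range(10)))
```

In practice this calculation is typically delegated to a statistics package (e.g. `sklearn.metrics.cohen_kappa_score` with `weights="linear"`), which also handles confidence intervals via resampling or analytic approximations.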

Results

By endoscopist scoring, adequate bowel preparation was achieved in 99.0% (489/494) and optimal bowel preparation in 98.0% (484/494) of cases.

After central reading, the rates of adequate and optimal bowel preparation decreased significantly to 86.4% (427/494) and 64.6% (319/494), respectively (p<0.00001).

Of the 494 BBPS scorings, 217 (43.9%) were concordant between endoscopist and central reader (κ=0.154, 95% CI 0.044-0.244). Based on endoscopist scoring, 4 patients had a right-sided BBPS of 0-1 (1 in the control and 3 in the intervention group), but only 2 were scored equally after central reading (of the other two, one had a total BBPS of 2 and one a total BBPS of 3). Overall, central readers scored 6 patients with a right-sided BBPS of 0-1 (1.2%), whereas only 2 were scored similarly by the endoscopists (0.4%).

Conclusions

Validated bowel preparation scoring systems such as the BBPS are well integrated into practice but prone to high inter-rater disagreement (up to 56.1% in this study), potentially affecting clinical trial outcomes and quality assessment. Objective and continuous quantification by artificial intelligence may overcome this weakness and improve future practice and research.