Aims
Randomized controlled trials (RCTs) comparing computer-aided detection (CADe) systems with standard high-definition colonoscopy exhibit marked variability in adenoma detection rate (ADR) across control (non-AI) arms. Since baseline ADR directly determines the magnitude of the apparent benefit of CADe, understanding the sources of variability in human ADR is essential for interpreting AI-assisted performance. This study aimed to quantify the pooled ADR of control arms in CADe RCTs and to identify the study-level factors explaining its heterogeneity through meta-regression.
Methods
We included all control groups from RCTs comparing standard colonoscopy with any CADe platform. The primary outcome was pooled ADR, calculated using a random-effects model. Secondary outcomes included polyp detection rate (PDR) and adenoma miss rate (AMR). Heterogeneity was quantified using the inconsistency index (I²). Mixed-effects meta-regression explored associations between ADR and predefined moderators including patient age, sex, geographic region, study design, clinical setting, endoscopist experience, baseline ADR, and withdrawal time.
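The pooling step described above can be illustrated with a minimal sketch. The abstract does not state which transformation or between-study variance estimator was used, so the logit transformation and the DerSimonian-Laird estimator below are assumptions, and the function name `pool_proportions` and the example counts are hypothetical, not study data.

```python
import math

def pool_proportions(events, totals):
    """Random-effects pooling of proportions (assumed: logit scale, DerSimonian-Laird)."""
    y, v = [], []
    for e, n in zip(events, totals):
        # Continuity correction only when a study has 0% or 100% events.
        if e == 0 or e == n:
            e, n = e + 0.5, n + 1.0
        y.append(math.log(e / (n - e)))        # logit of the study proportion
        v.append(1.0 / e + 1.0 / (n - e))      # approximate within-study variance

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1

    # Between-study variance (tau^2) and inconsistency index (I^2, in percent).
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": expit(mu),
        "ci_low": expit(mu - 1.96 * se),
        "ci_high": expit(mu + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }

# Hypothetical control arms: adenoma-positive patients and patients examined.
result = pool_proportions(events=[10, 50], totals=[100, 100])
```

Back-transforming the pooled logit and its confidence limits keeps the interval inside 0 to 1, which is one common reason for pooling proportions on a transformed scale.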
Results
Forty-two control arms were analyzed. The pooled ADR was 0.35 (95% CI, 0.30–0.41), with considerable heterogeneity (I² = 96.8%). The pooled PDR was 0.50 (95% CI, 0.44–0.55), while the pooled AMR was 0.37 (95% CI, 0.33–0.40), again with considerable heterogeneity (I² = 97.0%). In meta-regression, age (p<0.001), male sex proportion (p=0.005), geographic region (p=0.009), clinical setting (p=0.001), baseline ADR (p=0.005), and withdrawal time (p<0.001) were significantly associated with variability in human ADR. Individual covariates explained up to 89.7% of the between-study variance (baseline ADR, reported in 11 studies).
Table 1. Meta-regression analysis investigating study-level factors associated with ADR.
| Covariate | Number of studies | Beta coefficient ± SE | Adjusted R² (%) | P value |
|---|---|---|---|---|
| Age | 42 | 1.017 ± 0.002 | 67.0 | <0.001 |
| Sex (male) | 42 | 1.006 ± 0.002 | 15.3 | 0.005 |
| Region | 42 | 0.887 ± 0.039 | 20.9 | 0.009 |
| Study design (monocentric vs multicentric) | 42 | 1.089 ± 0.050 | 7.0 | 0.070 |
| Analysis (PP vs ITT) | 42 | 0.950 ± 0.048 | -1.0 | 0.320 |
| Setting (screening/surveillance vs diagnostic) | 41 | 0.865 ± 0.037 | 26.8 | 0.001 |
| Screening population ≥50% | 38 | 1.035 ± 0.048 | -0.6 | 0.466 |
| Endoscopist training level | 40 | 1.012 ± 0.030 | -3.3 | 0.698 |
| Baseline ADR | 11 | 1.490 ± 0.163 | 89.7 | 0.005 |
| Withdrawal time | 41 | 1.030 ± 0.005 | 61.2 | <0.001 |
Abbreviations: SE = standard error; R² = relative reduction in between-study variance, i.e. the proportion of between-study variance explained by the covariate; PP = per protocol; ITT = intention to treat.
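The R² definition in the footnote can be made concrete with a short sketch. It assumes the usual formulation, in which the between-study variance (τ²) estimated without the covariate is compared with the residual τ² estimated with it. The abstract does not give the exact formula behind its "adjusted" values, and the function name and τ² inputs below are hypothetical.

```python
def variance_explained(tau2_null, tau2_model):
    """Percent of between-study variance explained by a covariate (R² analog).

    tau2_null:  tau^2 from the model without the covariate.
    tau2_model: residual tau^2 from the model including the covariate.
    """
    if tau2_null <= 0:
        raise ValueError("no between-study variance to explain")
    return 100.0 * (tau2_null - tau2_model) / tau2_null

# Hypothetical tau^2 values, chosen only to illustrate the sign of the result.
strong = variance_explained(0.50, 0.165)   # covariate absorbs most of the variance
negative = variance_explained(0.50, 0.505) # residual tau^2 exceeds the null estimate
```

Under this definition a negative value, such as those reported for analysis type, screening population, and endoscopist training level, arises when the residual τ² estimate with the covariate is slightly larger than the estimate without it, which indicates that the covariate explains essentially none of the heterogeneity.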
Conclusions
ADR in standard colonoscopy arms of CADe RCTs is highly heterogeneous, reflecting demographic, contextual, and procedural quality differences across studies. This extreme baseline variability substantially shapes the observed effect size of CADe systems, highlighting the need to contextualize AI performance against the intrinsic variability of human detection.