Aims
Accurate assessment of technical difficulty in acute upper gastrointestinal bleeding (AUGIB) endoscopy is essential for training, triage, and service planning. The S2T2 scoring tool integrates four domains (Setting, Site, Type, and Treatment) to generate a total score from 0 to 10, which stratifies cases as easy, moderate, or difficult. This study evaluated the inter- and intra-observer reliability of the S2T2 tool among endoscopy trainees and trainers and explored its relationship with perceived educational value (EDV).
Methods
Twenty-three raters (12 trainees, 11 trainers) scored 15 anonymised AUGIB cases (5 easy, 5 moderate, and 5 difficult) in two rounds held two weeks apart. Case order was randomised and blinded between rounds. Each rater independently applied the S2T2 score to each case. Inter- and intra-observer reliability of total S2T2 scores was assessed using intraclass correlation coefficients (ICCs; two-way, agreement, single measures), stratified by case difficulty and by rater type. Domain-level ICCs were calculated for each S2T2 component. Intra-observer agreement between rounds was evaluated using a Bland–Altman plot and summarised using mean absolute score differences. Inter-rater agreement on perceived EDV ratings was assessed using Fleiss’ κ, and the association between S2T2 scores and perceived EDV was analysed using Spearman’s rank correlation.
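The ICC form named above (two-way, absolute agreement, single measures) can be computed from a cases-by-raters score matrix using two-way ANOVA mean squares. The following is a minimal Python sketch under stated assumptions, not the study's analysis code: the function name `icc_agreement_single` and the toy scores are hypothetical, confidence intervals are omitted, and every rater is assumed to have scored every case.

```python
from statistics import mean


def icc_agreement_single(scores):
    """Two-way, absolute-agreement, single-measures ICC.

    `scores` is a list of rows, one row per case and one column per rater.
    Hypothetical helper for illustration; assumes a complete matrix
    with no missing ratings and some between-case variation.
    """
    n = len(scores)        # number of cases
    k = len(scores[0])     # number of raters
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Partition the total sum of squares into cases, raters, and residual.
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Rater variance stays in the denominator, so systematic differences
    # between raters lower the coefficient (agreement, not consistency).
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + (k / n) * (ms_cols - ms_err)
    )


# Toy data only (5 cases x 3 raters); these are not study scores.
toy_scores = [
    [1, 2, 1],
    [3, 3, 4],
    [5, 6, 5],
    [8, 7, 8],
    [9, 10, 9],
]
print(round(icc_agreement_single(toy_scores), 3))
```

Stratified estimates, such as by case difficulty or rater type, would follow by passing the corresponding subset of rows or columns to the same function.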
Results
Overall inter-observer reliability was excellent, with an ICC of 0.938 (95% CI: 0.888–0.974) in Round 1 and 0.919 (95% CI: 0.856–0.966) in Round 2. Stratified analyses showed strong agreement for difficult cases (ICC > 0.8) and easy cases (ICC approximately 0.72–0.86), while moderate cases showed greater variability, particularly among trainees (ICC 0.41 vs 0.25). Intra-observer reliability across rounds was also excellent (ICC = 0.938, 95% CI: 0.924–0.950), with a mean absolute score difference of 0.51 (SD 0.99). The Bland–Altman plot demonstrated minimal bias and narrow limits of agreement. Domain-specific reliability was highest for Treatment (ICC = 0.985), followed by Setting (0.910), Type (0.890), and Site (0.657). EDV ratings demonstrated substantial inter-rater agreement (Fleiss’ κ = 0.757), and S2T2 scores showed a strong positive correlation with EDV (Spearman’s ρ = 0.81, p < 0.001).
Conclusions
The S2T2 tool is a reliable and valid measure of technical difficulty in AUGIB endoscopy, demonstrating excellent inter- and intra-observer agreement, particularly in clearly easy or difficult cases. The strong correlation between S2T2 scores and perceived educational value supports its use for case selection in training, performance assessment and certification, and service benchmarking.