Aims
The ESGE STAR Project provides a 55-item standardized framework for upper gastrointestinal (GI) endoscopy reporting, based on expert consensus in the absence of high-quality evidence. Whether artificial intelligence (AI) systems can apply these structured criteria remains unknown. This study compared the adherence of ChatGPT-Audio, a voice-dictation AI system, with that of expert endoscopists when both were explicitly instructed to follow the STAR recommendations.
Methods
A total of 104 esophagogastroduodenoscopy (EGD) reports were analyzed: 52 generated by ChatGPT-Audio (ChatGPT 5.1 with voice-dictation input) and 52 written by expert endoscopists, all produced under explicit instruction to follow the 55-item ESGE STAR checklist. Two blinded senior reviewers independently evaluated each report for item-level compliance across the pre-, intra-, and post-procedural domains. The primary outcome was the overall compliance rate; secondary outcomes were domain-specific compliance, the number of missing items per report, inter-rater reliability (Cohen's κ), and ROC analysis.
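For orientation, the analysis implied by these outcomes can be sketched as follows. This is an illustrative Python sketch, not the study's actual code: the binary item-score matrices, reviewer arrays, and placeholder data are assumptions, and the ROC target (discriminating expert from AI reports by per-report compliance) is inferred, since the abstract does not state it explicitly.

import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score, roc_auc_score, roc_curve

# Placeholder data standing in for the scored reports (binary: item
# documented or not); rows = reports, columns = the 55 STAR items.
rng = np.random.default_rng(0)
ai_scores = rng.binomial(1, 0.876, size=(52, 55))       # ChatGPT-Audio
expert_scores = rng.binomial(1, 0.931, size=(52, 55))   # endoscopists

# Per-report compliance: percentage of the 55 items documented.
ai_compliance = ai_scores.mean(axis=1) * 100
expert_compliance = expert_scores.mean(axis=1) * 100

# Between-group comparison of overall compliance.
t_stat, p_value = stats.ttest_ind(ai_compliance, expert_compliance)

# Inter-rater reliability: Cohen's kappa between the two blinded
# reviewers over all flattened item-level judgments (simulated here).
reviewer_a = rng.binomial(1, 0.9, size=104 * 55)
reviewer_b = reviewer_a.copy()
disagree = rng.random(reviewer_b.size) < 0.05           # ~5% disagreement
reviewer_b[disagree] = 1 - reviewer_b[disagree]
kappa = cohen_kappa_score(reviewer_a, reviewer_b)

# ROC: per-report compliance as a score for expert (1) vs AI (0) origin
# (assumed classification target).
labels = np.r_[np.zeros(52), np.ones(52)]
scores = np.r_[ai_compliance, expert_compliance]
auc = roc_auc_score(labels, scores)
fpr, tpr, thresholds = roc_curve(labels, scores)
best_threshold = thresholds[np.argmax(tpr - fpr)]       # Youden's J
print(f"p={p_value:.3f}, kappa={kappa:.2f}, AUC={auc:.2f}, "
      f"threshold>={best_threshold:.0f}%")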
Results
Overall compliance was 87.6 ± 4.8% for ChatGPT-Audio and 93.1 ± 3.5% for experts (p = 0.002). ChatGPT-Audio missed 6.8 ± 3.2 items per report compared with 3.7 ± 2.1 for experts. Domain-specific compliance (ChatGPT-Audio vs experts) was:
• Pre-procedural: 81% vs 94%
• Intra-procedural: 92% vs 95%
• Classification usage (LA, Prague, Forrest, Paris): 89% vs 94%
• Post-procedural: 76% vs 90%
Inter-rater agreement was excellent (κ = 0.84). ROC analysis identified an optimal compliance threshold of ≥90% (AUC = 0.86).
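For reference, the inter-rater agreement statistic reported above is Cohen's κ, defined from the observed agreement p_o between the two reviewers and the agreement p_e expected by chance:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

A value of κ = 0.84 thus means the reviewers agreed 84% of the way between chance-level and perfect agreement; values above 0.80 are conventionally interpreted as excellent (almost perfect) agreement.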
Conclusions
When explicitly instructed to apply the ESGE STAR criteria, ChatGPT-Audio produced highly structured and guideline-conformant EGD reports, approaching expert performance. Human endoscopists nonetheless demonstrated superior compliance, particularly regarding pre- and post-procedural elements. This study provides the first controlled assessment of an AI voice-dictation system’s ability to apply the ESGE STAR reporting standard and supports its potential role in advancing standardized endoscopy documentation.