Aims
This study assesses the effectiveness of large language models (LLMs) in generating lay summaries for patient education on the management of precancerous lesions and early neoplasia in the stomach.
Methods
In this pilot study, we used a two-period, crossover, blinded design to compare two summary versions: a ChatGPT-4o summary and a summary from Digestive Cancers Europe (DiCE). Two panels provided ratings: an expert physician panel and a patient panel composed of members of the DiCE Patient Advisory Committee (PAC). Experts rated accuracy (6-point scale) and completeness, comprehensibility, and satisfaction (5-point scales each) for each of five sections of the summary. Patients rated overall summary completeness, comprehensibility, and satisfaction. The paired design allowed all raters to serve as their own controls. Results are reported as medians with ranges and interquartile ranges (IQRs), and p values for between-summary comparisons were derived from mixed-effects model estimates. Objective readability was assessed with the Flesch–Kincaid Grade Level (FKGL) and the SMOG Index.
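For context, both readability indices follow commonly published formulas (stated here for reference only; the study text does not reproduce them):
\[
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
\]
\[
\mathrm{SMOG} = 1.0430\sqrt{\text{polysyllable count}\times\frac{30}{\text{sentence count}}} + 3.1291
\]
Both indices approximate the US school grade level needed to understand a text; patient-education guidance generally targets roughly a sixth- to eighth-grade level.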
Results
Median expert ratings were similar between the two summaries across all metrics. For the overall summary, median (range; IQR) scores were: accuracy 5 (4–6; 0.75) for ChatGPT-4o vs 5 (3–6; 1) for DiCE (p=0.102); completeness 4 (3–5; 1) vs 4 (2–5; 1) (p=0.272); comprehensibility 4 (3–5; 1) vs 4 (2–5; 1) (p=0.329); and satisfaction 4 (2–5; 1) vs 3 (1–5; 2) (p=0.329). Patient ratings mirrored those of the experts, with very similar results for completeness, comprehensibility, and satisfaction. Neither summary met guideline readability recommendations on either the FKGL or the SMOG Index; the DiCE summary scored at a lower reading grade level than the ChatGPT-4o summary (FKGL: 11.16 vs 12.92; SMOG: 12.96 vs 14.84).
Conclusions
ChatGPT-4o produced patient education materials comparable to the DiCE summary, but both require optimization for readability; a human-in-the-loop workflow is advisable, and further evaluation across prompts and models is warranted.