COLON-AI: Evaluating the Reliability and Readability of AI Chatbots in Colonoscopy Counseling
Poster Abstract

Aims

Adequate bowel preparation critically depends on the delivery of clear and comprehensible pre-procedural information. Although a recent study assessed the effectiveness of AI chatbots (Google Gemini and ChatGPT) in providing colonoscopy instructions, comparative evidence including the most widely disseminated chatbot, Meta AI, is still lacking, and no data are available on the readability of the texts these systems generate. The aim of the present study was to systematically evaluate the performance and readability of the most widely accessible AI chatbots in generating patient-facing colonoscopy preparation instructions.

Methods

We conducted a prospective, single-center observational study at Maggiore Hospital Bologna between August and September 2025, in accordance with the METRICS checklist. Frequently asked questions (FAQs) about colonoscopy were collected from Italian hospital websites and Quora and submitted to Meta AI, ChatGPT, and Google Gemini. The chatbot-generated responses were independently rated by five gastroenterologists, blinded to the chatbot platform, on a 1–5 Likert scale for accuracy, completeness, clarity, evidence-based content, and absence of irrelevant information (relevance). Text readability was assessed with the Flesch Reading Ease Score (FRES), computed using the Readability Formulas® tool and expressed on a 0–100 scale, with higher values indicating easier-to-read text. In addition, a subgroup performance analysis was conducted by categorizing questions into pre-procedural (e.g., split-dose bowel preparation), intra-procedural (e.g., how polypectomy is performed), and post-procedural aspects (e.g., post-sedation precautions).
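For reference, the standard Flesch Reading Ease formula (which, we assume, underlies the Readability Formulas® output) is:

FRES = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

On this scale, scores of 0–30 are conventionally interpreted as very difficult (college-graduate reading level), while 60–70 corresponds to plain English.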

Results

Google Gemini achieved the highest overall performance, with mean scores of 3.14 for accuracy, 3.12 for completeness, 3.13 for clarity, 2.91 for evidence-based content, and 2.99 for relevance, outperforming ChatGPT and Meta AI. The most satisfactory results were observed for pre-procedural questions compared with intra- and post-procedural ones, especially for accuracy, completeness, and relevance. ChatGPT ranked second, with slightly lower yet consistent values (accuracy 2.96, completeness 2.93, clarity 2.95, evidence-based content 2.80, relevance 2.93), performing particularly well in delivering accurate intra-procedural responses. Conversely, Meta AI showed the weakest performance, with significantly lower scores (accuracy 2.62, completeness 2.56, clarity 2.62, evidence-based content 2.60, relevance 2.66). Regarding readability, ChatGPT (FRES 13.68) generated more understandable texts than Google Gemini (6.35) and Meta AI (4.21), although all scores fell within the "very difficult" band of the FRES scale. Post-procedural questions yielded the most accessible answers, with ChatGPT showing the best readability (17.17), followed by Google Gemini (12.00) and Meta AI (6.50). In contrast, intra-procedural queries produced the least readable results, with ChatGPT, Google Gemini, and Meta AI scoring 11.00, 1.86, and 2.71, respectively. Overall, Google Gemini provided the best response quality, ChatGPT showed slightly lower performance but better readability, and Meta AI yielded globally less favorable results.

Table. Mean Likert scores (1–5) by chatbot and question category.

                 ChatGPT                          Meta AI                          Google Gemini
                 Acc   Comp  Clar  Evid  Rel      Acc   Comp  Clar  Evid  Rel      Acc   Comp  Clar  Evid  Rel
Pre-procedure    2.96  2.88  2.99  2.79  2.97     2.49  2.37  2.49  2.41  2.59     3.22  3.18  3.18  2.94  3.10
Intra-procedure  3.00  3.05  3.00  2.79  2.95     2.67  2.62  2.67  2.64  2.64     3.12  3.17  3.19  2.98  2.93
Post-procedure   2.92  2.86  2.86  2.83  2.86     2.69  2.69  2.69  2.75  2.75     3.08  3.03  3.03  2.81  2.94
Overall          2.96  2.93  2.95  2.80  2.93     2.62  2.56  2.62  2.60  2.66     3.14  3.12  3.13  2.91  2.99

Acc = accuracy; Comp = completeness; Clar = clarity; Evid = evidence-based content; Rel = relevance.

Conclusions

This is the first study to evaluate both the accuracy and readability of colonoscopy-related information generated by the most widely used chatbots. Further studies involving patient participation are warranted to confirm the clinical applicability of these models in real-life settings.