Aims
Colorectal cancer (CRC) screening reduces incidence and mortality, yet participation remains suboptimal worldwide. Large language models (LLMs) such as ChatGPT may help overcome communication barriers by providing accessible, multilingual information, but their performance across languages and cultural contexts has not been systematically evaluated. This study aims to assess ChatGPT’s accuracy, completeness, and comprehensibility in answering CRC screening questions across 23 languages and 28 countries, and to evaluate cross-linguistic variability in performance.
Methods
Between April and June 2025, we conducted a cross-continental study spanning 28 countries and 23 languages. A standardized set of 15 CRC screening–related questions was manually translated by native-speaking researchers to preserve linguistic accuracy and cultural nuance. The translated questions were submitted to ChatGPT (GPT-4o), and the generated responses were independently assessed by 140 senior gastroenterologists (five per country). Each response was rated for accuracy, completeness, and comprehensibility on a 5-point Likert scale. Statistical analyses included t-tests, Chi-square tests, and two-way ANOVA.
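The aggregation of expert ratings described above (per-language mean ± SD, and the share of languages reaching a "high-quality" score of ≥4) can be sketched as follows. All language names and rating values below are illustrative placeholders, not the study's data.

```python
from statistics import mean, stdev

# Illustrative placeholder Likert ratings (1-5), NOT the study's data:
# each language maps to a list of expert ratings for one domain (e.g., accuracy).
ratings = {
    "Italian": [5, 4, 5, 4, 5],
    "Dutch":   [3, 3, 4, 3, 3],
    "Swahili": [5, 5, 4, 4, 5],
}

# Per-language mean +/- SD, as reported in the Results section.
summary = {lang: (mean(r), stdev(r)) for lang, r in ratings.items()}

# Proportion of languages whose mean rating meets the "high-quality" cutoff (>= 4).
high_quality_share = sum(1 for m, _ in summary.values() if m >= 4) / len(summary)
```

Between-language comparisons such as the two-way ANOVA reported in the study would typically be run with a statistics package (e.g., `statsmodels`) rather than the standard library; this sketch covers only the descriptive summaries.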
Results
In total, 2,100 expert ratings were collected across six continents. Overall mean scores (±SD) were 4.1 ± 1.0 for accuracy, 4.1 ± 1.0 for completeness, and 4.2 ± 0.9 for comprehensibility. High-quality performance (score ≥4) was observed in 73.9%, 86.9%, and 82.6% of languages for the three domains, respectively. Italian, Turkish, Swahili, and Japanese achieved the highest ratings, while Traditional Chinese, Dutch, and Greek consistently showed lower performance across all domains. Significant differences between languages were found for each metric (P < 0.001).
Subtle intra-language variability was also observed for languages spoken in multiple countries (e.g., Dutch in Belgium vs. the Netherlands; English in the UK, US, and Australia), though these differences were not statistically significant. Question-level analysis revealed broad score dispersion across languages, with most ratings falling between 3 and 5.
Conclusions
ChatGPT demonstrated a strong ability to answer CRC screening questions in multiple languages, supporting its promise as a multilingual patient-education tool. However, performance varied significantly across languages, highlighting the need for language-specific validation before widespread clinical implementation.