Aims
Histopathological interpretation is crucial for diagnosing inflammatory bowel disease (IBD), distinguishing among Crohn's disease (CD), ulcerative colitis (UC), IBD-unclassified (IBD-U), and non-IBD colitis (NIBDC). However, interobserver variability and limited expertise can reduce diagnostic accuracy. Accurate evaluation of both endoscopic and histological reports is essential for optimal IBD diagnosis. Large Language Models (LLMs) such as GPT-5 may offer clinical support in interpreting combined endoscopic–histological reports.
Methods
We analyzed 100 real-life combined endoscopic and histological reports from ileo-colonoscopies, equally representing CD, UC, IBD-U, and NIBDC, collected across five Italian healthcare centers, including both IBD-specialized and non-specialized hospitals. A reference standard was established by an expert pathologist. Independent classifications were generated by GPT-5, five gastrointestinal pathologists, five IBD-expert gastroenterologists (GIs), and five non-expert GIs. Diagnostic performance (accuracy, recall, precision, F1-score), agreement with the reference standard (Cohen’s κ), and inter-rater reliability (Fleiss’ κ) were assessed.
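The performance metrics listed above can be illustrated with a short sketch. This is not the study's code: the toy `truth`/`pred` labels and the function names are invented for illustration, and the formulas are the standard definitions of per-class recall, precision, F1-score, and Cohen's κ.

```python
# Illustrative sketch (not the study's code or data): standard definitions
# of accuracy, per-class recall/precision/F1, and Cohen's kappa against a
# reference standard. Labels follow the abstract; the data are toy examples.
from collections import Counter

LABELS = ["CD", "UC", "IBD-U", "NIBDC"]

def per_class_metrics(truth, pred, label):
    """Recall, precision, and F1-score for one diagnostic class."""
    tp = sum(t == label and p == label for t, p in zip(truth, pred))
    fn = sum(t == label and p != label for t, p in zip(truth, pred))
    fp = sum(t != label and p == label for t, p in zip(truth, pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

def cohens_kappa(truth, pred):
    """Agreement with the reference standard, corrected for chance."""
    n = len(truth)
    p_observed = sum(t == p for t, p in zip(truth, pred)) / n
    ct, cp = Counter(truth), Counter(pred)
    p_chance = sum(ct[label] * cp[label] for label in LABELS) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Toy reference-standard labels and one rater's classifications
truth = ["CD", "CD", "UC", "UC", "IBD-U", "IBD-U", "NIBDC", "NIBDC"]
pred  = ["CD", "CD", "UC", "UC", "CD",    "IBD-U", "NIBDC", "NIBDC"]
accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
```

In the study these quantities were computed per rater group against the expert pathologist's reference standard, with κ interpreted on the usual scale (fair, moderate, substantial).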
Results
GPT-5 achieved the highest accuracy (76.0%), outperforming pathologists (68.6%), IBD-expert GIs (69.2%), and non-expert GIs (63.2%). Agreement with the reference standard was substantial for GPT-5 (κ=0.671) and moderate for the human groups (κ=0.508–0.588). GPT-5 showed perfect recall for CD and UC and high recall for NIBDC (96.0%), but poor performance for IBD-U (recall 8.0%, F1-score 14.3%) (Table 1). Fleiss' κ indicated moderate inter-rater agreement among pathologists and IBD-expert GIs, and fair agreement among non-expert GIs.
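Fleiss' κ, used above for inter-rater reliability within each five-rater group, generalizes Cohen's κ to a fixed-size panel of raters. The sketch below is illustrative only (toy cases, invented function name), implementing the standard formula from per-case category counts.

```python
# Illustrative sketch (not the study's code): Fleiss' kappa for n cases
# each classified by the same number of raters into fixed categories.
def fleiss_kappa(ratings, categories):
    """ratings: list of cases, each a list of one label per rater."""
    n = len(ratings)            # number of cases
    r = len(ratings[0])         # raters per case (constant)
    # Per-case category counts, then mean observed pairwise agreement
    counts = [[case.count(c) for c in categories] for case in ratings]
    p_bar = sum((sum(x * x for x in row) - r) / (r * (r - 1))
                for row in counts) / n
    # Chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n * r)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy panel: 3 raters classifying 4 cases into two diagnoses
cases = [["CD", "CD", "CD"],
         ["UC", "UC", "UC"],
         ["CD", "CD", "UC"],
         ["CD", "UC", "UC"]]
kappa = fleiss_kappa(cases, ["CD", "UC"])
```

Values around 0.41–0.60 are conventionally read as moderate agreement and 0.21–0.40 as fair, matching the interpretation used in the Results.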
Table 1. Comparative performance metrics of pathologists, IBD-expert GIs, non-expert GIs, and GPT-5 compared with the reference standard.
| Metric | Pathologists | IBD-expert GIs | IBD non-expert GIs | GPT-5 |
| --- | --- | --- | --- | --- |
| Accuracy (%) | 68.6 (64.4–72.5) | 69.2 (65.0–73.1) | 63.2 (58.9–67.3) | 76.0 (66.8–83.3) |
| Recall (%) – IBD-U | 33.6 (25.9–42.3) | 30.4 (23.0–38.9) | 32.8 (25.2–41.4) | 8.0 (2.2–25.0) |
| Recall (%) – CD | 80.0 (72.1–86.1) | 86.4 (79.3–91.3) | 68.0 (59.4–75.5) | 100.0 (86.7–100.0) |
| Recall (%) – UC | 84.8 (77.5–90.0) | 81.6 (73.9–87.4) | 76.0 (67.8–82.6) | 100.0 (86.7–100.0) |
| Recall (%) – NIBDC | 76.0 (67.8–82.6) | 78.4 (70.4–84.7) | 76.0 (67.8–82.6) | 96.0 (80.5–99.3) |
| Precision (%) – IBD-U | 56.0 (44.7–66.7) | 52.1 (40.8–63.1) | 42.3 (32.9–52.2) | 66.7 (20.8–93.9) |
| Precision (%) – CD | 83.3 (75.7–88.9) | 76.6 (69.0–82.8) | 83.3 (74.9–89.3) | 67.6 (51.5–80.4) |
| Precision (%) – UC | 60.6 (53.2–67.5) | 72.3 (64.4–79.1) | 57.9 (50.3–65.2) | 80.6 (63.7–90.8) |
| Precision (%) – NIBDC | 73.1 (64.9–80.0) | 67.6 (59.6–74.7) | 69.3 (61.2–76.4) | 82.8 (65.5–92.4) |
| F1-score (%) – IBD-U | 42.6 (33.1–50.0) | 38.5 (28.6–46.5) | 36.9 (28.9–44.7) | 14.3 (0.0–33.3) |
| F1-score (%) – CD | 81.6 (75.7–86.6) | 81.3 (76.0–85.9) | 75.0 (68.0–81.3) | 80.6 (67.9–90.5) |
| F1-score (%) – UC | 70.0 (64.5–76.4) | 76.7 (70.9–82.1) | 65.9 (59.5–71.6) | 89.3 (80.0–96.7) |
| F1-score (%) – NIBDC | 74.5 (68.5–80.0) | 72.8 (66.4–78.0) | 72.5 (66.1–78.7) | 89.0 (77.8–96.4) |
Conclusions
GPT-5 demonstrated reliable performance in interpreting combined endoscopic and histological IBD reports, exhibiting high accuracy and strong agreement with the reference standard. While unreliable for IBD-U, GPT-5 may serve as a supportive tool in the diagnosis and classification of IBD, particularly in centers with limited access to expert pathologists or IBD specialists.