Aims
Histopathological interpretation is crucial for diagnosing inflammatory bowel disease (IBD), distinguishing among Crohn's disease (CD), ulcerative colitis (UC), IBD-unclassified (IBD-U), and non-IBD colitis (NIBDC). However, interobserver variability and limited expertise can reduce diagnostic accuracy. Accurate evaluation of both endoscopic and histological reports is essential for optimal IBD diagnosis. Large Language Models (LLMs) such as GPT-5 may offer clinical support in interpreting combined endoscopic–histological reports.
Methods
We analyzed 100 real-life combined endoscopic and histological reports from ileo-colonoscopies, equally representing CD, UC, IBD-U, and NIBDC, collected across five Italian healthcare centers, including both IBD-specialized and non-specialized hospitals. A reference standard was established by an expert pathologist. Independent classifications were generated by GPT-5, five gastrointestinal pathologists, five IBD-expert gastroenterologists (GIs), and five non-expert GIs. Diagnostic performance (accuracy, recall, precision, F1-score), agreement with the reference standard (Cohen’s κ), and inter-rater reliability (Fleiss’ κ) were assessed.
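The performance metrics listed above can be illustrated with a short sketch. This is not the study's code: the toy `truth`/`pred` labels and the function names are invented for illustration, and the formulas are the standard definitions of per-class recall, precision, F1-score, and Cohen's κ.

```python
# Illustrative sketch (not the study's code or data): standard definitions
# of accuracy, per-class recall/precision/F1, and Cohen's kappa against a
# reference standard. Labels follow the abstract; the data are toy examples.
from collections import Counter

LABELS = ["CD", "UC", "IBD-U", "NIBDC"]

def per_class_metrics(truth, pred, label):
    """Recall, precision, and F1-score for one diagnostic class."""
    tp = sum(t == label and p == label for t, p in zip(truth, pred))
    fn = sum(t == label and p != label for t, p in zip(truth, pred))
    fp = sum(t != label and p == label for t, p in zip(truth, pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

def cohens_kappa(truth, pred):
    """Agreement with the reference standard, corrected for chance."""
    n = len(truth)
    p_observed = sum(t == p for t, p in zip(truth, pred)) / n
    ct, cp = Counter(truth), Counter(pred)
    p_chance = sum(ct[label] * cp[label] for label in LABELS) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Toy reference-standard labels and one rater's classifications
truth = ["CD", "CD", "UC", "UC", "IBD-U", "IBD-U", "NIBDC", "NIBDC"]
pred  = ["CD", "CD", "UC", "UC", "CD",    "IBD-U", "NIBDC", "NIBDC"]
accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
```

In the study these quantities were computed per rater group against the expert pathologist's reference standard, with κ interpreted on the usual scale (fair, moderate, substantial).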
Results
GPT-5 achieved the highest accuracy (76.0%), outperforming pathologists (68.6%), IBD-expert GIs (69.2%), and non-expert GIs (63.2%). Agreement with the reference standard was substantial for GPT-5 (κ=0.671) and moderate for the human groups (κ=0.508–0.588). GPT-5 showed perfect recall for CD and UC and high recall for NIBDC (96.0%), but poor performance for IBD-U (recall 8.0%, F1-score 14.3%) (Table 1). Fleiss' κ indicated moderate inter-rater agreement among pathologists and IBD-expert GIs, and fair agreement among non-expert GIs.
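Fleiss' κ, used above for inter-rater reliability within each five-rater group, generalizes Cohen's κ to a fixed-size panel of raters. The sketch below is illustrative only (toy cases, invented function name), implementing the standard formula from per-case category counts.

```python
# Illustrative sketch (not the study's code): Fleiss' kappa for n cases
# each classified by the same number of raters into fixed categories.
def fleiss_kappa(ratings, categories):
    """ratings: list of cases, each a list of one label per rater."""
    n = len(ratings)            # number of cases
    r = len(ratings[0])         # raters per case (constant)
    # Per-case category counts, then mean observed pairwise agreement
    counts = [[case.count(c) for c in categories] for case in ratings]
    p_bar = sum((sum(x * x for x in row) - r) / (r * (r - 1))
                for row in counts) / n
    # Chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n * r)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy panel: 3 raters classifying 4 cases into two diagnoses
cases = [["CD", "CD", "CD"],
         ["UC", "UC", "UC"],
         ["CD", "CD", "UC"],
         ["CD", "UC", "UC"]]
kappa = fleiss_kappa(cases, ["CD", "UC"])
```

Values around 0.41–0.60 are conventionally read as moderate agreement and 0.21–0.40 as fair, matching the interpretation used in the Results.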
Table 1. Comparative performance metrics of pathologists, IBD-expert GIs, non-expert GIs, and GPT-5 compared with the reference standard.
| Metric | Pathologists | IBD-expert GIs | IBD non-expert GIs | GPT-5 |
| --- | --- | --- | --- | --- |
| Accuracy (%) | 68.6 (64.4–72.5) | 69.2 (65.0–73.1) | 63.2 (58.9–67.3) | 76.0 (66.8–83.3) |
| Recall (%) – IBD-U | 33.6 (25.9–42.3) | 30.4 (23.0–38.9) | 32.8 (25.2–41.4) | 8.0 (2.2–25.0) |
| Recall (%) – CD | 80.0 (72.1–86.1) | 86.4 (79.3–91.3) | 68.0 (59.4–75.5) | 100.0 (86.7–100.0) |
| Recall (%) – UC | 84.8 (77.5–90.0) | 81.6 (73.9–87.4) | 76.0 (67.8–82.6) | 100.0 (86.7–100.0) |
| Recall (%) – NIBDC | 76.0 (67.8–82.6) | 78.4 (70.4–84.7) | 76.0 (67.8–82.6) | 96.0 (80.5–99.3) |
| Precision (%) – IBD-U | 56.0 (44.7–66.7) | 52.1 (40.8–63.1) | 42.3 (32.9–52.2) | 66.7 (20.8–93.9) |
| Precision (%) – CD | 83.3 (75.7–88.9) | 76.6 (69.0–82.8) | 83.3 (74.9–89.3) | 67.6 (51.5–80.4) |
| Precision (%) – UC | 60.6 (53.2–67.5) | 72.3 (64.4–79.1) | 57.9 (50.3–65.2) | 80.6 (63.7–90.8) |
| Precision (%) – NIBDC | 73.1 (64.9–80.0) | 67.6 (59.6–74.7) | 69.3 (61.2–76.4) | 82.8 (65.5–92.4) |
| F1-score (%) – IBD-U | 42.6 (33.1–50.0) | 38.5 (28.6–46.5) | 36.9 (28.9–44.7) | 14.3 (0.0–33.3) |
| F1-score (%) – CD | 81.6 (75.7–86.6) | 81.3 (76.0–85.9) | 75.0 (68.0–81.3) | 80.6 (67.9–90.5) |
| F1-score (%) – UC | 70.0 (64.5–76.4) | 76.7 (70.9–82.1) | 65.9 (59.5–71.6) | 89.3 (80.0–96.7) |
| F1-score (%) – NIBDC | 74.5 (68.5–80.0) | 72.8 (66.4–78.0) | 72.5 (66.1–78.7) | 89.0 (77.8–96.4) |
Conclusions
GPT-5 demonstrated reliable performance in interpreting combined endoscopic and histological IBD reports, exhibiting high accuracy and strong agreement with the reference standard. While unreliable for IBD-U, GPT-5 may serve as a supportive tool in the diagnosis and classification of IBD, particularly in centers with limited access to expert pathologists or IBD specialists.