Advanced Ultrasound in Diagnosis and Therapy ›› 2024, Vol. 8 ›› Issue (4): 250-254. DOI: 10.37015/AUDT.2024.240002

• Original Research •

Performance of ChatGPT and Radiology Residents on Ultrasonography Board-Style Questions

Xu Jiale a,b,1, Xia Shujun a,b,1, Hua Qing a,b, Mei Zihan a,b, Hou Yiqing a,b, Wei Minyan a,b, Lai Limei a,b, Yang Yixuan a,b, Zhou Jianqiao a,b,*

  a Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
  b College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China
  • Received: 2024-01-19  Accepted: 2024-03-18  Online: 2024-12-30  Published: 2024-11-12
  • Contact: Zhou Jianqiao, E-mail: zhousu30@126.com
  • About author:

    1 Jiale Xu and Shujun Xia contributed equally to this study.

Abstract:

Objective: This study aims to assess the performance of the Chat Generative Pre-Trained Transformer (ChatGPT), specifically the GPT-3.5 and GPT-4 versions, on ultrasonography board-style questions, and to compare it with the performance of third-year radiology residents on the same set of questions.
Methods: The study, conducted from May 19 to May 30, 2023, used 134 multiple-choice questions drawn from a commercial question bank for the American Registry for Diagnostic Medical Sonography (ARDMS) examinations; each question was entered into ChatGPT (both the GPT-3.5 and GPT-4 versions). ChatGPT’s responses were evaluated overall, by topic, and by GPT version. The same question set was given to three third-year radiology residents, enabling a direct comparison of their performance with that of ChatGPT.
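The abstract does not state whether the questions were submitted through the ChatGPT web interface or programmatically; the following is a minimal sketch of programmatic submission, assuming the OpenAI Python client, with the model names and the ask_question() helper being illustrative assumptions rather than the authors' actual workflow.

```python
# Hedged sketch: submitting one board-style multiple-choice question to a GPT model.
# Assumptions (not from the paper): the OpenAI Python client is used, "gpt-4" /
# "gpt-3.5-turbo" stand in for the versions tested, and ask_question() is a
# hypothetical helper, not the authors' code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_question(stem: str, options: dict[str, str], model: str = "gpt-4") -> str:
    """Send a single multiple-choice question and return the model's raw answer text."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with the single best option letter."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content


# Example usage with a made-up question (illustrative only):
answer = ask_question(
    "Which artifact appears distal to a strongly attenuating structure?",
    {"A": "Acoustic shadowing", "B": "Posterior enhancement",
     "C": "Mirror image", "D": "Comet tail"},
)
print(answer)
```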
Results: GPT-4 correctly answered 82.1% of questions (110 of 134), significantly surpassing GPT-3.5 (P = 0.003), which correctly answered 66.4% (89 of 134). GPT-3.5’s performance was statistically indistinguishable from the residents’ average performance (66.7%, 89.3 of 134) (P = 0.969), whereas GPT-4 differed significantly from the residents in question-answering accuracy (P = 0.004).
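The abstract reports P values but does not name the statistical test; the sketch below assumes a chi-squared test of independence on the 2×2 table of correct/incorrect counts, which, without Yates correction, approximately reproduces the reported GPT-4 vs. GPT-3.5 comparison (P ≈ 0.003). The authors' exact method may differ.

```python
# Hedged sketch: comparing GPT-4 vs. GPT-3.5 accuracy on the 134 questions.
# Assumption (not stated in the abstract): a chi-squared test of independence
# on the 2x2 table of correct/incorrect counts.
from scipy.stats import chi2_contingency

# Correct / incorrect counts reported in the abstract
table = [
    [110, 134 - 110],  # GPT-4:   110 correct of 134
    [89,  134 - 89],   # GPT-3.5:  89 correct of 134
]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"GPT-4 accuracy:   {110/134:.1%}")
print(f"GPT-3.5 accuracy: {89/134:.1%}")
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")  # ~0.003 without Yates correction
```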
Conclusions: ChatGPT demonstrated substantial competence in answering ultrasonography board-style questions, with the GPT-4 version markedly surpassing both its predecessor GPT-3.5 and the radiology residents.

Key words: Artificial intelligence; Ultrasonography; Accuracy; Medical education