Presentation Description
Ankita Vayalapalli1
Mesk M Nafea1, Vivien Makhoul1 and Rodger D MacArthur1,2
1 Office of Academic Affairs, Medical College of Georgia at Augusta University, Augusta, Georgia, U.S.
2 Division of Infectious Diseases, Medical College of Georgia at Augusta University, Augusta, Georgia, U.S.
The popularity of artificial intelligence programs such as ChatGPT (v3.5) underscores the need for a better understanding of their role in medical training. We assessed the accuracy and reliability of ChatGPT on a standardized United States Medical Licensing Examination (USMLE) Step 1 practice examination. While others have assessed the accuracy of ChatGPT, this work is the first of its kind to assess both accuracy and reliability.
The dataset was obtained from the 2013-2014 Free 120, a USMLE Step 1 practice exam of 120 multiple-choice questions published by the National Board of Medical Examiners. To assess reliability, we conducted three runs on the same set of 120 questions: in each run, we asked ChatGPT the full set of questions and recorded its responses. To assess accuracy, we compared ChatGPT's answers to the official answer key.
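The abstract does not specify how the questions were presented to ChatGPT. The sketch below is one way to reproduce the protocol programmatically; the use of the OpenAI Python client, the file name free120.json, and the answer-letter parsing heuristic are all illustrative assumptions, not the authors' method.

```python
# Illustrative sketch of the three-run protocol: ask ChatGPT every question
# three times and score each run against the answer key. Model choice, input
# file, and parsing are assumptions made for this example.
import json
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(question: str) -> str:
    """Send one multiple-choice question; return the chosen answer letter."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": question + "\nAnswer with a single letter."}],
    )
    match = re.search(r"\b([A-H])\b", resp.choices[0].message.content)
    return match.group(1) if match else ""

# Hypothetical input file: a list of {"q": <question text>, "key": <letter>}.
with open("free120.json") as f:
    items = json.load(f)

# Three independent runs over the same question set.
runs = [[ask(item["q"]) for item in items] for _ in range(3)]

for i, run in enumerate(runs, 1):
    acc = sum(ans == item["key"] for ans, item in zip(run, items)) / len(items)
    print(f"run {i} accuracy: {acc:.1%}")
```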
ChatGPT answered with an accuracy of 77.6%. A chi-squared analysis yielded χ²(2, N = 109) = 3.33, p = 0.77. ChatGPT performed most accurately on pathology questions (81%) and least accurately on ethics questions (61%). Across the three trials of the same questions, ChatGPT changed answers 31% of the time.
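One plausible reading of the reported statistic is a chi-squared test of homogeneity of accuracy across the three runs (two degrees of freedom for three runs with correct/incorrect outcomes). The sketch below illustrates both reported metrics; every count and answer list in it is a placeholder, not the study's data.

```python
# Minimal sketch of the two result metrics; all values are placeholders.
from scipy.stats import chi2_contingency

# Chi-squared test of homogeneity: rows are runs, columns are
# (correct, incorrect) counts; each row sums to 109 scoreable questions.
table = [[85, 24],
         [83, 26],
         [81, 28]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}, N=109) = {chi2:.2f}, p = {p:.2f}")

# Reliability: share of questions whose answer changed across the three runs.
runs = [["A", "B", "C", "D"],  # run 1 answers (placeholders)
        ["A", "C", "C", "D"],  # run 2
        ["A", "B", "E", "D"]]  # run 3
changed = sum(len(set(answers)) > 1 for answers in zip(*runs))
print(f"answers changed on {changed / len(runs[0]):.0%} of questions")
```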
With the rapid emergence of ChatGPT among medical students preparing for assessments, it is important to understand its accuracy and reliability. Our results suggest that ChatGPT is lacking in both. Compared with other study tools, ChatGPT falls far short of even near-perfect accuracy. Most alarming is its lack of reliability: ChatGPT failed to remain consistent between trials.
Despite ChatGPT achieving a "Pass" on USMLE Step 1, medical students must be aware of its deficits in accuracy and reliability. Inconsistency between trials indicates that the technology is not only inaccurate, but also inconsistently inaccurate. Artificial intelligence is garnering immense traction, and it is of utmost importance to learn to navigate such technologies by demystifying the major limitations of programs like ChatGPT.