Presentation Description
Leo Morjaria1, Levi Burns1, Keyna Bracken1,2, Quang Ngo1,2, Mark Lee2 and Matthew Sibbald1,2
1 Michael G. DeGroote School of Medicine, McMaster University
2 McMaster Education Research, Innovation and Theory (MERIT) Program
Background:
Following its launch [1], ChatGPT achieved passing grades on question subsets of standardized medical licensing examinations such as the USMLE [2,3], posing a threat to the validity of medical student assessment. This study evaluates the extent of that threat for short-answer assessment problems used as important learning benchmarks for pre-clerkship students.
Summary of work:
Forty problems were drawn from past student assessments: ChatGPT generated responses to 30 of them, and minimally passing responses from past students were retrieved for the remaining 10. Problems were selected to encompass both lower- and higher-order cognitive domains. Minimally passing responses were chosen because they reflect the standard of competency expected of students in the program. Six experienced tutors graded all 40 responses. Standard statistical techniques were applied to compare performance between student-generated and ChatGPT-generated answers. ChatGPT performance was also compared with historical student averages at our institution.
Results:
ChatGPT-generated responses received a mean score of 3.29 out of 5 (n=30, 95% CI 2.93-3.65), compared with 2.38 for minimally passing students (n=10, 95% CI 1.94-2.82), a significantly stronger performance (p=0.008, η²=0.169). However, ChatGPT was outperformed by the historical class average (mean 3.67, p=0.018) when all past responses were included regardless of student performance level. There was no significant trend in performance across domains of Bloom's Taxonomy.
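For illustration only, the sketch below shows the kind of two-group comparison reported above (a one-way ANOVA with an eta-squared effect size), assuming hypothetical tutor scores on a 1-5 scale. The score arrays are placeholders rather than the study data, and the exact statistical procedure used in the study is not specified here.

```python
# Minimal sketch: comparing ChatGPT-generated and minimally passing student
# scores with a one-way ANOVA and an eta-squared effect size.
# The score arrays below are hypothetical placeholders, not the study data.
import numpy as np
from scipy import stats

chatgpt_scores = np.array([3.5, 3.0, 4.0, 2.5, 3.5, 3.0, 4.5, 3.0, 3.5, 2.5])  # placeholder
student_scores = np.array([2.0, 2.5, 3.0, 2.0, 2.5, 2.0, 3.0, 2.5, 2.0, 2.5])  # placeholder

# One-way ANOVA (equivalent to an independent-groups t-test with two groups)
f_stat, p_value = stats.f_oneway(chatgpt_scores, student_scores)

# Eta squared: between-group sum of squares over total sum of squares
all_scores = np.concatenate([chatgpt_scores, student_scores])
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                 for g in (chatgpt_scores, student_scores))
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p = {p_value:.3f}, eta^2 = {eta_squared:.3f}")
```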
Discussion:
ChatGPT was effective in answering short-answer assessment problems across the pre-clerkship curriculum. Human assessors were often unable to distinguish between responses generated by ChatGPT and those produced by students.
Conclusions:
While ChatGPT reached passing grades on our short-answer assessments, it outperformed only minimally passing students and did not exceed the historical class average.
Take-Home Messages/Implications:
Risks to assessment validity include a reduced ability to identify struggling students. Areas of future research include ChatGPT's performance on higher-order cognitive tasks, its role as a learning tool, and its potential for evaluating assessments.
References:
[1] Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel). 2023;11(6):887. doi:10.3390/healthcare11060887
[2] Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
[3] Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312