Artificial intelligence / Large language models
Oral Presentation
4:00 pm
26 February 2024
M209
Session Program
4:00 pm
Sumaiya Adam1
Yvette Hlophe1 and Vanessa Burch2
1 University of Pretoria
2 Colleges of Medicine of South Africa
Introduction:
The CMSA aims to implement workplace-based assessment (WBA) for postgraduate medical training by 2025. Entrustable professional activities (EPAs) define the knowledge, skills and attitudes expected of specialists, thereby facilitating training and assessment. Whilst WBA with an EPA framework has been implemented in various countries, there is scant data on experiences in resource-limited settings, where health professions education (HPE) experts are scarce and development and implementation depend on busy discipline-specific experts.
Purpose:
To define EPAs for ObGyn postgraduate training, quantify the importance of each EPA, and explore the role of ChatGPT4 in defining competencies for each EPA.
Methodology:
Individual EPAs were defined using Backward Design to describe Day 1 specialist competencies. Two rounds of a modified Delphi survey and an in-person workshop were conducted to obtain consensus on the core EPAs. Detailed competencies were then defined for each EPA and agreed through further Delphi surveys. The use of ChatGPT4 was explored for defining the core activities and writing detailed EPAs.
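A minimal sketch of how ChatGPT4 might be prompted to draft competencies for a single EPA, for illustration only: the prompt wording, the example EPA, the model name and the OpenAI Python client call are assumptions rather than the authors' actual workflow, and any output would still pass through the Delphi and workshop consensus steps described above.

```python
# Hypothetical sketch: drafting Day 1 specialist competencies for one EPA.
# The prompt, the example EPA title and the model name are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

epa_title = "Manage a patient with postpartum haemorrhage"  # hypothetical EPA

prompt = (
    "You are an expert in obstetrics and gynaecology postgraduate training. "
    f"For the entrustable professional activity '{epa_title}', list the "
    "knowledge, skills and attitudes a Day 1 specialist must demonstrate, "
    "grouped under those three headings."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)  # draft for expert review
```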
Results:
Item analysis yielded 10 core EPAs of the 30 EPAs defined, with survey response rates of 8/10 (80%) and 6/10 (60%) for the first two Delphi rounds. The in-person workshop resulted in unanimous agreement that the 10 core EPAs were "absolutely essential", 15 were "moderately important" and 5 were "nice to have". Experts agreed that a stepwise increase in the level of competence was required, dependent on the stage of training. The second set of Delphi rounds had response rates of 3/10 (30%) and 1/10 (10%). ChatGPT4 output showed a positive correlation with the EPAs that defined the specialty, as well as with the defined competencies for each of the selected EPAs.
Conclusion:
WBA requires EPAs and benchmarks for each stage of training. By harnessing available knowledge, ChatGPT4 is a viable tool for defining EPAs for a specialty, thereby expediting the implementation of WBA in both resource-rich and resource-limited environments.
References (maximum three)
1. Caccia, N., Nakajima, A., Scheele, F., & Kent, N. (2015). Competency-Based Medical Education: Developing a Framework for Obstetrics and Gynaecology. Journal of Obstetrics and Gynaecology Canada (JOGC), 37(12), 1104–1112. https://doi.org/10.1016/s1701-2163(16)30076-7
2. Garofalo, M., & Aggarwal, R. (2018). Obstetrics and Gynecology Modified Delphi Survey for Entrustable Professional Activities: Quantification of Importance, Benchmark Levels, and Roles in Simulation-based Training and Assessment. Cureus, 10(7), e3051. https://doi.org/10.7759/cureus.3051
3. ten Cate, O. (2018). A primer on entrustable professional activities. Korean Journal of Medical Education, 30, 1–10. https://doi.org/10.3946/kjme.2018.76
4:15 pm
Louise Belfield1
Steven Roberts1, Chinedu Agwu1, Shafeena Anas1, Lisa Jackson1, Michael Ferenczi1 and Naomi Low-Beer1
1 Brunel University London
Background:
Brunel Medical School opened in 2022, embedding Team-Based Learning within a Programmatic Assessment strategy. Individual readiness assurance tests (iRATs), sampled longitudinally, count towards student progression decisions, requiring rapid creation of a high-volume, high-quality bank of single-best-answer (SBA) questions. Large Language Models (LLMs) such as ChatGPT are widely utilised for rapid text generation (1), presenting an opportunity to expedite question-bank building.
Summary of work:
Using an iterative process, our team of scientists, clinicians and educators refined and tested a framework of ChatGPT instructions, integrating specific commands for SBA writing, which enabled the generation of high-quality SBAs.
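A minimal sketch of what one structured SBA-writing instruction might look like; the command wording, blueprint fields and example values below are illustrative assumptions and do not reproduce the authors' validated framework.

```python
# Hypothetical template for a single SBA-writing command; the requirements
# listed mirror those named in the abstract (cover test, EDI, feedback), but
# the exact wording and fields are assumptions.
SBA_INSTRUCTION = """\
Act as an experienced {discipline} examiner.
Write one single-best-answer (SBA) question on: {learning_outcome}.
Requirements:
- A clinical vignette followed by a single lead-in question.
- Five options: one best answer and four plausible distractors.
- The item must pass the cover test (answerable without seeing the options).
- Use inclusive, bias-free patient descriptions (EDI considerations).
- Provide feedback explaining the best answer and each distractor.
Educational context: {year_of_study}.
"""

prompt = SBA_INSTRUCTION.format(
    discipline="physiology",                        # hypothetical values
    learning_outcome="regulation of cardiac output",
    year_of_study="Year 1 undergraduate medicine",
)
# The formatted prompt would then be sent to ChatGPT and the returned item
# reviewed by faculty before entering the question bank.
```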
Results:
ChatGPT could:
- Create new SBAs from scratch, including items for training, without compromising the existing question bank, and generate meaningful feedback.
- Quality assure SBAs, ensuring specific structural requirements were met (including passing the “cover test”), incorporating EDI considerations, meeting regulatory requirements and passing item performance analysis (psychometrics).
- Adapt items whilst ensuring a consistent style: improving poorly performing items, generating SBAs from different item formats, changing the educational context (e.g. year of study), modifying the regulatory, geographical, linguistic and cultural context, and altering question complexity.
- Engage educators in co-creation of SBAs, developing a culture of teamwork and creating a community of practice.
The effectiveness of ChatGPT in achieving these outcomes was influenced by the precision of the commands and by specifying the professional discipline of the question writer.
Discussion:
There are few reports on how faculty may utilise LLMs to develop and quality assure SBAs (2). We demonstrate how optimised ChatGPT commands can address challenging aspects of SBA writing, and yield high-volume, high-quality items.
Conclusions:
ChatGPT can accelerate SBA writing and quality assurance, adapting items for different global and educational contexts.
Take-home message:
With optimised use, ChatGPT is an effective educational tool to create, quality-assure and adapt SBA items, and engage the learning community.
References (maximum three)
1. Sullivan, M., Kelly, A. and McLaughlan, P. (2023) ‘ChatGPT in higher education: Considerations for academic integrity and student learning’, Journal of Applied Learning and Teaching, 6(1). https://doi.org/10.37074/jalt.2023.6.1.17
2. Sabzalieva, E. and Valentini, A. (2023) ChatGPT and Artificial Intelligence in higher education: Quick start guide. UNESCO. https://unesdoc.unesco.org/ark:/48223/pf0000385146
4:30 pm
Debra Sibbald1
Andrea Sweezey1
1 Leslie Dan Faculty of Pharmacy, University of Toronto
Background
Online admissions examinations, inherently vulnerable to academic dishonesty, are at heightened risk due to the rapid emergence of generative artificial intelligence (AI) tools such as ChatGPT since November 2022. We explored the impact of strategies to reduce or address AI threats prior to and after the 2023 spring admission cycle at the Leslie Dan Faculty of Pharmacy, University of Toronto.
Method
Key threats and vulnerabilities of online exams were identified through appraisal of published evidence. We instituted deterrent stratagems: increasing question and answer length, shortening timing, disabling copy functions, and personalizing questions for individual contexts. Academic misconduct penalties were emphasized. In a field test, selected participants were asked to use AI and attempt to circumvent detection. Assessors were trained in awareness and recognition using examples of AI responses and practiced rating authentic vs AI samples. A red-flag system was employed to alert for suspicious examples. Admission test results were analyzed for detections and red-flag comments.
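For illustration only, a sketch of the kind of record a red-flag system might capture when an assessor marks a response as suspicious; the class and field names are hypothetical, and the abstract does not describe the actual system's implementation.

```python
# Hypothetical red-flag record; class and field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RedFlag:
    applicant_id: str   # pseudonymised applicant identifier
    question_id: str    # item the concern relates to
    assessor_id: str
    reason: str         # e.g. "generic phrasing", "ignores personalised context"
    raised_at: datetime = field(default_factory=datetime.now)

# Example: an assessor flags a response that ignores the personalised scenario
# built into the revised questions.
flag = RedFlag("APP-0042", "Q-07", "ASSR-3", "ignores personalised scenario")
print(flag)
```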
Results
On analysis, it was not possible for assessors to detect AI responses. Tactics and approaches aimed at altering question elements were considered ineffective.
Conclusions
Published threat themes include detection, academic and professional credibility, implications for knowledge work and application, ethics and digital equity. Potentially effective strategies include actionable misconduct penalties; revision of assessment items to increase personalized elements and discourage prompt writing; and focused assessor training with AI samples.
Discussion
The defensibility of high stakes admissions tests is increasingly threatened by rapidly evolving technologies which overcome deterrent approaches. Challenges to authenticity include validity, reliability, equivalence, impact on learning, feasibility, acceptability and sustainability.
Take home messages
AI-generated responses challenge the fairness, transparency, and objectivity of virtual high-stakes online tests, impacting applicants, assessors, assessments, and institutions. Vigilant scrutiny for threats and realistic, timely, responsive strategies are imperative. Where resources permit, revisit the possibility of one-on-one live interviews, in person or online.
References (maximum three)
Dwivedi YK, Kshetri N, Hughes L, Slade EL, Jeyaraj A, Kar AK, et al. “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management. 2023;71.
Giannos P, Delardas O. Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations. JMIR Medical Education. 2023;9(1):e47737.
Abd-Elaal E-S, Gamage SH, Mills JE. Assisting academics to identify computer generated writing. European Journal of Engineering Education. 2022;47(5):725-45.
4:45 pm
Kenji Yamazaki1
Stanley Hamstra2 and Eric Holmboe3
1 Accreditation Council For Graduate Medical Education
2 University of Toronto
Abstract
The ACGME uses Milestone data to estimate residents’ probability of reaching recommended Milestone graduation goals. These predictive probabilities are now supported by validity evidence, including lower interpersonal and communication skills (ICS) ratings being associated with higher rates of post-training patient complaints. However, the ACGME’s current predictive probability estimation does not account for rating variability among programs. In this proof-of-concept study, we addressed this limitation using machine learning (ML) techniques to predict penultimate Milestone ratings.
We analyzed ICS ratings from the emergency medicine Milestones. The training set included 5808 residents from 144 programs across four academic cohorts (2013-2016); the test set comprised 1639 residents from 141 programs in the 2017 cohort. Using 11 algorithms, we predicted which residents would be rated 3.5 or lower (on a 5-point scale) for the ICS subcompetency at the penultimate assessment period. Predictors consisted of each resident's rating progression over the initial three semi-annual assessments, plus program identifiers. Models were evaluated on the test set using the F1-score (a combination of sensitivity and positive predictive value (PPV)) and accuracy.
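A minimal sketch of the setup described above, assuming tabular data with one row per resident: the first three semi-annual ICS ratings plus a one-hot-encoded program identifier feed a support vector machine (the best-fitting of the 11 algorithms; the others are omitted), evaluated with F1, sensitivity, PPV and accuracy. File and column names are assumptions.

```python
# Hypothetical reconstruction with scikit-learn; column and file names are
# assumptions, and only the SVM reported as best-fitting is shown.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

train = pd.read_csv("milestones_2013_2016.csv")  # hypothetical training cohorts
test = pd.read_csv("milestones_2017.csv")        # hypothetical test cohort

features = ["ics_rating_1", "ics_rating_2", "ics_rating_3", "program_id"]
target = "rated_3_5_or_lower"  # 1 if penultimate ICS rating <= 3.5, else 0

preprocess = ColumnTransformer(
    [("program", OneHotEncoder(handle_unknown="ignore"), ["program_id"])],
    remainder="passthrough",  # keep the three numeric ratings unchanged
)

model = Pipeline([("prep", preprocess), ("svm", SVC())])
model.fit(train[features], train[target])

pred = model.predict(test[features])
print("F1:", f1_score(test[target], pred))
print("Sensitivity:", recall_score(test[target], pred))   # recall
print("PPV:", precision_score(test[target], pred))        # precision
print("Accuracy:", accuracy_score(test[target], pred))
```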
The support vector machine algorithm with program identifiers was the best-fitting model in the training dataset, outperforming all models without program identifiers. In the test set, 959 residents (59%) from 129 programs were rated 3.5 or lower. Model evaluation yielded 0.78, 0.82, 0.75, and 0.74 for F1, sensitivity, PPV, and accuracy, respectively.
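As a consistency check on the reported figures (not a calculation presented by the authors), the F1-score is the harmonic mean of PPV and sensitivity, and the reported values agree:

$$
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{sensitivity}}{\mathrm{PPV} + \mathrm{sensitivity}} = \frac{2 \times 0.75 \times 0.82}{0.75 + 0.82} \approx 0.78
$$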
Forecasting residents’ performance status at the penultimate assessment period from the initial three assessment periods depends on program-level rating behaviors, suggesting that ML approaches can produce probability estimates tailored to each resident by program.
Our approach enables program directors to identify struggling trainees earlier and support their competency development under program-specific educational conditions. Further research should extend this method to other competencies and specialties and include trainee and training-site characteristics relevant to early-career performance.
References (maximum three)
Holmboe ES, Yamazaki K, Nasca TJ, Hamstra SJ. Using longitudinal milestones data and learning analytics to facilitate the professional development of residents: early lessons from three specialties. Acad Med. 2020;95(1):97–103.
Han M, Hamstra SJ, Hogan SO, Holmboe E, Harris K, Wallen E, Hickson G, Terhune KP, Brady DW, Trock B, Yamazaki K, Bienstock JL, Domenico HJ, Cooper WO. Trainee Physician Milestone Ratings and Patient Complaints in Early Posttraining Practice. JAMA Netw Open. 2023 Apr 3;6(4):e237588.
Hamstra SJ, Yamazaki K. A Validity Framework for Effective Analysis and Interpretation of Milestones Data. J Grad Med Educ. 2021 Apr;13(2 Suppl):75-80.