Data management and analytical approaches
Oral Presentation
2:00 pm
27 February 2024
M213
Themes
Theme 8: Evaluation
Session Program
2:00 pm
Joanna McFarlane1
Marcus Edwards1
1 Australian Pharmacy Council
This presentation describes a trial of a psychometric theory(1) in high-stakes exams, and how software design can help widen the bottleneck of exam content development.
In early 2021, our psychometric consultants suggested ‘pairwise scaling’ as a solution to the bottleneck of adding new items to our Intern Written exam. Similar to a comparative judgement technique used for marking(2), we designed a tool with which subject matter experts (SMEs) compare new and anchor items, producing a dataset from which a perceived scale rating is calculated so that new items can be used as scored items in live exams.
The method has been trialled in two remote workshops, producing comparison data used to determine scale values for new items in our 2021 and 2022 exams. Workshop data were analysed for judge consistency and yielded perceived difficulties for new items, informed by the anchor items and used when developing our 2022 exam forms.
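The abstract does not specify the scaling model, so the following is an illustrative sketch only: it assumes a Bradley-Terry-style formulation of the pairwise comparisons, fitted with a standard minorisation-maximisation update, with hypothetical item names and comparison counts.

```python
import numpy as np

# Hypothetical comparison counts: wins[i][j] = number of SME judgements
# rating item i as harder than item j.
items = ["anchor_A", "anchor_B", "new_1", "new_2"]
wins = np.array([
    [0, 3, 5, 2],
    [4, 0, 6, 1],
    [2, 1, 0, 3],
    [5, 6, 4, 0],
], dtype=float)

n = wins + wins.T               # total comparisons between each pair
strength = np.ones(len(items))  # Bradley-Terry strength parameters

# Minorisation-maximisation updates (Hunter-style iteration).
for _ in range(200):
    for i in range(len(items)):
        numer = wins[i].sum()
        denom = sum(n[i, j] / (strength[i] + strength[j])
                    for j in range(len(items)) if j != i)
        strength[i] = numer / denom
    strength /= strength.sum()  # normalise to keep the scale identified

# Log-strengths give interval-scale values comparable to Rasch-style logits.
for name, value in zip(items, np.log(strength)):
    print(f"{name}: {value:+.2f}")
```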
The data showed that SME speed during the task was a key factor in the alignment of responses across the group, regardless of clinical background. Analysis and evaluation of all 2022 exam sessions show consistency between SME data and live candidate data. Outliers identified across the items presented in 2022 were evaluated and resolved by revising our training messages for SMEs in subsequent workshops.
We believe our application of pairwise scaling is an effective method to alleviate the bottleneck of developing exam content, and we invite discussion and suggestions on the process, its application, and the data evaluation and analyses during our presentation.
Our experiences can inform other health programs seeking to use innovative methods and understand the effectiveness of pairwise scaling as a robust method to develop new items for scaled exams.
References (maximum three)
1. Andrich D. (1978) 'Relationships between the Thurstone and Rasch approaches to item scaling', Applied Psychological Measurement, 2(2), 451-462.
2. Bramley T. (2005) 'A Rank-Ordering Method for Equating Tests by Expert Judgement', Journal of Applied Measurement, 6(2), 202-223.
2:15 pm
Zheng-Wei Lee1
Olivia Ng1, Li Li1, Jowe Chu1, Lucy Victoria Everett Wilding1, Jennifer Anne Cleland1 and Dong Haur Phua1
1 Lee Kong Chian School of Medicine, Nanyang Technological University
Major curriculum reform is typically accompanied by a review of assessment processes. Reports of implementing assessment change are scarce, and there is little guidance on using technology and data analytics productively to inform and communicate assessment data. This abstract therefore provides an overview of the preliminary stages of introducing programmatic assessment (PA) as part of an MBBS curricular reform. PA shifts the emphasis from solely formal examination-type assessments to authentic, competency-based assessment drawing on multiple sources, including clinical workplace assessments, and multiple time points (Norcini et al., 2018; van der Vleuten et al., 2012).

We set out to develop an assessment dashboard that provides clear alignment between curriculum and assessment, signals progress, and promotes self-regulated learning for students, adopting a human-centric approach based on the design thinking framework's four iterative, solution-focused phases: discovery, ideation, experimentation, and evolution (Henriksen et al., 2017). In the discovery phase, we identified our primary challenge as consolidating different assessment data. Streamlining the assessment data infrastructure was a critical, and highly complex, first step, requiring engagement with stakeholders including faculty members, administrators, and students to understand their needs and requirements for an assessment dashboard, as well as identifying technological enablers (ideation). At this stage we also had to constructively align our learning outcomes and assessment items with the Singapore National Framework. Next, we reviewed our data architecture by testing the gathering of assessment information from different formative assessment items and learning platforms (experimentation). Evolution, in the form of piloting, will happen later in 2024.

We report on this user-centric development process to help others considering a shift to PA, and to offer transferable and practical working insights for stakeholders who seek to design personalised feedback that is fit for context and uses technology and data analytics effectively.
References (maximum three)
Henriksen, D., Richardson, C., & Mehta, R. (2017). Design thinking: A creative approach to educational problems of practice. Thinking skills and Creativity, 26, 140-153. https://doi.org/10.1016/j.tsc.2017.10.001
Norcini, J., Anderson, M. B., Bollela, V., Burch, V., Costa, M. J., Duvivier, R., Hays, R., Palacios Mackay, M. F., Roberts, T., & Swanson, D. (2018). 2018 consensus framework for good assessment. Medical Teacher, 40(11), 1102-1109. https://doi.org/10.1080/0142159X.2018.1500016
van der Vleuten, C. P. M., Schuwirth, L. W. T., Driessen, E. W., Dijkstra, J., Tigelaar, D., Baartman, L. K. J., & van Tartwijk, J. (2012). A model for programmatic assessment fit for purpose. Medical Teacher, 34(3), 205–214. https://doi.org/10.3109/0142159x.2012.652239
2:30 pm
Zoe Brody
Shelley Ross1
1 University of Alberta
Background:
Learning analytics, including performance and behavioural metrics, are used to understand and improve learning and learning environments. In our program, we use FieldNotes (documentation of feedback in workplace-based teaching) as part of the programmatic assessment of learners. FieldNotes can be de-identified and used for learning analytics. Recent concerns in the literature about health inequities for female patients with cardiovascular issues prompted us to consider ways that FieldNotes learning analytics data could be used to explore specific program evaluation questions. In this study, we used these learning analytics data to examine clinical teaching about gender differences in chest pain presentation.
Summary of work:
We conducted a secondary analysis of 12 years (July 2011 - June 2023) of archived FieldNotes learning analytics data. FieldNotes include narrative feedback to learners and descriptions of patient presentations, thus serving as a proxy for clinical teaching. FieldNotes about chest pain (search terms: “chest pain”, “heart”, “MI”) were searched for the term ‘atypical’, and the sex of the patient was then determined. A chi-square goodness-of-fit test was used to test the assumption that the proportion of ‘atypical’ classifications was equal across sexes.
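For readers who want the computational detail, a minimal sketch of the goodness-of-fit test described above. The totals (677 chest-pain FieldNotes, 76 labelled ‘atypical’) come from the results below, but the split across patient sex categories is hypothetical, and the three-category breakdown is an assumption based on the reported df = 2.

```python
from scipy.stats import chisquare

# Hypothetical split (for illustration only): chest-pain FieldNotes per
# documented patient sex, and how many of each were labelled 'atypical'.
chest_pain_notes = {"female": 300, "male": 330, "not recorded": 47}
atypical_notes   = {"female": 45,  "male": 24,  "not recorded": 7}

observed = list(atypical_notes.values())
total_atypical = sum(observed)
total_notes = sum(chest_pain_notes.values())

# Under the null hypothesis the 'atypical' label is spread across sexes in
# proportion to how often each sex appears among chest-pain FieldNotes.
expected = [total_atypical * n / total_notes for n in chest_pain_notes.values()]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, df = {len(observed) - 1}, p = {p:.5f}")
```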
Results:
The database search (N = 64,942) identified 677 (1.04%) FieldNotes about chest pain, of which 76 (11.2%) described symptoms as ‘atypical’. Female patients’ chest pain symptoms were described as ‘atypical’ significantly more frequently (X2 = 18.24; df = 2; p = 0.00011).
Discussion:
Our findings identified a teaching gap in our program around women’s health. Having objective evidence of this gap allows for targeted faculty development.
Conclusions:
Our study demonstrates the value of learning analytics for program evaluation and for examining quality-of-care issues. These data can also contribute to practice quality improvement.
Take-home messages/implications for further research:
Learning analytics data used for program evaluation can identify curriculum and teaching gaps.
References (maximum three)
1. Lee JR, Ross S. A comparison of resident-completed and preceptor-completed formative workplace-based assessments in a competency-based medical education program. Fam Med. 2022;54(8):599-605.
2. Ross S, Lawrence K, Bethune C, van der Goes T, Pélissier-Simard L, Donoff M, Crichton T, Laughlin T, Dhillon K, Potter M, Schultz K. Development, implementation, and meta-evaluation of a national approach to programmatic assessment in family medicine residency training. Academic Medicine 2023; 98(2):188-198
3. Ross S, Poth C, Donoff M, Humphries P, Steiner I, Schipper S, Janke F, Nichols D. The Competency-Based Achievement System (CBAS): Using formative feedback to teach and assess competencies with Family Medicine residents. Canadian Family Physician. 2011;57:e323-e330.
2:45 pm
Pin-Hsiang Huang1,2,3
Chen-Huan Chen2,1, Tyzh-Chang Hwang2,4 and Boaz Shulruf3,5
1 Taipei Veterans General Hospital
2 National Yang Ming Chiao Tung University
3 University of New South Wales
4 University of Missouri
5 University of Auckland
Academic promotion in a college of medicine is critical, and many domains and subdomains of applicants' academic performance are reviewed during the process. However, the thresholds that subdomains must meet for successful promotion are often obscure. This study proposes an evidence-based approach to setting these thresholds.
Full-time academic faculty members holding the ranks of assistant professor, associate professor and full professor at the College of Medicine, National Yang Ming Chiao Tung University were invited to participate. Participants responded to academic promotion scales comprising 3 domains, 21 subdomains (11 in research, 6 in service, and 4 in teaching) and 514 quantitative variables. Means and standard errors of each subdomain were calculated for the three ranks, and the novel equal-Z model was applied to determine the threshold for each subdomain between adjacent ranks: a Z score was calculated for each subdomain, and the threshold was set as the lower rank's mean plus its standard error multiplied by that Z score. Accuracy of classification was calculated as the percentage (0-100%) of correct classifications of upper and lower ranks based on the subdomain's threshold.
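To make the threshold rule concrete, here is a minimal sketch of one reading of the equal-Z calculation: the threshold sits the same number of standard errors above the lower rank's mean as below the upper rank's mean. The function names and faculty scores are hypothetical.

```python
import statistics

def equal_z_threshold(mean_lower, se_lower, mean_upper, se_upper):
    """Threshold T with (T - mean_lower)/se_lower == (mean_upper - T)/se_upper."""
    z = (mean_upper - mean_lower) / (se_lower + se_upper)
    return mean_lower + se_lower * z  # equivalently: mean_upper - se_upper * z

def classification_accuracy(threshold, lower_scores, upper_scores):
    """Share of faculty placed on the correct side of the threshold."""
    correct = sum(s < threshold for s in lower_scores) + \
              sum(s >= threshold for s in upper_scores)
    return correct / (len(lower_scores) + len(upper_scores))

def standard_error(scores):
    return statistics.stdev(scores) / len(scores) ** 0.5

# Hypothetical subdomain scores for assistant vs associate professors.
assistant = [12, 15, 9, 14, 11, 13, 10, 16, 12, 8, 14]
associate = [18, 22, 15, 25, 20, 17, 23, 19, 21, 16, 24]

t = equal_z_threshold(statistics.mean(assistant), standard_error(assistant),
                      statistics.mean(associate), standard_error(associate))
print(f"threshold = {t:.1f}, "
      f"accuracy = {classification_accuracy(t, assistant, associate):.0%}")
```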
A total of 41 participants were enrolled: 11 assistant professors, 11 associate professors, and 19 full professors. For the thresholds between assistant professor and associate professor, accuracy values ranged from 36.0% to 77.0%, and 17 subdomains had accuracy of more than 55%. For the thresholds between associate professor and full professor, accuracy values ranged from 34.5% to 70.0%, with 10 subdomains above 55%.
Subdomains with an accuracy of 55% or higher were considered useful for discriminating faculty performance. The model will be fine-tuned as the dataset grows and can adjust itself to new trends. This study demonstrates that the equal-Z model is a feasible algorithm for defining subdomain thresholds to support academic promotion decisions.
References (maximum three)
1. Shulruf, B., Coombes, L., Damodaran, A., ... & Harris, P. (2018). Cut-scores revisited: feasibility of a new method for group standard setting. BMC medical education, 18(1), 1-8.
2. Yeh, J. T., Shulruf, B., Lee, H. C., Huang, P. H., Kuo, W. H., ... & Chen, C. H. (2022). Faculty appointment and promotion in Taiwan’s medical schools, a systematic analysis. BMC Medical Education, 22, 356. https://doi.org/10.1186/s12909-022-03435-2
3. Shulruf, B., Yang, Y. Y., Huang, P. H., Yang, L. Y., Huang, C. C., Huang, C. C., ... & Kao, S. Y. (2020). Standard setting made easy: validating the Equal Z-score (EZ) method for setting cut-score for clinical examinations. BMC Medical Education, 20(1), 1-9.
3:00 pm
Rajneesh Kaur1
Richmond Jeremy, Sally Middleton and Joanne Hart1
1 University of Sydney, School of Medicine
Background
Each year, approximately 300 mandatory research projects are undertaken by Doctor of Medicine (MD) students at the University of Sydney and marked by approximately 150 academics and affiliates. Double marking, assessment rubrics and moderation of marking are used to enhance the objectivity and fairness of marking of written assessments. This study examined the impact of these quality assurance measures in the assessment of MD research projects.
Summary of work
We compared first and second assessors' marks for 801 research project final reports from the 2021 and 2023 MD cohorts. Statistical analysis included calculating the intraclass correlation coefficient, creating Bland-Altman plots, and using regression for proportional bias to assess agreement in marks. Consistency of examiner feedback comments was assessed through qualitative, thematic analysis of the comments.
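For illustration, a minimal sketch of the Bland-Altman and proportional-bias parts of the analysis using NumPy and SciPy. The marks are simulated, and the ICC step (not shown) could use any standard implementation.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical first- and second-assessor marks for the same reports.
rng = np.random.default_rng(0)
first = rng.normal(70, 8, size=50)
second = first + rng.normal(0, 6, size=50)  # imperfect agreement

means = (first + second) / 2
diffs = first - second

# Bland-Altman statistics: mean difference (bias) and 95% limits of agreement.
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)
print(f"bias = {bias:.2f}, limits of agreement = {bias - loa:.2f} to {bias + loa:.2f}")

# Proportional bias: regress differences on means; a slope reliably different
# from zero suggests disagreement grows with the size of the mark.
fit = linregress(means, diffs)
print(f"slope = {fit.slope:.3f}, p = {fit.pvalue:.3f}")
```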
Summary of results
A moderate intraclass correlation coefficient (ICC) was seen for the total mark (0.508; 95% CI: 0.411-0.588), with low to moderate ICCs for individual components of the report. Bland-Altman plots and qualitative findings further supported these results, showing low agreement and a lack of uniformity in examiner comments.
Discussion and conclusions
Final marks were awarded by averaging the marks of the first and second examiners, addressing the issues around low correlation. Whenever a large inconsistency (a difference of more than 15 marks between the two examiners) was noted, a third examiner was engaged to help resolve the difference. Given a large and changing pool of markers, a calibration process was not possible.
Take home messages / implications for further research and practice
Examiner training and engaging experts in the field as markers are recommended to further improve the quality of written assessment marking. Our next phase involves investigating the perspectives of both students and examiners regarding the assessment methods employed, to ensure quality of marking.
References (maximum three)
1. Bennett J. Second-marking and the academic interpretative community: ensuring reliability, consistency and objectivity between markers. Investigations in Teaching and Learning. 2005;4(1):80-86.
2. Giavarina D. Understanding Bland Altman analysis. Biochem Med (Zagreb). 2015 Jun 5;25(2):141-51. doi: 10.11613/BM.2015.015. PMID: 26110027.
3. Rone-Adams S, Naylor S. Examination of the Inter-Rater Agreement among Faculty Marking a Research Proposal on an Undergraduate Health Course. Internet Journal of Allied Health Sciences and Practice. 2009. DOI: 10.46743/1540-580X/2009.1267
3:15 pm
Siu Hong Michael Wan1
Hui Yin Wan2
1 The University of Notre Dame, Australia
2 Bond University
Background
Both the modified Angoff and Cohen methods are used for standard setting of the pass mark of multiple-choice question (MCQ) written examinations. Angoff methods are labour intensive, involving many academics, whereas the Cohen method is relatively simple to apply. We compared these standard setting methods in the final-year high-stakes examinations of our graduate-entry MD program.
Summary of Work
From 2018 to 2022, cohorts of final-year medical students, each comprising 115-125 students, sat the 100-item MCQ paper as part of their year-end summative examinations. A multidisciplinary panel of clinicians set the pass marks for the papers using the modified Angoff method. The modified Cohen mark was calculated from the cohort's performance scores.
Results
Pass marks derived from the modified Angoff method for the 2018-2022 cohorts were 53.3%, 51.2%, 52.2%, 54.8% and 53.0% respectively. The modified Cohen mark was calculated as 65% of the 90th percentile score of each student cohort, giving pass marks of 52.0%, 53.3%, 51.9%, 50.2% and 51.3% respectively. There was no significant difference in the number of students who failed the MCQ paper between the two standard setting methods.
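As a worked illustration of the modified Cohen calculation just described (65% of the score at the cohort's 90th percentile), a minimal sketch with hypothetical cohort scores:

```python
import numpy as np

def modified_cohen_pass_mark(scores, fraction=0.65, percentile=90):
    """Pass mark as a fraction of the score at the given cohort percentile."""
    return fraction * np.percentile(scores, percentile)

# Hypothetical cohort of 120 percentage scores on a 100-item MCQ paper.
rng = np.random.default_rng(1)
scores = np.clip(rng.normal(68, 10, size=120), 0, 100)

print(f"pass mark = {modified_cohen_pass_mark(scores):.1f}%")
```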
Discussion
The modified Cohen method gave pass marks very similar to those from the modified Angoff method in these high-stakes summative MCQ examinations. Limitations of the study include data from a single medical school.
Conclusions
This is an encouraging finding. As the modified Cohen method requires only simple calculations after the examination, it may be a more efficient method of standard setting for high-stakes summative examinations.
Take-home Message
The modified Cohen method could be used in high-stakes summative examinations to set the pass mark of written MCQ papers, minimising the human resources and time required and maximising efficiency. More historical data can be used to verify this method against current standard setting methods.
References (maximum three)
Taylor CA. Development of a modified Cohen method of standard setting. Med Teach. 2011;33(12):e678-82. doi: 10.3109/0142159X.2011.611192. PMID: 22225450.