Ottawa 2024

Standard setting and validity

Oral Presentation

2:00 pm

27 February 2024

M209

Session Program

Jacob Pearce1
Vernon Mogol1, Gabes Lau2, Barry Soans2 and Anne Rogan2
1 Australian Council for Educational Research
2 Royal Australian and New Zealand College of Radiologists




Borderline regression standard setting is considered best practice for determining pass marks in Objective Structured Clinical Examinations (OSCEs).(1) Candidates receive question-based marks for each station, and examiners also provide a global rating of candidate performance. The global scales themselves may be purely categorical, but are often 5-point ordinal scales. Recent work has interrogated whether these scales also behave as interval scales (with equal distances between categories) in practice.(2) However, the impact of the choice of category labels on the validity of the standard setting process is under-researched. 
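
To make the mechanics concrete, here is a minimal sketch of borderline regression in Python. All data and the 5-point coding are hypothetical, invented for illustration: station scores are regressed on examiners' global ratings, and the cut score is the regression prediction at the borderline category.

```python
import numpy as np

# Hypothetical data for one OSCE station: each candidate has a
# question-based score and an examiner global rating coded 1-5
# (1 = clear fail ... 3 = borderline ... 5 = excellent).
scores = np.array([12, 14, 15, 17, 18, 20, 22, 24, 25, 27], dtype=float)
ratings = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5], dtype=float)

# Ordinary least-squares regression of score on global rating.
slope, intercept = np.polyfit(ratings, scores, deg=1)

# The station pass mark is the predicted score at the borderline rating.
BORDERLINE = 3
cut_score = intercept + slope * BORDERLINE
print(f"Station cut score: {cut_score:.2f}")
```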

A 6-point categorical scale was applied during borderline regression for the Royal Australian and New Zealand College of Radiologists (RANZCR) OSCERs. The scale was based on several similar categorical scales from the assessment literature and did not involve numbers.(3) It comprised three ‘passing’ categories (Outstanding, Clear Pass and Borderline Pass) and three ‘failing’ categories (Borderline Fail, Clear Fail and Significant Concerns). We hypothesised that two borderline categories would be more helpful to examiners than the single borderline category used in the previous RANZCR Viva Examinations: when examiners are pressed on a borderline rating, they can often say which way they are leaning. Examiners underwent training and calibration, and were advised that ‘Borderline Pass’ should denote a minimally competent candidate. 
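
Under a six-category scale with two borderline labels, one plausible implementation (an assumption for illustration; the abstract does not specify the coding or the cut-point convention) maps the labels to ordinal codes and takes the pass mark at the regression prediction midway between Borderline Fail and Borderline Pass:

```python
import numpy as np

# Hypothetical ordinal coding of the six category labels.
CATEGORY_CODES = {
    "Significant Concerns": 1,
    "Clear Fail": 2,
    "Borderline Fail": 3,
    "Borderline Pass": 4,
    "Clear Pass": 5,
    "Outstanding": 6,
}

# Hypothetical station data: (score, examiner category label).
data = [(10, "Clear Fail"), (13, "Borderline Fail"), (15, "Borderline Fail"),
        (16, "Borderline Pass"), (18, "Borderline Pass"), (21, "Clear Pass"),
        (24, "Clear Pass"), (27, "Outstanding")]

scores = np.array([s for s, _ in data], dtype=float)
codes = np.array([CATEGORY_CODES[label] for _, label in data], dtype=float)

slope, intercept = np.polyfit(codes, scores, deg=1)

# Take the cut point midway between the two borderline categories
# (3.5 on this coding) -- one possible choice, not necessarily RANZCR's.
cut_score = intercept + slope * 3.5
print(f"Station cut score: {cut_score:.2f}")
```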

The borderline regression standard setting data were highly detailed and psychometrically robust. Separating borderline performance into two categories worked well in practice, and examiners found the scale straightforward to apply. The data demonstrated an empirical difference between the two borderline categories and provided more nuanced assessment evidence for review by the OSCER panel. 

The precise wording used in categorical rating scales does affect standard setting outcomes. The more important factor, however, is how examiners conceptualise the minimally competent candidate and appreciate the differences between the levels of candidate performance captured in the rating scale. 



References (maximum three) 

1. Boursicot, K., Kemp, S., Wilkinson, T., Findyartini, A., Canning, C., Cilliers, F., & Fuller, R. (2021). Performance assessment: Consensus statement and recommendations from the 2020 Ottawa Conference. Med Teach. 43(1):58-67. DOI: 10.1080/0142159X.2020.1830052 

2. McGown, P.J., Brown, C.A., Sebastian, A. et al. (2022). Is the assumption of equal distances between global assessment categories used in borderline regression valid? BMC Med Educ. 22, 708. https://doi.org/10.1186/s12909-022-03753-5 

3. Pearce, J., Reid, K., Chiavaroli, N., Hyam, D. (2021). Incorporating aspects of programmatic assessment into examinations: aggregating rich information to inform decision-making. Med Teach. 43(5):567-574. DOI:10.1080/0142159X.2021.1878122 

Karen Fung1
John Pugsley1 and Salma Satchu1
1 Pharmacy Examining Board of Canada 



At the onset of the COVID-19 pandemic, testing organizations developed and implemented innovative yet defensible methodologies to ensure assessment processes could still take place. This was especially true for licensure and certification bodies in healthcare, where demand for healthcare professionals was high. Adopting the philosophy of “the show must go on” under rapidly changing circumstances, psychometricians and testing organizations had to adapt their traditional processes and make use of available technology without jeopardizing the validity of their exams. 

Standard setting is the process of establishing a cut score for an assessment. It is an important process involving subject matter experts (SMEs) to ensure that levels of performance, or pass/fail decisions, are accurately defined. Prior to COVID-19, most testing organizations conducted standard setting in person because of the need for engaged, in-depth discussion among SMEs. 

In this presentation, the Pharmacy Examining Board of Canada will discuss the transition from in-person to virtual standard setting for performance-based exams (i.e., OSCEs). The considerations in planning, the adaptations made, and participants’ feedback will be presented from a psychometric perspective. Although psychometric analyses such as generalizability theory demonstrated that virtual standard setting is comparable to the in-person format, the in-person format remains preferable for various reasons. The decision to revert to the in-person format will also be discussed. 
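
As an illustration of the kind of comparison involved, here is a minimal one-facet generalizability sketch in Python. The data, panel design and function name are hypothetical; the PEBC's actual G-study design is not described in the abstract. Variance components from a persons-by-raters design yield a G coefficient that can be compared between virtual and in-person panels.

```python
import numpy as np

def g_coefficient(x):
    """One-facet persons-x-raters G study (fully crossed, one score per cell).
    Returns the relative G coefficient for the observed number of raters."""
    n_p, n_r = x.shape
    grand = x.mean()
    ss_p = n_r * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_p * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_pr = ((x - grand) ** 2).sum() - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # person (true-score) variance
    var_pr = ms_pr                          # person-x-rater interaction + error
    return var_p / (var_p + var_pr / n_r)

# Hypothetical panel ratings: rows = candidates, columns = SMEs.
virtual = np.array([[3, 4, 3], [5, 5, 4], [2, 3, 2], [4, 4, 5]], dtype=float)
in_person = np.array([[3, 3, 3], [5, 4, 5], [2, 2, 3], [4, 5, 4]], dtype=float)

print(f"G (virtual):   {g_coefficient(virtual):.2f}")
print(f"G (in-person): {g_coefficient(in_person):.2f}")
```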



References (maximum three) 

Cizek, G. J., & Bunch, M. B. (2007). Standard setting. Sage Publications. 

Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. An NCME Instructional Module. Educational Measurement: Issues and Practice, 23(4), 31-50. 

Neville Chiavaroli1
Jacob Pearce1
 1 Australian Council for Educational Research 



Background 
Marking schemes are essential for guiding assessors to evaluate candidate responses in line with learning objectives and the requirements of constructed-response tasks. They contribute to the validity of written assessments, as reflected in contemporary validity frameworks that recognise the importance of the scoring phase of assessment (Cook et al., 2015). In health professions assessment contexts, such schemes can be quite prescriptive, allocating specific marks to relevant and correct statements by candidates. Such ‘points-based’ schemes reflect the focus on knowledge typical of written examination questions. However, for more complex questions aiming to elicit higher-order reasoning, such as diagnostic reasoning or justification of proposed management, marking schemes need to enable and guide suitable assessor judgement. 


Discussion 
‘Levels-based’ marking schemes require assessors to assign responses to broad anticipated levels according to the accuracy, defensibility and/or overall quality of the response. These levels are ideally based on an underlying principle for discriminating between responses, along with a suitable range of marks allocated to responses judged at each level (Ahmed & Pollitt, 2011). A common concern among assessors is that such schemes are inherently subjective and therefore open to interpretation. Yet the alternative, drafting and applying points-based schemes to higher-order questions for the sake of ‘objectivity’, risks undermining the validity of those assessment tasks. 
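
One way to operationalise such a scheme is sketched below. The levels, descriptors and mark ranges are invented for illustration, not taken from the abstract: each level descriptor maps to a range of marks, leaving the exact mark within that range to assessor judgement.

```python
# Hypothetical levels-based marking scheme for a higher-order question
# scored out of 10. The discriminating principle here is the
# defensibility of the candidate's diagnostic reasoning.
LEVELS = {
    "Level 4: accurate, well-justified reasoning":     (9, 10),
    "Level 3: sound reasoning with minor gaps":        (6, 8),
    "Level 2: partially defensible, significant gaps": (3, 5),
    "Level 1: inaccurate or unjustified response":     (0, 2),
}

def validate_mark(level: str, mark: int) -> int:
    """Check that the assessor's mark falls within the chosen level's range."""
    low, high = LEVELS[level]
    if not low <= mark <= high:
        raise ValueError(f"Mark {mark} outside range {low}-{high} for {level!r}")
    return mark

mark = validate_mark("Level 3: sound reasoning with minor gaps", 7)
print(f"Awarded mark: {mark}")
```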


Implications for practice 
While the appeal of points-based schemes is understandable in terms of efficiency and perceived objectivity, the ultimate criterion for the choice of marking scheme is its alignment with the nature of the question and expected performance. Levels-based schemes, supported through appropriate marker training and calibration, provide more appropriate guidance to support marker appraisal of candidate responses to higher-order questions. As in clinical assessment more broadly, we need to accept and value the necessary subjectivity in assessment which comes with expert judgement (ten Cate & Regehr, 2019). 



References (maximum three) 

Ahmed A, Pollitt A. 2011. Improving marking quality through a taxonomy of mark schemes. Assess Educ. 18(3):259-278. 

Cook DA, Brydges R, Ginsburg S, Hatala R. 2015. A contemporary approach to validity arguments: a practical guide to Kane's framework. Med Educ. 49(6):560-575. 

ten Cate O, Regehr G. 2019. The Power of Subjectivity in the Assessment of Medical Trainees. Acad Med. 94(3):333-337. 

Marina Sawdon1
John McLachlan2
1 University of York, Hull York Medical School
2 University of Central Lancashire




Background
Conjunctive hurdles are widespread practice in OSCEs across the healthcare professions(1). They prevent compensation between stations, since allowing strong performance on one station to offset failure on another may pose a risk to patient safety. However, conjunctive approaches can only operate with whole numbers of stations: a rule that candidates should pass 75% of stations is easy to apply when there are 16 stations, but not when there are 15. 
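
To make the integer problem concrete (a worked sketch; the rounding convention shown is one possible choice, not one the abstract prescribes): 75% of 16 stations is exactly 12, but 75% of 15 is 11.25, forcing an arbitrary rounding decision.

```python
import math

def conjunctive_hurdle(n_stations: int, proportion: float = 0.75) -> int:
    """Number of stations that must be passed under a conjunctive rule.
    Rounding up is one convention; rounding down is equally arguable."""
    return math.ceil(n_stations * proportion)

for n in (15, 16):
    print(f"{n} stations: must pass {conjunctive_hurdle(n)} "
          f"({n * 0.75:.2f} before rounding)")
```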


Summary of work
Permission for this retrospective analysis of anonymised data was granted by the University Ethics Committee. Anonymised candidate total scores from several medical undergraduate OSCEs were analysed. Cut scores were determined using Borderline Regression. 


Method
Part 1. The mean number of stations passed in each 5% total-score band was plotted, with score on the abscissa and mean number of stations passed on the ordinate. 

Part 2. A normal distribution ogive was fitted to these data, minimising the differences between the data set and the curve. 
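
A minimal sketch of Parts 1 and 2 in Python follows. The data are hypothetical and scipy's curve_fit stands in for whatever fitting routine the authors used: mean stations passed per score band are fitted with a scaled normal CDF, and the score corresponding to any conjunctive condition is read off by inverting the fitted curve.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

N_STATIONS = 16

# Part 1 (hypothetical): mid-points of 5% total-score bands and the mean
# number of stations passed by candidates in each band.
band_mid = np.array([42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5])
mean_passed = np.array([3.1, 5.0, 7.2, 9.8, 12.0, 13.9, 15.1, 15.7])

# Part 2: fit a normal ogive scaled to the number of stations.
def ogive(x, mu, sigma):
    return N_STATIONS * norm.cdf(x, loc=mu, scale=sigma)

(mu, sigma), _ = curve_fit(ogive, band_mid, mean_passed, p0=[60.0, 10.0])
print(f"Difficulty (mu) = {mu:.1f}, discrimination (sigma) = {sigma:.1f}")

# Read off the score at which candidates pass 75% of stations on average.
score_75 = norm.ppf(0.75, loc=mu, scale=sigma)
print(f"Score equivalent to passing 75% of stations: {score_75:.1f}")
```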


Results
The curves generated in Part 1 are indicative of the ogive of a normal distribution. The mean and standard deviation of the fitted curve give evidence of the difficulty and the discrimination of each OSCE, providing a good descriptor of the performance of the OSCE as a whole. In addition, any desired conjunctive condition can now be read off as a score. 


Discussion
This normal ogive provides a unified description of the difficulty and discrimination of the OSCE circuit as a whole. 


Conclusions
This eases concerns about reliability. In addition, a single cut score integrating the conjunctive condition and the station scores can be calculated. 


Take home message
This avoids the possibility that a candidate might ‘pass’ on the cut score but fail on the conjunctive condition, and it allows the same conjunctive condition (e.g. that 75% of stations must be passed) to be employed regardless of how many stations are in the OSCE circuit. 




References (maximum three) 

1. Ben-David MF. 2000. AMEE Guide No. 18: Standard setting in student assessment. Med Teach. 22(2):120–130 

Daniel Zahra1
Louise Belfield2
1 Peninsula Medical School 
2 Brunel Medical School 


Background:
Whilst the Angoff method is widely used for standard setting and often seen as a ‘gold standard’ for criterion-referenced pass marks, a steadily growing body of work (e.g. Burr et al., 2017; 2021) has questioned both its underlying assumptions (what is a borderline candidate, or minimal competence, or difficulty?) and its practical implementation (how many judges are needed? should judgements be proportional or binary?). These are all important questions, but they often overlook the social aspects of the Angoff process: from defining the criteria to conceptualising the borderline candidate, and from judges’ experiences to the group dynamics of moderation. 
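
For readers unfamiliar with the mechanics, a minimal Angoff sketch in Python (all judgement values hypothetical): each judge estimates the probability that a borderline candidate answers each item correctly, and the exam cut score is the mean across judges and items.

```python
import numpy as np

# Hypothetical Angoff judgements: rows = judges, columns = items.
# Each value is the estimated probability that a borderline candidate
# answers the item correctly.
judgements = np.array([
    [0.60, 0.45, 0.80, 0.55],
    [0.65, 0.50, 0.75, 0.60],
    [0.55, 0.40, 0.85, 0.50],
])

# Cut score: mean expected score of the borderline candidate, as a
# percentage of the maximum mark.
cut = judgements.mean() * 100
print(f"Angoff cut score: {cut:.1f}%")
```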


Summary and Findings:
This work analyses variation between individual pre-moderation and collectively discussed post-moderation Angoff judgements across the last ten years of single-best-answer multiple-choice applied medical and dental knowledge examinations in our medical and dental schools. We consider the number of items changed and the direction of change, not only as a function of previously studied variables such as the number of judges and the type of judgement, but also in relation to the expertise of each judge and their seniority in the school structure. 
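
A sketch of the kind of change metric involved (hypothetical data; the authors' actual analysis is not specified at this level of detail): comparing a judge's pre-moderation item values with the post-moderation values gives the number of items changed, the direction of change, and the net effect on the cut score.

```python
import numpy as np

# Hypothetical item-level Angoff values for one judge, before and after
# group moderation (probabilities that a borderline candidate succeeds).
pre = np.array([0.60, 0.45, 0.80, 0.55, 0.70])
post = np.array([0.60, 0.50, 0.75, 0.60, 0.70])

delta = post - pre
print(f"Items changed: {np.count_nonzero(delta)} of {len(pre)}")
print(f"Raised: {np.count_nonzero(delta > 0)}, "
      f"lowered: {np.count_nonzero(delta < 0)}")
print(f"Net shift in cut score: {delta.mean() * 100:+.1f} percentage points")
```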


Discussion:
Findings are discussed in relation to managing these factors and their impact, their relationship to the assumptions of the Angoff standard setting method, and how to incorporate them effectively into staff training to raise awareness and promote reflection on the personal and social aspects of standard setting, working towards shared consensus and understanding of the required standards. 


Take-Home Points:
There is no ‘gold-standard’ for standard setting, and social factors are as important to consider as pedagogic ones, especially where group deliberation is a part of the method. However, these factors can enhance the process and increase its value when acknowledged and incorporated appropriately. 



References (maximum three) 

Burr, S., Martin, T., Edwards, J., Ferguson, C., Gilbert, K., Gray, C., Hill, A., Hosking, J., Johnstone, K., Kisielewska, J., Milsom, C., Moyes, S., Rigby-Jones, A., Robinson, I., Toms, N., Watson, H., & Zahra, D. (2021) Standard setting anchor statements: a double cross-over trial of two different methods. MedEdPublish 10(1) Article 32 

Burr, S.A., Zahra, D., Cookson, J., Salih, V.M., Gabe-Thomas, E., & Robinson, I.M. (2017). Angoff anchor statements: setting a flawed gold standard? MedEdPublish, 6(3) Article 53.