Remote and onsite scoring of OSCEs using generalisability theory: A three-year cohort study


Abstract

Introduction: Onsite scoring is common in traditional OSCEs, although the presence of raters creates the potential for an audience effect that may facilitate or inhibit performance. We aim to (1) analyze the reliability between onsite scoring (OS) and remote scoring (RS), and (2) explore the factors that affect scoring in the two locations.

Methods: A total of 154 students and 84 raters were enrolled at a single site during 2013–2015. We randomly selected six stations from a 12-station national high-stakes OSCE. We applied generalisability theory for the analysis and investigated raters' perceptions of the factors that affected RS scoring.

Results: The internal consistency reliability (Cronbach's α) of the checklists was 0.92. The kappa agreement was 0.623 and the G coefficient was 0.93. The major source of variance came from the students themselves, with smaller contributions from locations and raters. A three-component model comprising Technical Feasibility, Facilitates Wellbeing, and Observational and Attention Deficits explained 73.886% of the total variance in RS scoring.

Conclusions: Our study demonstrated moderate agreement and good reliability between OS and RS ratings. We identified facility operation and technological quality as the factors that most concerned RS raters. Remote scoring can provide an alternative forum for raters, overcoming the barriers of distance and space and avoiding the audience effect.

Practice points

  • Onsite scoring is common in traditional OSCEs but carries a potential audience effect.

  • The audience effect refers to the presence of an audience facilitating or inhibiting performance.

  • Facility operation and technological quality are the factors that affect remote scoring.

  • Remote scoring provides an alternative forum for assessment.

  • Optimizing the facility set-up requires balancing financial considerations against ease of facility operation.

Introduction

Since its introduction in 1975 (Harden et al. 1975), the OSCE has become a popular tool in medical education to assess the core competencies of medical students (Short et al. 2009). Carefully planned content, known as blueprinting, and assessments can test curricular learning objectives (Dauphinee et al. 1994; Wass et al. 2001) and enable students to demonstrate the 'shows how' level of Miller's pyramid of competence (Miller 1990). Traditionally, teachers rated students in close physical proximity: by their side, in the same room, or from behind a one-way mirror. From the perspective of social facilitation theory and the audience effect, the mere presence of an audience can lead to better performance by individuals in some cases (e.g. accuracy on simple or well-trained tasks) but worse in others (e.g. when the individual is performing a complex task) (Bond and Titus 1983; Aiello and Douthitt 2001). Given the variability of students' skills (a simple task for some might be perceived as a complex one by others) alongside the variability across OSCE stations (some are simply more complex than others), we cannot predict the impact of close-proximity raters on the performance of students and standardized patients (SPs).

In the 1970s, one-way mirrors were used to observe the performance of students during OSCEs, which prompted examination of their effect on student performance. Although psychological research suggests that task performance can be affected by an audience (Cohen and Davis 1973), research in medical education settings has reported no significant differences between physicians graded by an examiner in the same room and those graded by an examiner behind a one-way mirror (Corley and Mason 1976). Nevertheless, one-way mirrors have some disadvantages, including intimidating participants, providing a limited view, and increasing noise and light pollution (Ford 2008; Sauro 2016). Additionally, the audio quality from behind the mirror can be too poor for the raters to score effectively, and adequate space is needed for their construction; two aisles separating raters and students are necessary to prevent contact between them. In recent years, technology that permits more distant observation, such as video-based (Sturpe et al. 2010), camera-assisted, or web-based scoring (Novack et al. 2002), has been used to rate students' performance (Chan et al. 2014).

Furthermore, one of the most important considerations in OSCE scoring is reliability. Brannick et al. systematically reviewed the reliability of OSCEs across 188 alpha values from 39 studies and reported that the overall alpha across stations was 0.66, the overall alpha within stations across items was 0.78, and the generalisability coefficient was 0.49 (Brannick et al. 2011). Sturpe et al. (2010) examined intrarater reliability between real-time and video-based observation. Even though the same raters were used for each condition, 13.3% of the students rated as passing on real-time observation were rated as failing on video observation, and 3.3% of students rated as failing on real-time observation passed when rated by video observation (Sturpe et al. 2010). Reliability is therefore essential for assessment, especially for critical decision-making in a high-stakes OSCE.

The challenge of large-scale, multisite examinations

Since OSCEs progressed from end-of-course assessments to their use today as part of certification and licensing processes (Reznick et al. 1996; Whelan 2000; Dillon et al. 2004; Trewby 2005), the issues raised by large-scale and multisite OSCEs have been debated and investigated. Research has examined the recruitment and training of raters and SPs, differences in facility settings across sites, and the reliability of raters (Rahayu et al. 2016).

Although research is beginning to address the reliability and validity of multisite OSCE examinations, few studies have investigated the reliability and validity of remote versus onsite OSCE scoring systems. Chan et al. reported a preliminary study comparing OSCE scores from onsite (same room) and remote-site (webcam in another building) scorers (Chan et al. 2014). Although technical issues led to inconsistent data collection, they found high correlations for three of the six stations (history taking, physical examination, and management) using both systems, but global ratings varied greatly (Chan et al. 2014). They concluded that remote examination might be a feasible and acceptable way of assessing students' clinical skills despite the technological issues. However, their preliminary data provided insufficient evidence to establish inter-rater reliability or to identify the confounding factors in a high-stakes OSCE.

In summary, the challenge is for raters to observe the students' performance closely and clearly without disrupting it. The aims of this study are to (1) analyze the reliability between OS and RS raters, and (2) explore the factors that affect the accuracy of scoring in OS and RS locations. The research question is: Is there a difference in the assessment of medical student performance on the OSCE when undertaken via OS rating inside the examination rooms versus RS rating outside the rooms?

Methods

Context

The national high-stakes OSCE for seventh-year medical students has been held in Taiwan since 2013. It takes place across multiple sites during a six-day period annually. It consists of a twelve-station track that covers the broad areas of history taking, clinical reasoning, skills, management, communication, and counseling. Two circuits of students take the OSCE in a day. The specific tasks at the 12 stations differ day by day and are set by the Taiwan Society of Medical Education (TSME). The scoring sheet consists of a checklist of 10–15 items (each scored 0, 1, or 2) rating specific skills at each station and a global rating (1–5 points) for the overall score. All raters are qualified according to the requirements of the TSME, which also regulates the OS raters to ensure consistency across sites. The scores and the pass/fail decision for each task are also determined by the TSME.

Study design

A three-year, quasi-experimental cohort study was undertaken. The Institutional Review Board Ethical Committee of Chang Gung Medical Foundation approved the study. Participants were seventh-year medical students who took the national high-stakes OSCE at one hospital during 2013–2015.

Six stations were randomly selected from the 12-station OSCE each day. Twelve raters were divided into two paired groups, one OS and one RS. The OS raters scored the students inside the examination rooms, and the RS raters scored the same students from the central control room, observing them through two real-time video cameras. Both cameras had zoom-in/zoom-out functions with 360° adjustment. A total of 154 students (63.6% male) and 84 raters were enrolled (Table 1). During the study period, we analyzed 924 pairs of ratings.

Table 1. Participant demographics and locations.

We developed a questionnaire to examine RS raters' perceptions of the factors that affected their scoring. Its 10 items focused on the performance and effectiveness of the hardware and on the raters' experience of the rating task, each rated on a 5-point Likert scale from complete disagreement (1) to complete agreement (5).

Statistics and analysis

We analyzed the data using SPSS 12.0, applying Student's t-test and the chi-square test, and present results as mean ± standard deviation (SD). The reliability of the checklists was calculated with Cronbach's alpha. Agreement between the two groups of raters on pass/fail decisions was tested with the kappa statistic. We also analyzed the correlation between checklist scores and global ratings. We applied a generalisability (G) study using EDU-G to the three-year cohort to examine the reliability of the following facets: participants, circuits, days, raters, and OS/RS locations. We used these results to inform a decision (D) study for optimization of the process.
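As an illustration of the two reliability indices reported below, the following Python sketch computes Cronbach's alpha for a students-by-items checklist matrix and Cohen's kappa for paired OS/RS pass/fail decisions. The data are simulated and the variable names are hypothetical placeholders; this is not the study's actual analysis code.

```python
# Minimal sketch (not the study's code): Cronbach's alpha for a checklist
# matrix and Cohen's kappa for paired OS/RS pass/fail decisions, on simulated data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = examinees, columns = checklist items (scored 0, 1, or 2)."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
checklist = rng.integers(0, 3, size=(154, 12)).astype(float)  # simulated 0-1-2 item scores
os_passfail = rng.integers(0, 2, size=924)                    # simulated onsite pass/fail decisions
rs_passfail = rng.integers(0, 2, size=924)                    # simulated remote pass/fail decisions

print(f"Cronbach's alpha: {cronbach_alpha(checklist):.3f}")
print(f"Cohen's kappa:    {cohen_kappa_score(os_passfail, rs_passfail):.3f}")
```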

Principal component analysis (PCA) with varimax rotation was performed to analyze the factors affecting scoring as measured by the questionnaire. The intragroup and intergroup correlations between checklist scores and global ratings were tested with Pearson's correlation. Significance was set at p < 0.05.
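The analysis was run in SPSS; as a rough, illustrative equivalent, the sketch below extracts principal components from the item correlation matrix (retaining those with eigenvalues above 1) and applies a standard varimax rotation in NumPy. The 51 × 10 questionnaire matrix is simulated, so the loadings will not reproduce those in Table 6.

```python
# Illustrative PCA with varimax rotation on a simulated 51 x 10 Likert matrix.
import numpy as np

def varimax(loadings: np.ndarray, gamma: float = 1.0, max_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Rotate a p x k loading matrix using the standard varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var_sum = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        if s.sum() < var_sum * (1 + tol):   # stop when the criterion no longer improves
            break
        var_sum = s.sum()
    return loadings @ rotation

rng = np.random.default_rng(1)
responses = rng.integers(1, 6, size=(51, 10)).astype(float)   # simulated 1-5 Likert responses

corr = np.corrcoef(responses, rowvar=False)                   # 10 x 10 item correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]                             # sort components by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_keep = int((eigvals > 1.0).sum())                           # Kaiser criterion: eigenvalue > 1
loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])    # unrotated component loadings
rotated = varimax(loadings)

print("Components retained:", n_keep)
print(f"Total variance explained: {100 * eigvals[:n_keep].sum() / eigvals.sum():.1f}%")
print("Varimax-rotated loadings:\n", np.round(rotated, 3))
```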

Results

OSCE ratings

No significant difference was found between OS and RS checklist scores: the three-year mean checklist scores were 17.70 ± 4.22 in the OS group and 17.64 ± 4.42 in the RS group (p = 0.42). Overall, the internal consistency reliability of the checklists was Cronbach's α = 0.92. A significant difference was found for the mean global ratings: 3.59 ± 0.87 for the OS group and 3.51 ± 0.91 for the RS group (p < 0.01). The overall percentage of agreement on pass/fail ratings between OS and RS raters was 91.8% (848/924), and the kappa value was 0.623 (0.375–1.000).

The overall intragroup Pearson correlation between checklist scores and global ratings within each group was significant (r = 0.642, p < 0.001). Intergroup Pearson correlations were also significant for checklist scores (r = 0.851, p < 0.001) and global rating scores (r = 0.571, p < 0.001).

Generalisability study by EDU-G

The analysis of variance indicated that the corrected components of variance for the three-year cohort data comprised: participants 10.844, raters 0.002, locations 0.003, participant-rater 1.578, and participant-rater-location 6.594. The components of variance consisted of participants (57%), participant-rater (8.3%), and participant-rater-location (34.6%) (Table 2). In the G study of differentiation, the participant variance (10.844) was set against an absolute error variance comprising locations: 0.002 (0.2%); participant-rater: 0.263 (32.2%); and participant-rater-location: 0.549 (67.3%) (Table 3). The three-year G coefficient was 0.93 (0.95, 0.84, and 0.96 for each year, respectively). For the D study, we fixed the participants and locations and then optimized the number of raters by increasing it from 6 to 11. The relative/absolute G coefficients were 0.946/0.945, 0.841/0.838, and 0.963/0.963 with six raters, and rose to 0.970/0.969, 0.906/0.904, and 0.980/0.980 when extrapolating to 11 raters (Table 4).
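As an illustrative check, assuming a fully crossed participant × rater × location random design with six raters and two locations, and treating the unreported participant-location component as negligible, the relative G coefficient can be reconstructed from the variance components above:

```latex
% Relative G coefficient from the reported variance components
% (assumed fully crossed p x r x l design, n_r = 6, n_l = 2).
E\rho^{2} \;=\; \frac{\sigma^{2}_{p}}
{\sigma^{2}_{p} + \sigma^{2}_{pr}/n_r + \sigma^{2}_{prl}/(n_r n_l)}
\;=\; \frac{10.844}{10.844 + 1.578/6 + 6.594/12}
\;\approx\; 0.93
```

Adding the rater and location main effects to the denominator (0.002/6 and 0.003/2) gives the absolute coefficient, which remains approximately 0.93 under the same assumed design.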

Table 2. Analysis of variance among participants, raters and locations of scoring.

Table 3. Generalisability analysis of variance in the cohort period.

Table 4. Optimization analysis by the increment of raters.

Questionnaire

Fifty-eight questionnaires were mailed to the RS raters and 51 were returned for analysis (see Table 5 for details). Bartlett's test of sphericity was significant (p < 0.001) and the Kaiser–Meyer–Olkin (KMO) value was 0.703, supporting factor analysis. The ten items grouped into a three-component model with eigenvalues >1.0 (ranging from 1.632 to 3.150).

Table 5. Questionnaire items, means (SD) examining remote scorers’ perceptions of the factors that affect their scoring.

The three-component model is as follows: Component 1, Technical Feasibility, consists of ease of operation (0.898), video quality (0.891), camera setting (0.871), and audio quality (0.773); Component 2, Facilitates Wellbeing, consists of fewer interruptions to students during RS (0.908), fewer interruptions to standardized patients (SPs) during RS (0.890), and less stress during RS (0.818); Component 3, Observational and Attention Deficits, consists of difficulty in observing technical skills during RS (0.893), difficulty in observing interview skills during RS (0.877), and being easily distracted (0.545). Components 1, 2, and 3 accounted for 30.412%, 24.381%, and 19.093% of the variance, respectively, and together explained 73.886% of the total variance (Table 6).

Table 6. Principal component analysis with varimax rotation.

Discussion

We analyzed data from a three-year cohort study to ascertain the reliability and feasibility of remote and onsite scoring in a national high-stakes OSCE. Our findings build on the work of Chan et al. (2014), who demonstrated significant correlations between OS and RS examiners' checklist scores for three of six stations, with more variable correlations between OS and RS examiners' global ratings. Unfortunately, their preliminary study did not provide data on intrarater or inter-rater reliability. Moving the field forward, our study suggests that good reliability between remote and onsite scoring can be achieved not only for individual station scores but also for global rating scores, both within and between rater groups. We further found that the major source of variance comes from the students themselves, with some also coming from the rating location, and very little from the raters.

Testing knowledge and performance is difficult and complicated, and we need the right measurement approach to separate valid data from confounders. Generalisability theory allows us to separate noise from signal, identify sources of noise, and devise ways to reduce their contribution to final results. The theory provides a way to analyze the results of psychometric tests such as the OSCE (Bloch and Norman 2012). Brannick et al. systematically reviewed the reliability of OSCEs and reported variable alpha values and generalisability coefficients (Brannick et al. 2011). The results of a high-stakes OSCE will be open to criticism if reliability is poor. In our study, we selected six tasks from a twelve-station track to compare remote and onsite scoring. We obtained good G values and, through D-study optimization, were able to extrapolate the estimated G values that would have been obtained had we selected more than six tasks.

In this national OSCE, the raters scored the students but did not immediately decide whether each passed or failed. The pass/fail scores for each task were determined by the TSME using the borderline group method with regression. The raters were trained and qualified according to the regulations and requirements of the TSME. For the twelve task stations, the TSME randomly assigned six raters from other testing sites to avoid a halo effect (Iramaneerat and Yudkowsky 2007). Our results suggest a good correlation between the checklist scores and the global ratings. They also showed very little variation attributable to the raters, which provides a way for us to review the recruitment and training program for raters.

Facility operation and technological quality were the factors that most concerned the RS raters. We have demonstrated that remote video observation can provide an alternative forum for OSCE assessment in place of onsite observation, and the raters' greatest concern about technology-assisted scoring was hardware feasibility. The advantages of RS include less psychological stress for the raters and fewer interruptions to the students and standardized patients; in addition, the raters can use the cameras to observe detailed steps, such as suturing skills. Even in the conventional two-aisle set-up, poor audio quality from behind the mirror remains a concern for scoring, although the mirror provides at least one observational view that avoids camera-related video-quality problems. We could increase the number of cameras, or install both cameras and mirrors, to provide more views of the students; however, this would increase hardware costs and complicate facility operation for the raters. Optimizing the facility set-up for scoring effectiveness therefore requires balancing financial considerations against the feasibility of facility operation.

Study strengths and limitations

As with all research, our study has limitations. Although this is a national OSCE with many testing sites throughout the country, our study occurred at a single site. Although our findings suggest that RS is as reliable as OS, further large-scale, multisite studies are needed to validate the advantages and disadvantages of remote scoring. Despite this limitation, our study has strengths, including the use of rigorous data collection methods that enabled us to randomize OSCE stations and systematically collect data over multiple time periods. Furthermore, we had sufficient data to undertake a G study to examine inter-rater reliability and a subsequent D study to optimize the ratio of raters to the number of stations.

Future research

Although research has addressed some of these issues, we believe that rater reliability remains an area for further research to unpack the wider range of factors underlying successful scoring. Additionally, training for raters is required to minimize rating errors and help them reach a consensus on grading to ensure fairness. Furthermore, although we have demonstrated some of the benefits of remote assessment during OSCEs, we have not yet addressed the financial feasibility of this method. Remote video observation and scoring requires expensive high-technology hardware and software that in turn requires regular maintenance. Finally, the security and safety of the cables and the Internet connection remain a concern. As such, a cost-benefit study should be conducted to better understand when and where OS and RS ratings are beneficial.

Conclusions

Our study has demonstrated moderate agreement and good reliability between OS and RS ratings. Remote scoring can provide a way for raters to overcome the barriers of distance and space, and to avoid the interruption and bias of an audience effect caused by the presence of teachers or raters. The expanded use of reliable, technology-assisted scoring can also provide training practice for junior raters without interrupting the students and standardized patients.

Glossary

Remote rating/scoring: The use of reliable, user-friendly technology, such as video-based, camera-assisted, or web-based scoring, to rate students' performance from a distance, overcoming the barriers of distance and space and avoiding the interruption and bias of an audience effect.

Audience effect: A concept from psychology describing how the mere presence of an audience can lead to better performance by individuals in some cases (e.g. accuracy on simple or well-trained tasks) but worse performance in others (e.g. when the individual is performing a complex task).

Disclosure statement

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

Additional information

Funding

This work was supported by Kaohsiung Chang Gung Memorial Hospital, Chang Gung Medical Foundation, Taiwan [grant number CDRPG8E0033].

Notes on contributors

Te-Chuan Chen

Te-Chuan Chen, MD, Division of Nephrology, Department of Internal Medicine, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan. School of Medicine, Chang Gung University College of Medicine, Tao-Yuan, Taiwan. Chang Gung Memorial Hospital Linkou Branch, Chang Gung Medical Education Research Centre, Tao-Yuan, Taiwan.

Meng-Chih Lin

Meng-Chih Lin, MD, School of Medicine, Chang Gung University College of Medicine, Tao-Yuan, Taiwan. Division of Pulmonary Medicine, Department of Internal Medicine, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan.

Yuan-Cheng Chiang

Yuan-Cheng Chiang, MD, School of Medicine, Chang Gung University College of Medicine, Tao-Yuan, Taiwan. Department of Plastic and Reconstructive Surgery, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan.

Lynn Monrouxe

Lynn Monrouxe, PhD, Chang Gung Memorial Hospital Linkou Branch, Chang Gung Medical Education Research Centre, Tao-Yuan, Taiwan.

Shao-Ju Chien

Shao-Ju Chien, MD, Division of Pediatric Cardiology, Department of Pediatrics, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan. School of Traditional Chinese Medicine, Chang Gung University College of Medicine, Tao-Yuan, Taiwan.

References

  • Aiello JR, Douthitt EA. 2001. Social facilitation from Triplett to electronic performance monitoring. Group Dyn Theory Res Pract. 5:163–180.
  • Bloch R, Norman G. 2012. Generalizability theory for the perplexed: A practical introduction and guide: AMEE Guide No. 68. Med Teach. 34:960–992.
  • Bond CF Jr, Titus LJ. 1983. Social facilitation: a meta-analysis of 241 studies. Psychol Bull. 94:265–292.
  • Brannick MT, Erol-Korkmaz HT, Prewett M. 2011. A systematic review of the reliability of objective structured clinical examination scores. Med Educ. 45:1181–1189.
  • Chan J, Humphrey-Murto S, Pugh DM, Su C, Wood T. 2014. The objective structured clinical examination: can physician-examiners participate from a distance? Med Educ. 48:441–450.
  • Cohen JL, Davis JH. 1973. Effects of audience status, evaluation, and time of action on performance with hidden-word problems. J Pers Soc Psychol. 27:74–85.
  • Corley JB, Mason RL. 1976. A study on the effectiveness of one-way mirrors. J Med Educ. 51:62–63.
  • Dauphinee D, Fabb W, Jolly B, Langsley D, Wealthall S, Procopis P. 1994. Determining the content of certification examinations. The certification and recertification of doctors: issues in the assessment of clinical competence. Cambridge: Cambridge University Press; p. 92–104.
  • Dillon GF, Boulet JR, Hawkins RE, Swanson DB. 2004. Simulations in the United States Medical Licensing Examination (USMLE). Qual Saf Health Care. 13:i41–i45.
  • Ford AE. 2008. The effects of two-way mirrors, video cameras, and observation teams on clients' judgements of the therapeutic relationship. Theses and Dissertations (All). 30. Indiana University of Pennsylvania, US. http://knowledge.library.iup.edu/etd/30.
  • Harden RM, Stevenson M, Downie WW, Wilson GM. 1975. Assessment of clinical competence using objective structured examination. Med Educ. 1:447–451.
  • Iramaneerat C, Yudkowsky R. 2007. Rater errors in a clinical skills assessment of medical students. Eval Health Prof. 30:266–283.
  • Miller GE. 1990. The assessment of clinical skills/competence/performance. Acad Med. 65:S63–S67.
  • Novack DH, Cohen D, Peitzman SJ, Beadenkopf S, Gracely E, Morris J. 2002. A pilot test of WebOSCE: a system for assessing trainees' clinical skills via teleconference. Med Teach. 24:483–487.
  • Rahayu GR, Suhoyo Y, Nurhidayah R, Hasdianda MA, Dewi SP, Chaniago Y, Wikaningrum R, Hariyanto T, Wonodirekso S, Achmad T. 2016. Large-scale multi-site OSCEs for national competency examination of medical doctors in Indonesia. Med Teach. 38:801–807.
  • Reznick RK, Blackmore D, Dauphinee WD, Rothman AI, Smee S. 1996. Large-scale high-stakes testing with an OSCE: Report from the Medical Council of Canada. Acad Med. 71:S19–S21.
  • Sauro J. 2016. Reflecting on the one-way mirror. [accessed 2018 Sep 6]. https://measuringu.com/one-way-mirror/.
  • Short MW, Jorgensen JE, Edwards JA, Blankenship RB, Roth BJ. 2009. Assessing intern core competencies with an objective structured clinical examination. J Grad Med Educ. 1:30–36.
  • Sturpe DA, Huynh D, Haines ST. 2010. Scoring objective structured clinical examinations using video monitors or video recordings. Am J Pharm Educ. 74:44.
  • Trewby PN. 2005. Assisting international medical graduates applying for their first post in the UK: what should be done? Clin Med. 5:126–132.
  • Wass V, Van Der Vleuten C, Shatzer J, Jones R. 2001. Assessment of clinical competence. Lancet. 357:945–949.
  • Whelan G. 2000. High-stakes medical performance testing: The Clinical Skills Assessment program. J Am Med Assoc. 283:1748.