
Adapting the McMaster-Ottawa scale and developing behavioral anchors for assessing performance in an interprofessional Team Observed Structured Clinical Encounter

Article: 26691 | Received 20 Nov 2014, Accepted 07 Apr 2015, Published online: 22 May 2015

Abstract

Background

Current scales for interprofessional team performance do not provide adequate behavioral anchors for performance evaluation. The Team Observed Structured Clinical Encounter (TOSCE) provides an opportunity to adapt and develop an existing scale for this purpose. We aimed to test the feasibility of using a retooled scale to rate performance in a standardized patient encounter and to assess faculty ability to accurately rate both individual students and teams.

Methods

The 9-point McMaster-Ottawa Scale developed for a TOSCE was converted to a 3-point scale with behavioral anchors. Students from four professions were trained a priori to perform in teams of four, both as individuals and as teams, at three different levels. Blinded faculty raters were trained to use the scale to evaluate individual and team performances. G-theory was used to analyze the ability of faculty to accurately rate individual students and teams using the retooled scale.

Results

Sixteen faculty, in groups of four, rated four student teams, each participating in the same TOSCE station. Faculty expressed comfort rating up to four students in a team within a 35-min timeframe. Accuracy of faculty raters varied (38–81% individuals, 50–100% teams), with errors in the direction of over-rating individual, but not team performance. There was no consistent pattern of error for raters.

Conclusion

The TOSCE can be administered as an evaluation method for interprofessional teams. However, faculty demonstrate a ‘leniency error’ in rating students, even with prior training using behavioral anchors. To improve consistency, we recommend two trained faculty raters per station.

Interprofessional education (IPE), defined most commonly as ‘occasions when two or more professions learn with, from and about each other to improve collaboration and the quality of care’ (Citation1) has received increasing attention in health sciences education. Models of IPE delivery within undergraduate and graduate education involving up to six professions have been reported (Citation2, Citation3). These models include the use of patient simulations for teaching (Citation4, Citation5). Guidelines for curricula to teach desired IPE competencies have proliferated in recent years (Citation6–Citation9). However, various reviews (Citation10–Citation13) consistently emphasize the need for theoretical frameworks to underpin IPE outcomes research design, to address the inherent complexity of IPE and the influence of learners, curriculum format and timing, faculty abilities and organizational context on learning (Citation14). IPE outcomes research has focused on changes in learner attitudes, knowledge, and collaborative behaviors, mostly in the short term (Citation11). There remains a need for standard-setting and tools that accurately measure and reflect student performance in teams that have potential to be applied to clinical practice settings (Citation15).

Assessment tools that are currently used include attitude measures such as the Readiness for Interprofessional Learning Scale (Citation16) or the Interdisciplinary Education Perception Scale (Citation17), and tools such as the TeamSTEPPS communication behaviors and assessment instruments (Citation18, Citation19), all relying on self-assessment. Validated tools that allow independent observer ratings based on objective assessment and documentation of individual and team behaviors are needed to add rigor to the evaluation process.

The Objective Structured Clinical Examination (OSCE) has been used in health professions education as a valid and reliable method for assessing student knowledge and skills through structured observation and the use of standardized patients (SPs) and observer checklists (Citation20, Citation21). The United States Medical Licensing Examination Step 2 Clinical Skills examination has used SPs to ‘test medical graduates on their ability to gather information from patients, perform physical examinations, and communicate their findings to patients and colleagues’ (www.usmle.org/step-2-cs/). For interprofessional learning, the 9-point McMaster-Ottawa scale (Citation22) was developed as a checklist with the purpose of allowing observing raters to assess team and individual performance using six core IPE constructs. These constructs are communication, collaboration, roles and responsibilities, collaborative patient-centered approach, conflict management, and team functioning (Citation23, Citation24). The face and content validity of the Team Observed Structured Clinical Encounter (TOSCE) was established and 10 TOSCE topics were selected for development (Citation25, Citation26). The TOSCE purports to evaluate individual and team performance in settings ranging from maternity (Citation27, Citation28) to palliative care (Citation29). However, we were unable to find specific behavioral anchors for rating individual and team behaviors; this creates a challenge for educators attempting to apply the scale in either a standardized simulated or a real clinical setting.

We, therefore, conducted a study to develop standardized behavioral anchors for faculty to rate individual students and interprofessional team performance, using the six McMaster-Ottawa constructs; and to train faculty to use the scale. Our two aims were first, to assess the feasibility of using the retooled scale in a TOSCE setting; and second, to evaluate the ability of faculty raters to use the retooled scale to accurately distinguish different levels of student and team performance. We hypothesized that faculty raters would be able to accurately rate up to four students within an IPE team as well as overall team performance in a 35-min encounter. We also hypothesized that faculty would be able to identify high and low-performing individuals and teams but would have greater difficulty accurately discriminating levels of individual performance in teams with mixed individual performance levels.

The Institutional Review Board of the University of Southern California approved the study.

Methods

Study setting

Our study was conducted at the health science campus of a single institution (the University of Southern California) located in urban Los Angeles, California.

Study participants

Participants were 16 volunteer faculty members representing dentistry, medicine, occupational therapy, pharmacy, and physician assistant professions with experience teaching and assessing students, and no prior experience with IPE assessment. Faculty members were trained as raters immediately prior to the TOSCE administration and were blinded to the purpose of the study, as well as student and IPE team performance levels. We trained volunteer students from a student-run IPE clinic in teams to perform at different levels to assess how well the scale allowed trained, blinded faculty raters to discriminate among the different performance levels. Four SPs were recruited from a database of experienced SP actors to perform at TOSCE stations.

Development of behavioral anchors

Three authors (DL, WM, and RR) examined the six constructs from the McMaster-Ottawa scale and the descriptors associated with each. They determined a priori that it was not feasible to develop anchors for the original 9-point scale as it was extremely difficult to distinguish and describe nine different levels of behaviors for individual students. Through an iterative process of discussion, consensus-building and review, three levels of performance were judged as capable of being distinguished. Level 3 was defined as the highest, or ‘above expected’; 2 as the intermediate, or ‘at expected’; and 1 as the lowest, or ‘below expected’ level. A detailed description of observable teamwork behaviors for each level of individual performance was created, with a final total of 18 (6×3) non-overlapping behavior categories (Table 1).

Table 1 Modified McMaster-Ottawa scale for rating individual students, with instructions for 3-point scoring, Keck School of Medicine of the University of Southern California, 2014
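As a rough illustration of the structure described above (six constructs, each with three anchored levels, giving 18 behavior categories), the following Python sketch shows one way such a rubric could be represented; the construct names follow the six McMaster-Ottawa constructs, while the anchor wording is abbreviated and hypothetical, not the published Table 1 text.

```python
# A minimal sketch of the retooled scale's structure: six constructs,
# each rated 1 ("below expected"), 2 ("at expected"), or 3 ("above expected").
# Anchor wording below is abbreviated/illustrative, not the published Table 1 text.

CONSTRUCTS = [
    "communication",
    "collaboration",
    "roles_and_responsibilities",
    "collaborative_patient_centered_approach",
    "conflict_management",
    "team_functioning",
]

LEVELS = {1: "below expected", 2: "at expected", 3: "above expected"}

# 6 constructs x 3 levels = 18 non-overlapping behavior categories
rubric = {
    construct: {level: f"{LEVELS[level]} behaviors for {construct}" for level in LEVELS}
    for construct in CONSTRUCTS
}

assert sum(len(levels) for levels in rubric.values()) == 18
```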

The same authors developed anchors for the team rating to evaluate team-level performance separate from individual-level performance. We based anchors on factors reported as associated with better patient outcomes (Citation30). Effective team performance was evaluated based on the perception of the level of care afforded the patient due to the team acting as an integrated whole (Table 2).

Table 2 Modified McMaster-Ottawa scale for rating teams, with instructions for 3-point scoring, Keck School of Medicine of the University of Southern California, 2014

Study design and TOSCE administration

This was an exploratory feasibility study of scale development and implementation. One TOSCE station (case available on www.fhs.mcmaster.ca/tosce/en/tosce_stations.html) was selected and modified by consensus agreement among the authors representing the four student professions involved (medicine, physician assistant, pharmacy, and occupational therapy). The case selected (stroke) was deemed to be at an appropriate difficulty level to involve all four professions and capable of testing team and individual behaviors. The case was that of a hospitalized rehabilitating patient with hemiplegic stroke who now requests discharge 1 week after admission. Case instructions required the students to ‘use skills specific to your own discipline and knowledge of others on your healthcare team, to assess the patient's needs and develop a care plan for him’. The team communicated only with the patient, who was in a wheelchair and had spousal support at home; the spouse was not present for the encounter. The timeframe of 35 min for the station was based on the published recommendation (www.fhs.mcmaster.ca/tosce/en/toolkit_guidelines.html).

Our focus was on potential differences among faculty in rating students and teams, so it was imperative that we distinguish variation in student scores attributable to raters from variation attributable to station differences. Given the constraints of available faculty time (4 hours) and the length of each TOSCE station (35 minutes), limiting the study to one station (stroke) allowed us to determine variation due to raters alone. We anticipate that future research will examine whether station differences affect faculty ratings of students and teams.

One week before the TOSCE was administered, the four student teams (teams A, B, C and D) were trained by three authors (DL, CF, KL) to perform at different skill levels. The students portrayed health professions trainees at the beginning of their clinical training. Team A consisted of four level 3 (above expected) students, team B consisted of two level 3 students and two level 2 (at expected) students, team C consisted of two level 1 (below expected) students and two level 2 students, and team D consisted of three level 1 students and one level 2 student. In each team, the lowest-performing student was chosen to be from a different profession. Team A was trained to portray a team functioning ‘above expected’, team B ‘at expected’, team C ‘at expected’ and team D ‘below expected’. Training of students occurred over 3 h with the use of the retooled behavioral anchors (Tables 1 and 2) and video demonstrations, followed by practice and feedback from other team members and trainers. Students practiced until the trainers were able to distinguish levels of performance in a mock patient encounter. The faculty trainers did not participate as raters in the actual TOSCE.

Blinded faculty raters were told at recruitment that no prior experience for rating IPE team performance was necessary. They were not informed that students had been trained to perform at different levels of performance until after the TOSCE was completed. They received 60 min of training immediately prior to TOSCE administration. Training consisted of independent review of the retooled scale and anchors and group discussion, followed by a viewing of the same video demonstrations representing three different levels of performance that were shown to the trained student teams. Faculty trainers (DL, CF, KL) stressed that the rating scale assessed only performance related to team behaviors, and not the competency of the students within their own particular professions. Training was deemed to be completed when all 16 raters agreed on the performance level of students and teams shown in the videos. There were four faculty raters from different professions at each TOSCE station. Each rater remained at their one assigned station, thus rating all four teams (16 students representing all three levels of individual and team performance) that rotated through their station. The raters were instructed not to communicate with one another. Faculty observed teams without intrusion, and sat 8–12 feet away from the teams and SP. They were given 10 min to complete ratings after 35 min of observation. Of the 35-min encounter, 5 min were spent on the pre-huddle, 20 min with the SP and the final 10 min in a post-encounter debrief. The team pre-huddle and debrief took place in a room adjacent to the patient encounter. The faculty followed and observed the team during the entire 35 min while the SP had access to the team only during the 20 min of his case performance. A post-TOSCE survey was administered to raters regarding the feasibility of the TOSCE and its utility as a teaching and evaluation tool. At the end of the TOSCE, after all rating forms and surveys were collected, raters were debriefed and the ‘correct’ performance level of each team and student revealed. All encounters and team interactions were videotaped.

Data collection

Rating forms were completed in hard copy. Each rater completed 20 rating forms (16 for individual students and four for the four teams). Post-TOSCE surveys were collected from all raters. De-identified data were entered into Excel format.

Data analysis

Descriptive statistics were used to examine faculty ability to accurately distinguish students and teams performing below, at, and above expectation to assess the feasibility and utility of using such a scale for formative evaluation. For each faculty rater, we constructed student mean performance scores across the six competencies and compared those values to assigned student levels of performance. Individual and team scores and post-TOSCE survey responses were analyzed using SPSS and GENOVA (Citation31).
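To make the per-rater comparison concrete, here is a minimal Python sketch of scoring one student against an assigned level. The mapping of a mean score onto a whole performance level (simple rounding) is an assumption for illustration only; the paper compares mean scores with assigned levels but does not publish the exact rule, and the scores shown are hypothetical.

```python
import statistics

# Hypothetical example: one rater's scores for one student on the six constructs
# (1 = below expected, 2 = at expected, 3 = above expected).
rater_scores = {
    "communication": 3,
    "collaboration": 2,
    "roles_and_responsibilities": 3,
    "collaborative_patient_centered_approach": 2,
    "conflict_management": 3,
    "team_functioning": 3,
}
assigned_level = 3  # level the student was trained to portray

mean_score = statistics.mean(rater_scores.values())

# Illustrative mapping of the mean onto a performance level (assumed, not the authors' rule).
identified_level = min(3, max(1, round(mean_score)))

is_accurate = identified_level == assigned_level
print(f"mean={mean_score:.2f}, identified level={identified_level}, accurate={is_accurate}")
```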

A generalizability study (G-study) was conducted to examine variability in student scores due to faculty variation as opposed to other sources of error variation. Generalizability theory (G-theory) allows us to disentangle variation in student performance scores due to different sources of measurement error (Citation32, Citation33), such as those attributable to item, station, or rater, and the interactions between them. According to G-theory, variation in student TOSCE performance scores can be deconstructed into person (p) variation, or the variation in examinee ability; and error variation, due to various sources of measurement error, known as facets. Of interest to us, then, is the calculation of variation in scores, or variance components, attributable to each of these facets. Our G-study investigated the relative influence of faculty rater (r) as well as the interaction of person-by-rater (pr). Of particular interest to our study was the proportion of measurement error in student scores and in faculty accuracy, or ability to correctly identify student performance levels, attributable to trained raters.
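To illustrate the decomposition, the sketch below estimates variance components for a simplified, fully crossed person-by-rater (p × r) design from the expected mean squares of a two-way ANOVA. This is not the GENOVA analysis itself, and it omits the competency facet included in our full design; the score matrix is hypothetical.

```python
import numpy as np

def pxr_variance_components(scores: np.ndarray):
    """Estimate variance components for a crossed p x r design with one
    observation per cell (rows = persons, columns = raters), using the
    standard expected-mean-square equations of generalizability theory."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    residual = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr_e = np.sum(residual ** 2) / ((n_p - 1) * (n_r - 1))

    var_pr_e = ms_pr_e                       # person-by-rater interaction, confounded with error
    var_p = max((ms_p - ms_pr_e) / n_r, 0)   # person (examinee) variance
    var_r = max((ms_r - ms_pr_e) / n_p, 0)   # rater (leniency/stringency) variance
    return {"person": var_p, "rater": var_r, "person_x_rater,error": var_pr_e}

# Hypothetical 6 persons x 4 raters matrix of mean construct scores (1-3 scale)
scores = np.array([
    [2.8, 3.0, 2.7, 2.9],
    [2.2, 2.0, 2.3, 2.5],
    [1.3, 1.8, 1.5, 2.0],
    [3.0, 2.8, 3.0, 2.7],
    [2.0, 2.2, 1.8, 2.3],
    [1.2, 1.5, 1.7, 1.3],
])
print(pxr_variance_components(scores))
```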

Results

TOSCE administration and feasibility

All 16 faculty raters received 60 min of pre-TOSCE training, continuing until they reported sufficient familiarity with the scale anchors to begin actual rating. Faculty blinding was successful for 13 raters; the remaining three suspected some student pre-training after observing two teams, but reported afterwards that they simply continued rating without any effect on their perception of student or team performance. The individual students and teams were observed on remote cameras and were deemed to be performing at their assigned levels by faculty trainers, who rated their performance and provided feedback as needed between station performances. All 16 raters were able to complete five ratings per encounter within the allotted time. A total of 320 rating forms were collected. There were no significant logistical issues.

The post-TOSCE survey response rate was 100% (N=16). Items were rated on a scale of 1 (strongly agree) to 5 (strongly disagree); percentages below reflect faculty who agreed or strongly agreed. Faculty believed they had adequate time to rate a maximum of four students per station (94%). Faculty agreed that the TOSCE was useful for assessing individual (81%) and team (81%) performance, that the experience made them more competent to rate team skills (81%), and that the TOSCE should be offered as part of IPE curricula (69%). Despite their training, faculty expressed limited confidence in their rating scores for individuals (50% agreed/strongly agreed that they were confident), but greater confidence in their scores for teams (75%). Some commented on a need for more training and a simpler rating form.

Faculty rating ability

Though 16 faculty participated, subsequent analysis of the data utilized scores from only 15; data from one rater were excluded due to failure to follow directions. Four raters neglected to furnish scores on one or two competencies for some students. Results, however, did not change substantially when data from these raters were excluded; therefore, when constructing average student performance level scores, data from these raters were included. Table 3 displays faculty ability to correctly identify student performance levels.

Table 3 Correct and incorrect identification of student performance levels for the TOSCE by faculty rater, Keck School of Medicine of the University of Southern California, 2014

Some faculty were more accurate than others, as evidenced by the range in the number of correctly identified student performance levels, from 6 (38%) to 13 (81%) (Table 3). No faculty member correctly identified the performance level of all 16 students. The average numbers of students correctly and incorrectly identified by performance level revealed that correctly identifying students performing ‘below expected’ was the most difficult for faculty. In fact, more students portraying ‘below expected’ performance were, on average, scored by faculty as performing ‘at expected’ or even, in some instances, ‘above expected’ (M=2.7, or 54% of students) than at their correct performance level (M=2.3, or 46% of students). Faculty were on average more accurate in their designation of students performing at (M=3.6, or 72% of students) and above (M=3.9, or 65% of students) expectation. For team performance, individual faculty accurately rated 50–100% of team performances. Faculty were more accurate in assessing the level of team performance for the high- and low-performing teams (88% correct for the ‘above expected’ team; 100% correct for the ‘below expected’ team) and less accurate for ‘at expected’ teams (50% correct, with 50% incorrectly rated as ‘below expected’).
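The denominators behind these averages follow from the team compositions described in the Methods: across the four teams, each rater observed five ‘below expected’, five ‘at expected’, and six ‘above expected’ students. A short check of the reported percentages, using the mean counts stated above:

```python
# Denominators per rater, from the team compositions in the Methods:
# below expected: 2 (team C) + 3 (team D) = 5
# at expected:    2 (team B) + 2 (team C) + 1 (team D) = 5
# above expected: 4 (team A) + 2 (team B) = 6
denominators = {"below": 5, "at": 5, "above": 6}

# Mean number of students correctly identified per level (reported in the text)
mean_correct = {"below": 2.3, "at": 3.6, "above": 3.9}

for level, n in denominators.items():
    print(f"{level} expected: {mean_correct[level]}/{n} = {mean_correct[level] / n:.0%}")
# below: 46%, at: 72%, above: 65% -- matching the reported values
```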

G-study findings and implications

We performed a G-study to examine the variation in student scores attributable to faculty alone and to the interaction of student and faculty. Table 4 displays estimated variance components of these various sources of measurement error, or facets, in student scores, and provides G-study results for a TOSCE involving one, two and four faculty raters. Because students were assigned specific levels of performance, it is important to note that we cannot draw any conclusions from these calculations about the variation in student ability captured by TOSCE scores. Though our calculations for a four-faculty TOSCE – in which each student is scored by four faculty raters – indicated that the level of student performance differed substantially between students, with over 80% of the total variance attributable to systematic differences between students, this variation is ‘manufactured’ because our trained students were assigned in nearly equal numbers to portray all three performance levels. Our calculations for a one-station TOSCE involving four faculty rating students on six competencies indicated that a small percentage (nearly 4%) of variation in student scores was attributable to faculty rater (0.01058), indicating that, compared to one another, no faculty rater was more lenient or strict than another. A very small percentage (0.15%) of variation was attributable to competency (0.00042), indicating that the six competency categories were equally difficult for students. We attributed a larger proportion (about 11%) of the variance in scores to the interaction between person, or student, and rater (0.03061), suggesting that the relative standing of students may vary from rater to rater. In a TOSCE involving two raters, the percent of total variance attributable to the interaction of student and rater was, as expected, even higher (0.06122), at about 18%.

Table 4 Estimated variance components for student performance scores on TOSCE, Keck School of Medicine of the University of Southern California, 2014
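The dependence of error on the number of raters follows directly from a decision (D) study: the person-by-rater error variance for a score averaged over n raters is the single-rater component divided by n. The sketch below shows this projection, together with the norm-referenced generalizability coefficient; the variance components used are hypothetical placeholders, not the Table 4 estimates.

```python
# D-study sketch: project relative error variance and the generalizability
# coefficient for TOSCE designs with different numbers of raters.
# Variance components below are hypothetical, not the Table 4 estimates.
var_person = 0.25          # sigma^2(p): systematic differences between students
var_person_x_rater = 0.12  # sigma^2(pr,e): single-rater interaction/error component

for n_raters in (1, 2, 4):
    rel_error = var_person_x_rater / n_raters               # relative error variance for the mean score
    g_coefficient = var_person / (var_person + rel_error)   # E(rho^2), norm-referenced reliability
    share_of_total = rel_error / (var_person + rel_error)
    print(f"{n_raters} rater(s): relative error = {rel_error:.4f}, "
          f"error share = {share_of_total:.0%}, g = {g_coefficient:.2f}")
```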

We also conducted a G-study to examine variation in faculty ability to correctly identify student performance levels using faculty accuracy scores, based on the comparison of faculty average student scores to assigned student performance levels. Faculty were either ‘correct’ or ‘incorrect’ in their assessment of student performance level. Table 5 displays these results. In this analysis, our calculations for a four-team TOSCE, in which students are ‘nested’ within teams, showed variation in faculty ability to accurately score student performance. Nearly 25% of the total variance in faculty accuracy scores was attributable to systematic differences between faculty raters. A moderate percentage of variation in faculty accuracy was attributable to the interaction of faculty and team (0.00487, or about 19%), indicating that the relative accuracy of faculty raters may vary from student team to student team. Additionally, there was a large percentage (nearly 34%) of variation in faculty accuracy attributable to the interaction of faculty rater and student nested within team (s:t), commingled with random error (0.00883). These results reaffirmed the need to address the potential impact of faculty–student and faculty–team interactions on performance scores when administering the TOSCE.

Table 5 Estimated variance components for faculty ability to correctly identify student performance level on TOSCE, Keck School of Medicine of the University of Southern California, 2014

Discussion

We conducted a study to examine the feasibility of conducting a TOSCE using a retooled McMaster-Ottawa scale with behavioral anchors to standardize observer ratings. We offered near-ideal conditions under which the scale could perform, by providing variability across all three levels of performance among the students and teams and by pre-training faculty to rate with the retooled scale. We found that students and teams could be rated by trained faculty within a 35-min encounter. Consistent with our hypothesis, faculty were able to distinguish the lowest and highest levels of performance for both individuals and teams. We found that errors in rating students tended to occur in the direction of over-rating student performance. In other words, faculty tended to assign higher levels of performance even when observing lowest-level performance behaviors; that is, they demonstrated the ‘leniency error’ documented in other evaluation studies (Citation34, Citation35). To reduce such errors in real-life assessment, we recommend either Rater Error Training or Frame-of-Reference Training, with an emphasis on an increased number of observations, especially for lower-performing students (Citation36). Rater Error Training seeks to improve the accuracy of ratings by identifying and decreasing common ‘rater biases’ or ‘rater errors’ due to factors such as leniency or central tendency. Frame-of-Reference Training uses a reference point to align the rater's scores with the ratees’ true scores, and relies on the content rather than the process of rating to reduce rater bias.

In addition, other studies (Citation37, Citation38) found that observers had difficulty distinguishing among 11 team competencies and recommended that researchers use the simplest factor structure when assessing teamwork. In our TOSCE, there were six team competencies, which could have contributed to the challenge of accurate rating. Future studies using more stations and raters should permit factor analysis with the aim of further simplifying the scale structure. Some of our variation in faculty ability to accurately assess individual-level performance may also have been due to inadequate rater training. We found that having more than one rater increased rating reliability. This is similar to the findings of Hull et al. (Citation39), in which high inter-observer agreement was reached with two trained raters using the Observational Teamwork Assessment for Surgery, which covers five teamwork behaviors.

In our study, students were assigned in nearly equal numbers to portray all three performance levels, leading to an unusually high level of variation in student ability. Were we to administer the TOSCE to students in the real world, we would very likely not achieve similar results in terms of faculty discrimination. The student–rater variance we found (about 11% with four raters and 18% with two) suggests that, to ensure adequate reliability, we would likely need more than one faculty rater at each station were we to administer the TOSCE to untrained (i.e., real-world) students.

We purposefully limited our study to assessing faculty rating accuracy, excluding the effect of the clinical station on the retooled scale, as a step toward more rigorous examination of the scale in the real-world setting. Our study has several strengths. One is that the quality of student performance was tightly controlled by training and by observation of performance during the TOSCE. Another is the use of G-theory to examine relative sources of error in student performance scores. Although three of the blinded raters guessed that students had been pre-assigned to perform at different levels, their ratings were not influenced by this suspicion. One limitation is that one-third of the students in our study portrayed the lowest performance level, a proportion much higher than usually seen in health professions education. Another limitation is the small number of raters and teams, due to the constraint of completing the study within a 4-h timeframe. Future research should examine the impact of station differences on rating accuracy and involve larger numbers of faculty raters, including raters from other professions.

Conclusion

Use of the adapted TOSCE scale with behavioral anchors is feasible when administered to an interprofessional team of up to four students. Pre-training faculty with the behavioral anchors enables evaluation of individual and team performance. We recommend that a team of at least two trained faculty raters be assigned per station, to more accurately rate individuals, and that more focused training be provided to address the tendency for faculty to avoid scoring students poorly.

Conflict of interest and funding

The authors declare no conflict of interest associated with this study. This project was supported by the Health Resources and Services Administration (HRSA) of the US Department of Health and Human Services (HHS) under grant #D57HP23251, Physician Assistant Training in Primary Care, 2011–2016.

Disclosure

The information, content and conclusions are those of the authors and should not be construed as the position or policy of the HRSA, HHS or US Government.

Acknowledgements

We are deeply grateful to participating students and faculty for their time and contribution. We are indebted to Drs. Denise Marshall, Beth Murray-Davis, and Sheri Burns of McMaster University, Canada, for their inspiration, guidance, and feedback, and to Dr. Cha Chi Fung for manuscript review.

References

  • World Health Organization. Learning together to work together for health: report of a WHO study group on multiprofessional education for health personnel: the team approach. Geneva: World Health Organization Technical Report Series, No. 769; 1988.
  • Buckley S, Hensman M, Thomas S, Dudley R, Nevin G, Coleman J. Developing interprofessional simulation in the undergraduate setting: experience with five different professional groups. J Interprof Care. 2012; 26: 362–9.
  • Pinto A, Lee S, Lombardo S, Salama M, Ellis S, Kay T, et al. The impact of structured inter-professional education on health care professional students’ perceptions of collaboration in a clinical setting. Physiother Can. 2012; 64: 145–56.
  • Symonds I, Cullen L, Fraser D. Evaluation of a formative interprofessional team objective structured clinical examination (ITOSCE): a method of shared learning in maternity education. Med Teach. 2003; 25: 38–41.
  • Simpson D, Helm R, Drewniak T, Ziebert M, Brown D, Mitchell J, et al. Objective Structured Video Examinations (OSVEs) for geriatrics education. Gerontol Geriatr Educ. 2006; 26: 7–24.
  • Royal College of Physicians and Surgeons of Canada. Interprofessional education and training in the United States: resurgence and refocus. 2011. Available from: http://rcpsc.medical.org/publicpolicy/imwc/Interprofessional_Education_US_Brandt_Schmitt.PDF [cited 10 February 2015].
  • World Health Organization. Framework for action on interprofessional education & collaborative practice. 2010. Available from: http://whqlibdoc.who.int/hq/2010/WHO_HRH_HPN_10.3_eng.pdf [cited 10 February 2015].
  • Canadian Interprofessional Health Collaborative. A national interprofessional competency framework. 2010. Available from: www.cihc.ca/files/CIHC_IPCompetencies_Feb1210.pdf [cited 10 February 2015].
  • Interprofessional Education Collaborative Expert Panel. Core competencies for interprofessional collaborative practice: report of an expert panel. 2011; Washington, DC: Interprofessional Education Collaborative.
  • Zwarenstein M, Goldman J, Reeves S. Interprofessional collaboration: effects of practice-based interventions on professional practice and healthcare outcomes. Cochrane Database Syst Rev. 2009; 3: CD000072.
  • Reeves S, Zwarenstein M, Goldman J, Barr H, Freeth D, Koppel I, et al. The effectiveness of interprofessional education: key findings from a new systematic review. J Interprof Care. 2010; 24: 230–41.
  • Thistlethwaite J. Interprofessional education: a review of context, learning and the research agenda. Med Educ. 2012; 46: 58–70.
  • Olson R, Bialocerkowski A. Interprofessional education in allied health: a systematic review. Med Educ. 2014; 48: 236–46.
  • Cooper H, Geyer R. Using ‘complexity’ for improving educational research in health care. Soc Sci Med. 2008; 67: 177–82.
  • Curran V, Casimiro L, Banfield V, Hall P, Lackie K, Simmons B, et al. Research for Interprofessional Competency-Based Evaluation (RICE). J Interprof Care. 2009; 23: 297–300.
  • McFadyen A, Webster V, Maclaren W. The test-retest reliability of a revised version of the Readiness for Interprofessional Learning Scale (RIPLS). J Interprof Care. 2006; 20: 633–9.
  • McFadyen A, Maclaren W, Webster V. The Interdisciplinary Education Perception Scale (IEPS): an alternative remodeled sub-scale structure and its reliability. J Interprof Care. 2007; 21: 433–43.
  • Brock D, Abu-Rish E, Chiu C, Hammer D, Wilson S, Vorvick L, et al. Interprofessional education in team communication: working together to improve patient safety. BMJ Qual Saf. 2013; 22: 414–23.
  • TeamSTEPPS. Team strategies and tools to enhance performance and patient safety. Available from: http://www.collaborate.uw.edu/educators-toolkit/tools-for-evaluation.html-0 [cited 18 February 2015].
  • Miller GE. The assessment of clinical skills/competence/performance. Acad Med. 1990; 65: S63–7.
  • Regehr G, Freeman R, Hodges B, Russell L. Assessing the generalizability of OSCE measures across content domains. Acad Med. 1999; 74: 1320–2.
  • The McMaster-Ottawa Team Observed Structured Clinical Encounter (TOSCE). McMaster/Ottawa TOSCE toolkit: guidelines for conducting the McMaster-Ottawa TOSCE within your practice. 2010. Available from: http://fhs.mcmaster.ca/tosce/en/toolkit_guidelines.html [cited 10 February 2015].
  • Marshall D, Hall P, Taniguchi A. Team OSCEs: evaluation methodology or educational encounter? Med Educ. 2008; 42: 1129–30.
  • Simmons B, Egan-Lee E, Wagner S, Esdaile M, Baker L, Reeves S. Assessment of interprofessional learning: the design of an interprofessional objective structured clinical examination (iOSCE) approach. J Interprof Care. 2010; 20: 1–2.
  • Solomon P, Marshall D, Boyle A, Casimiro LM, Hall P, Weaver L. Establishing face and content validity of the McMaster-Ottawa Team Observed Structured Clinical Encounter (TOSCE). J Interprof Care. 2011; 25: 302–4.
  • Singleton A, Smith F, Harris T, Ross-Harper R, Hilton S. An evaluation of the Team Objective Structured Clinical Examination (TOSCE). Med Educ. 1999; 33: 34–41.
  • Cullen L, Fraser D, Symonds I. Strategies for interprofessional education: the Interprofessional Team Objective Structured Clinical Examination for midwifery and medical students. Nurse Educ Today. 2003; 23: 427–33.
  • Murray-Davis B, Solomon P, Malott A, Marshall D, Mueller V, Shaw E, et al. A Team Observed Structured Clinical Encounter (TOSCE) for pre-licensure learners in maternity care: a short report on the development of an assessment tool for collaboration. J Res Interprof Pract Educ. 2013; 3: 122–8.
  • Hall P, Marshall D, Weaver L, Boyle A, Taniguchi A. A method to enhance student teams in palliative care: piloting the McMaster-Ottawa Team Observed Structured Clinical Encounter. J Palliat Med. 2011; 14: 744–50.
  • Rosen MA, Weaver SJ, Lazzara EH, Salas E, Wu T, Silvestri S, et al. Tools for evaluating team performance in simulation training. J Emerg Trauma Shock. 2010; 3: 353–9.
  • Crick GE, Brennan RL. GENOVA: a generalized analysis of variance system [FORTRAN IV computer program and manual]. 1982; Dorchester, MA: Computer Facilities, University of Massachusetts at Boston.
  • Brennan RL. Generalizability theory. 2001; New York, NY: Springer.
  • Richter RA, Lagha MA, Boscardin CK, May W, Fung CC. A comparison of two standard-setting approaches in high-stakes clinical performance assessment using generalizability theory. Acad Med. 2012; 87: 1077–82.
  • Iramaneerat C, Yudkowsky R. Rater errors in a clinical skills assessment of medical students. Eval Health Prof. 2007; 30: 266–83.
  • McManus IC, Thompson M, Mollon J. Assessment of examiner leniency and stringency (‘hawk–dove effect’) in the MRCP (UK) clinical examination (PACES) using multi-facet Rasch modeling. BMC Med Educ. 2006; 6: 42.
  • Feldman M, Lazzara EH, Vanderbilt AA, DiazGranados D. Rater training to support high-stakes simulation-based assessments. J Contin Educ Health Prof. 2012; 32: 279–86.
  • Baker DP, Salas E, King H, Battles J, Barach P. The role of teamwork in the professional education of physicians: current status and assessment recommendations. Jt Comm J Qual Patient Saf. 2005; 31: 185–202.
  • Smith-Jentsch KA, Johnston JH, Payne SC, Cannon-Bowers JA, Salas E. Measuring team-related expertise in complex environments. In: Making decisions under stress: implications for individual and team training. 1998; Washington, DC: American Psychological Association, 61–87.
  • Hull L, Arora S, Kassab E, Kneebone R, Sevdalis N. Observational teamwork assessment for surgery: content validation and tool refinement. J Am Coll Surg. 2011; 212: 234–43.