Research Article

Design, validation and dissemination of an undergraduate assessment tool using SimMan® in simulated medical emergencies

Zoe Paskins, Jo Kirkcaldy, Maggie Allen, Colin Macdougall, Ian Fraser & Ed Peile
Pages e12-e17 | Received 31 Oct 2008, Accepted 22 Jul 2009, Published online: 22 Jan 2010

Abstract

Background: Increasingly, medical students are being taught acute medicine using whole-body simulator manikins.

Aim: We aimed to design, validate and make widely available two simple assessment tools to be used with Laerdal SimMan® for final year students.

Methods: We designed two scenarios with criterion-based checklists focused on the assessment and management of two medical emergencies. Members of faculty critiqued the assessments for face validity, and the checklists were revised accordingly. We assessed three groups with different experience levels: Foundation Year 2 doctors, and third and final year medical students. Differences between groups were analysed, and internal consistency and interrater reliability were calculated. A generalisability analysis was conducted using scenario and rater as facets in the design.

Results: No more than two items were removed from either checklist following the initial survey. Scores differed significantly between the three experience groups for both scenarios (p < 0.001). Interrater reliability was excellent (r > 0.90). Internal consistency was poor (α < 0.5). The generalisability study results suggest that four cases would provide reliable discrimination between final year students.

Conclusions: These assessments proved easy to administer and we have gone some way to demonstrating construct validity and reliability. We have made the material available on a simulator website to enable others to reproduce these assessments.

Introduction

Since Tomorrow's Doctors was published, undergraduate medical education has changed to place more emphasis on clinical skills and to increase preparedness for the junior doctor's role (General Medical Council 2003). The Acute Care Undergraduate Teaching (ACUTE) initiative was published in 2005 (Perkins et al. 2005) in response to a number of publications raising concerns about the care of acutely ill patients (McQuillan et al. 1998; Franklin & Mathew 2002; Hodgetts et al. 2002; Cullinane et al. 2005). This report details competencies in the care of acutely ill patients, which the group suggests should be integrated into undergraduate curricula.

Undergraduate acute care skills are most commonly assessed by ‘paper simulation’; however, written examinations are more likely to test knowledge alone rather than the complex integration of applied knowledge with clinical skills and problem-solving ability. Simulator manikins can be used for observation-based competence assessments to enable a higher level of Miller's pyramid to be assessed: ‘shows how’ (Miller 1990).

Simulator manikins are being used increasingly in undergraduate education (Bradley 2006). These manikins vary in sophistication and technical detail (ranging from low to high fidelity), but most are able to reproduce the haemodynamics of the critically ill patient, making them ideally suited to teaching acute care skills. Simulators are also ideally placed for evaluating students’ acute care skills: the environment is safe, assessments can be easily standardised and, importantly, the assessment setting may have more authenticity than traditional assessment methods (Schuwirth & Van der Vleuten 2003). Furthermore, with increasing student numbers, the need to develop assessment methods that do not involve patients is great (Maran & Glavin 2003; Bradley 2006).

There is much in the literature concerning the reliability and validity of assessment tools using simulators, mostly in the anaesthetic field. A 2001 review of 13 papers reporting the design of assessment tools for doctors using high-fidelity simulators was critical of the reliability and validity evaluations made (Byrne & Greaves 2001). Since this review, further studies using high-fidelity simulators have reported the reliability and validity of checklist assessments in anaesthetics and medical emergencies (Morgan & Cleave-Hogg 2000b; Morgan et al. 2001b; Murray et al. 2002; Boulet et al. 2003; Gordon et al. 2003; Morgan et al. 2004). Previous studies of undergraduate assessments in this area have reported that checklist assessments are associated with high interrater reliability and have demonstrable construct validity, determined by assessing differing experience levels (Devitt et al. 1998; Morgan & Cleave-Hogg 2000a; Devitt et al. 2001; Morgan et al. 2001b; Murray et al. 2002; Boulet et al. 2003; Morgan et al. 2004). Convergent validity has not been established, in that simulator assessment results do not correlate well with other assessments, e.g. written examinations (Morgan et al. 2001b); this may be because written assessments test different constructs from those tested by simulator assessments.

More recently, the Laerdal SimMan® has become available; this simulator is of ‘moderate or medium’ fidelity, is lower in cost and, according to the manufacturer, holds 90% of the UK market share. The SimMan® has many features similar to those of high-fidelity models, but may be less suited to certain scenarios, e.g. neurological emergencies, since its pupils are non-reactive.

Designing and validating assessment tools is a time-consuming and lengthy process. Previous studies in this area, including one using SimMan® (Weller et al. 2004), are extremely useful as a framework on which to base further tool evaluation. However, to the authors’ knowledge, no previous work has made all the material (including software programmes) available for others to reproduce the assessments and thereby avoid duplicating the validation process for their own assessments.

Aim

The primary aim of this study was to develop a robust formative assessment that could be used to assess the acute care skills of final year medical students at the end of an Emergency Medicine attachment using the widely available SimMan®. The assessment tool was designed to operate with limited resources, so that it was feasible and practical to deliver to a reasonable number of students (on average, 15) rotating every 3 weeks. Our secondary aim was to disseminate the results and tools, including checklists and pre-programmed software, so that other centres could easily make use of our assessment material.

Methods

Simulator and setting

We used the Laerdal SimMan®. This medium-fidelity simulator is a life-size manikin that breathes, talks, has palpable pulses, audible chest, heart and bowel sounds, and is connected to a monitor for displaying oxygen saturations, ECG trace, pulse rate and blood pressure. The manikin is connected to a computer, and the assessment scenarios were pre-programmed for consistency; each time we used a scenario, the parameters (pulse, breath sounds, oxygen saturation, etc.) were the same. Furthermore, the software enables pre-programmed standard responses to student actions e.g. administering oxygen.
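
Purely to illustrate the idea of pre-programming (this sketch is not the Laerdal software interface, and the parameters and actions are hypothetical), a scenario can be thought of as fixed starting observations plus a deterministic table of standard responses to student actions:

    # Hypothetical sketch of a pre-programmed scenario: fixed starting
    # parameters plus standard responses triggered by student actions.
    # Not the Laerdal software interface; for illustration only.
    scenario = {
        "start": {"pulse": 120, "spo2": 88, "resp_rate": 28},
        "responses": {
            "administer_oxygen": {"spo2": +6},                       # saturations rise after oxygen
            "give_nebulised_salbutamol": {"pulse": +10, "spo2": +2},
        },
    }

    state = dict(scenario["start"])

    def apply_action(action):
        """Apply the scenario's standard response to a student action."""
        for param, delta in scenario["responses"].get(action, {}).items():
            state[param] += delta

    apply_action("administer_oxygen")
    print(state)  # the same action always produces the same response, every run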

The SimMan® is located in a clinical skills laboratory with appropriate ‘props’ such as oxygen masks, cannulation equipment and fluids. In addition, for the scenarios used, standardised ECGs and arterial blood gas results were available to students on request. Participants were given an identical structured introduction to SimMan® prior to the assessment. Two assessors (ZP and JK) were present for all assessments. One operated the software and provided the voice of SimMan® for history points. The other gave each student an introduction prior to the assessments and standard prompts during the assessment if necessary, and also acted as an assistant able to perform clinical observations and cannulate.

Instrument

We designed two scenarios based on the assessment and management of acute coronary syndrome (ACS) and acute severe asthma (AA), lasting approximately 10 min each; these emergencies were chosen as they were felt to be easily simulated using the Laerdal SimMan®. A criterion-based checklist was developed for each scenario. The items in each checklist included aspects of airway, breathing and circulation (ABC) assessment, eliciting pertinent history and examination findings, requesting and interpreting investigations, and initiating basic management steps. We designed the checklist content to correspond with the relevant objectives of the medical school curriculum and also the ACUTE initiative, for the two scenarios chosen (Perkins et al. 2005). The trust's Clinical Ethics Committee deemed that formal ethical approval was not necessary.

To establish face validity, the checklists and scenarios were circulated to 22 consultants involved in undergraduate teaching and emergency medicine, who were asked to indicate whether each task was appropriate for final year undergraduates. Consultants were also asked to rate the importance of each task on a three-point Likert scale; the mode of these ratings (1–3) was taken as the score for each task and used to weight it by importance. If more than 20% of the consultants felt a task was inappropriate, it was removed.
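
A minimal sketch of this weighting and removal rule (the tasks, ratings and flags below are invented for illustration, not the survey data):

    from statistics import mode

    # Hypothetical survey responses: each consultant rates a task's
    # importance (1-3) and may flag it as inappropriate for finalists.
    responses = {
        "administers high-flow oxygen": {"ratings": [3, 3, 2, 3, 3],
                                         "inappropriate": [False, False, False, False, False]},
        "requests 12 h troponin":       {"ratings": [2, 1, 2, 2, 1],
                                         "inappropriate": [True, True, False, False, False]},
    }

    def weight_or_remove(data, removal_threshold=0.20):
        """Return the item weight (modal rating), or None if the task is removed."""
        flagged = sum(data["inappropriate"]) / len(data["inappropriate"])
        if flagged > removal_threshold:   # >20% of consultants deemed it inappropriate
            return None
        return mode(data["ratings"])      # mode of the 1-3 ratings becomes the weight

    for task, data in responses.items():
        w = weight_or_remove(data)
        print(task, "-> removed" if w is None else f"-> weight {w}")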

The checklist included one aspect of timed assessment (time taken to assess ABC). The checklist was completed independently by both assessors (ZP and JK) for all candidates.

A pilot was run with 12 final year students, resulting in a number of minor changes: clinical information in the scenarios was changed slightly, as some details were ambiguous; checklists were modified to include standard prompts for the examiner to give if a task was not performed, e.g. ‘the oxygen saturations are still low’; and marking guidelines were produced to clarify items where scoring had been troublesome, e.g. medications for which students were expected to know both dose and route in order to score marks. In addition, the assessment was stopped if not completed after 10 min, since most candidates in the pilot had completed the test within this time. This was primarily to increase the feasibility of using the tool, but it also prevented an extremely slow candidate from scoring the same as an efficient one. The scoring system was changed to reflect the use of prompts, so that a student could score only half marks for a correct item if prompted.
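
To make the revised scoring rule concrete, a short sketch follows (items and weights are hypothetical, not the published checklist): an unprompted correct item scores its full weight, a prompted one scores half, and items not reached before the 10 min cut-off simply score nothing.

    # Hypothetical weighted checklist marking for one candidate.
    weights = {"assesses airway": 3, "gives oxygen": 3, "requests ECG": 2}

    def total_score(observations):
        """observations: (item, performed, prompted) tuples recorded by the assessor."""
        total = 0.0
        for item, performed, prompted in observations:
            if not performed:
                continue
            w = weights[item]
            total += w / 2 if prompted else w  # half marks for a correct item if prompted
        return total

    obs = [("assesses airway", True, False),
           ("gives oxygen", True, True),       # performed only after a standard prompt
           ("requests ECG", False, False)]     # not done before the cut-off
    print(total_score(obs))                    # 3 + 1.5 + 0 = 4.5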

Participants

To assess construct validity, both assessment tools were administered to three groups of volunteers with different experience levels: 20 third year (graduate entry, 4-year course) medical students, 18 final year medical students and 24 Foundation Year 2 doctors (FY2s). None of the medical students had any previous exposure to SimMan®. Participants received both assessments on the same day; the order was alternated so that 50% of each group received the AA scenario first. All participants were given detailed feedback on their performance by an assessor or an independent observer, either on the same day or three days after the assessment. Anonymity was subsequently maintained by using numbers to identify individuals.

Analysis

We assessed interrater reliability using the intraclass correlation. The internal consistency of both checklists was measured using Cronbach's alpha. Differences between the three experience groups were tested using one-way ANOVA. SPSS versions 12.0 and 15.0 were used for the statistical analyses. A generalisability analysis was conducted using GENOVA version 3.1.
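
The study itself used SPSS and GENOVA; as a hedged illustration of the same calculations on made-up numbers, the sketch below computes Cronbach's alpha from an item-score matrix, a Pearson correlation between the two raters' totals (a simple stand-in for the intraclass correlation actually reported) and a one-way ANOVA across the three experience groups:

    import numpy as np
    from scipy import stats

    def cronbach_alpha(items):
        """items: candidates x checklist-items score matrix."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
        total_var = items.sum(axis=1).var(ddof=1)     # variance of candidates' totals
        return k / (k - 1) * (1 - item_vars / total_var)

    # Hypothetical item-level scores (4 candidates x 3 items) for one checklist.
    scores = np.array([[1.0, 0.5, 1.0], [0.0, 0.5, 0.0], [1.0, 1.0, 1.0], [0.5, 0.0, 1.0]])
    print(f"alpha = {cronbach_alpha(scores):.3f}")

    # Hypothetical total scores given by the two raters to the same candidates.
    rater1 = np.array([25.0, 36.5, 47.0, 30.0, 41.5])
    rater2 = np.array([24.5, 37.0, 46.0, 31.0, 42.0])
    r, _ = stats.pearsonr(rater1, rater2)   # stand-in for the intraclass correlation
    print(f"interrater r = {r:.3f}")

    # One-way ANOVA across the three experience groups (hypothetical totals).
    year3, final_year, fy2 = [25, 27, 22, 26], [36, 39, 35, 38], [48, 50, 46, 49]
    f_stat, p = stats.f_oneway(year3, final_year, fy2)
    print(f"F = {f_stat:.2f}, p = {p:.5f}")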

Results

One item was removed from the ACS checklist and two from the AA checklist following the assessment of face validity. The final checklists used (after consultant survey and pilots) are available at http://simulation.laerdal.com/forum/files/folders/user_scenarios/default.aspx

The mean scores for the ACS assessment were 25.1, 36.2 and 47.9 for the third years, final years and FY2s, respectively (Figure 1), out of a maximum score of 67. The mean scores for the AA assessment were 28.6, 39.1 and 49.7, respectively (Figure 2), out of a maximum score of 72. The differences between all groups for both assessments were statistically significant (p < 0.001).

Figure 1. Acute Coronary Syndrome (ACS) Score across three groups of experience.

Figure 2. Acute Asthma (AA) Score across three groups of experience.

There was no significant difference in the sex distribution of the three groups (p = 0.495). Two of the FY2s had had brief exposure to SimMan® before. If these two individuals’ results were discounted as a possible source of bias, the mean FY2 scores were 47.76 (ACS) and 50.1 (AA), which remain significantly different from those of the other groups (p < 0.001).

The reliability measures are detailed in Table 1. Deletion of any single item from either checklist did not substantially improve Cronbach's alpha, and therefore no items were removed.

Table 1.  Reliability results: Interrater reliability and internal consistency

A generalisability analysis was conducted using rater and case (scenario) as facets in a two-facet crossed design. The three groups of experience level were analysed separately to minimise examinee variation. A summary of these results is tabulated in Table 2. The variance components represent error variance, and their magnitudes reveal the relative importance of the various sources of error (Mushquash & O’Connor 2006). Using the data from the generalisability (G) study, one can conduct a decision (D) study to evaluate the effectiveness of alternative designs with differing numbers of cases and raters; an example is shown in Table 3.

Table 2.  Variance component matrix from the generalisability analysis

Table 3.  D study examining the effect of increasing the number of cases and assessors on variance, for each study group of candidates
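
To illustrate how a D study projects reliability for alternative designs, the sketch below applies the standard formula for the relative G coefficient in a fully crossed persons × cases × raters design. The variance components are hypothetical, chosen only to show the shape of the calculation; the study's actual components are those in Table 2.

    # Relative G coefficient for a crossed persons x cases x raters design:
    #   G = var_p / (var_p + var_pc/n_c + var_pr/n_r + var_pcr/(n_c * n_r))
    # Hypothetical variance components (person, person x case, person x rater,
    # person x case x rater + residual), for illustration only.
    var_p, var_pc, var_pr, var_pcr = 40.0, 12.0, 0.5, 6.0

    def g_coefficient(n_cases, n_raters):
        error = var_pc / n_cases + var_pr / n_raters + var_pcr / (n_cases * n_raters)
        return var_p / (var_p + error)

    for n_cases in (1, 2, 4, 8):
        for n_raters in (1, 2):
            print(f"{n_cases} case(s), {n_raters} rater(s): G = {g_coefficient(n_cases, n_raters):.2f}")

With components of this shape, adding cases raises G far more than adding raters, mirroring the pattern behind the study's conclusion that four cases with one rater would suffice.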

Discussion

We have produced two instruments with demonstrable interrater reliability and face validity, and our finding of increased scores with experience supports construct validity. We have made the scenarios, assessment forms, history points, marking guidance and ‘props’ (ABGs, ECGs) available on the Laerdal Simulation User Network, http://simulation.laerdal.com/forum/files/folders/user_scenarios/default.aspx. Sharing assessment tools and scenarios among educators permits further assessment of reliability and validity, and encourages standardisation (Bond & Spillane 2002). Although there are ever-increasing resources available on the web for use with simulators, this is, to our knowledge, the first study for the Laerdal SimMan® to have reported the validation process and made available all the materials necessary to reproduce the assessments, particularly the programmed scenarios.

The assessments are feasible and easy to administer, requiring two members of staff, and we found it possible to assess a group of 15 students individually in 2½ h (one assessment scenario). There is some rationale for limiting time in checklist assessments of scenarios that in real life require both rapid clinical reasoning and performance of clinical skills: experts perform better on speeded-up sensorimotor tasks where attention to execution is limited, in contrast to novices, whose performance improves with additional time to attend to detail (Beilock et al. 2004).

The generalisability analysis shows that the largest variance component was for examinees, which is to be expected and is not a source of error (Mushquash & O’Connor 2006). The next largest variance component across all three groups was examinee × case, which indicates that the rank ordering of examinees differed across the two cases. The G coefficients reflect the reliability of the scores across raters and cases, and are reasonably close to the conventional threshold of 0.80 for third and fourth year medical students. The G coefficients are based on relative, rather than absolute, decisions; if absolute decisions are required, the reliability will be lower. Although this study contains small numbers for this type of analysis, the D study demonstrates that four cases with one rater would be required for a G coefficient of ≥0.8 for final year medical students, for whom the tool was designed. Boulet et al. (2003) found that student performance did not generalise well from one case to another, supporting the notion that multiple cases are necessary. However, technical limitations of the SimMan® may prevent the whole range of medical emergencies in the acute care curriculum from being sampled, e.g. neurological emergencies. Even when further cases have been designed and evaluated, it would still be unsafe to assume that achieving a G coefficient of ≥0.8 across all the cases would ensure that students had been robustly assessed on their ability to manage any emergency.

Students have valued having deficits in their ability to assess and treat acutely ill patients exposed, and welcomed the idea of using SimMan® in end-of-year assessments (MacDowall 2006). In aiming to further the summative adoption of these instruments, our study has a number of limitations. The first is the sampling bias of using only two scenarios, as discussed above. Second, the range of domains assessed within each scenario could have been expanded: for instance, we made no assessment of communication skills. While we acknowledge that simulation exercises are hugely important in teaching communication skills, we decided not to assess these in the interest of keeping our instrument simple, and because communication skills were not explicit in the objectives of our assessment. The nature of a checklist assessment prohibits the inclusion of complex cases, which may also be detrimental to content validity (Schuwirth & Van der Vleuten 2003). However, a balance must be sought between authenticity and feasibility, and the primary aim of this study was to produce a tool that is easy to administer. Standardisation across cases could have been improved by setting a fixed time before issuing the prompt for each item.

A further barrier to summative implementation may be inferred from the low Cronbach's alpha measures of internal consistency achieved in this study and others (Devitt et al. 1998; Morgan et al. 2004). Values >0.7 are desirable for high-stakes assessment, and most values in this study were below 0.5; furthermore, the reported values are likely to have been adversely affected by the large number of missing values for items, particularly among the third year students.

However, Cronbach's alpha should be interpreted with caution in checklist assessments where the items are not random parallel, i.e. not randomly sampled from the total possible number of items, and not truly independent (Cronbach 2004). In our assessment, there may have been more items representing ABC assessment, for example, than other domains, and performance on one item may have affected performance on another; these assumptions have therefore not been met. Furthermore, Cronbach (2004) indicated that alpha should not be used if a time limit has been set on a test, such that part of the scores may equal zero (due to running out of time). In our study, only one of the fourth years failed to complete one or both tests (compared with none of the FY2s and eight of the third years). Omitting the result of the fourth year student who failed to complete increased Cronbach's alpha slightly (α = 0.349). Murray et al. (2002) have been critical of previous researchers placing too much emphasis on item-item correlations in the assessment of internal consistency; to remove a test item on the basis of statistical results, without considering its clinical significance, may be to sacrifice validity for reliability.

Our interrater reliability was found to be excellent and comparable with other studies (Morgan & Cleave-Hogg 2000b; Morgan et al. 2001b, 2004). This was probably influenced by our production of prompt sheets for markers after the pilot study; similar observations have been noted in previous work (Murray et al. 2002). The exclusion of behavioural aspects such as communication skills, which are likely to be difficult to measure, is also likely to have increased the measured interrater reliability. Interrater reliability in this study is based on the results of two authors who were clearly intrinsically involved in the scenario design. We have, however, measured interrater reliability between one author and a member of faculty not involved in the study on a small group of students (11 final year students, AA checklist), with r = 0.890; this suggests that the interrater reliability could be generalised to other assessors. We could also have tested interrater reliability amongst ‘non-experts’: Boulet et al. (2003) found little difference between nurse clinicians and faculty members in rating students on criterion checklists. An advantage of using experts is that a global judgement of overall performance can be incorporated into the scoring strategy; this has been found to have reliability equivalent to checklists (Morgan et al. 2001a) and, as suggested, may yield more valid results (Boulet et al. 2003). Again, we elected not to do this, so that non-experts could rate, although this still needs evaluation.

Face validity was established among 22 of the teaching faculty. We could also have surveyed the students’ views of the assessment; other studies report positive evaluations (Morgan & Cleave-Hogg 2000b; Weller et al. 2004). Face validity is often overlooked, but it is intrinsically linked to a student's motivation in taking a test and is therefore important in assessment design (Guilford 1954). Clearly, the assessment itself should not be considered in isolation; it is the feedback which the student receives after the assessment that is key. Further formal evaluation of this feedback would add value to the design of the assessment process.

With the proviso that the raters were not blinded to the participants' level of experience, the scenarios and checklists can be said to measure a construct that increases with medical experience; since acute care skills would be expected to improve with experience, this is consistent with the tools measuring them. However, we cannot state from this study alone that we have measured the construct of acute care skills; further work such as confirmatory factor analysis would be required. Clearly, fully evaluating construct validity in the context of emergency medicine is problematic, not least due to the difficulty of defining the construct.

Conclusion

In conclusion, we have demonstrated the reliability and validity of two user-friendly assessment tools for use with the Laerdal SimMan®. With the dissemination of these results and of the materials necessary to reproduce the assessments, other medical schools may adopt these instruments for formative use. Further work to expand the range of scenarios may enable these assessments to be incorporated into summative examinations; our G study results suggest that four scenarios would provide a robust measure, although this needs further evaluation. Perhaps more importantly, the question remains whether simulator training makes better doctors; evaluating the predictive validity of simulator use remains something of a holy grail for medical education researchers.

Acknowledgements

The authors would like to acknowledge the help and assistance of Dr Lois Brand and Dr Denis Lindo with data collection and Professor David Wall for his help with statistical analyses.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

Additional information

Notes on contributors

ZOE PASKINS (ZP) and JO KIRKCALDY (JK) conceived the study, designed the assessments and collected the data. ZP undertook the statistical analyses and wrote the initial draft of the paper.

MAGGIE ALLEN, IAN FRASER and COLIN MACDOUGALL helped refine the study design and edit the final draft.

ED PEILE supervised the study and contributed to the final draft.

References

  • Beilock SL, Bertenthal BI, McLay AM, Carr TH. Haste does not always make waste: Expertise, direction of attention, and speed versus accuracy in performing psychomotor skills. Psychon Bull Rev 2004; 11:373–379
  • Bond WF, Spillane L. The use of simulation for emergency medicine resident assessment. Acad Emerg Med 2002; 9(11):1295–1299
  • Boulet JR, Murray D, Kras J, Woodhouse J, McAllister J, Ziv A. Reliability and validity of a simulation-based acute care skills assessment for medical students and residents. Anesthesiology 2003; 99(6):1270–1280
  • Bradley P. The history of simulation in medical education and possible future directions. Med Educ 2006; 40(3):254–262
  • Byrne AJ, Greaves JD. Assessment instruments used during anaesthetic simulation: Review of published studies. Br J Anaesth 2001; 86(3):445–450
  • Cronbach LJ. My current thoughts on coefficient alpha and successor procedures. Educ Psychol Meas 2004; 64(3):391–418
  • Cullinane M, Findlay G, Hargraves C, Lucas S. National Confidential Enquiry into Patient Outcome and Death: An Acute Problem [Internet]. 2005. Available from: http://www.ncepod.org.uk/2005report/. Accessed 25 October 2008
  • Devitt JH, Kurrek MM, Cohen MM, Fish K, Fish P, Noel AG, Szalai JP. Testing internal consistency and construct validity during evaluation of performance in a patient simulator. Anesth Analg 1998; 86(6):1160–1164
  • Devitt JH, Kurrek MM, Cohen MM, Cleave-Hogg D. The validity of performance assessments using simulation. Anesthesiology 2001; 95(1):36–42
  • Franklin C, Mathew J. Developing strategies to prevent in-hospital cardiac arrest: Analysing responses of physicians and nurses in the hours before the event. Crit Care Med 2002; 22:244–247
  • General Medical Council. Tomorrow's doctors. London: General Medical Council; 2003
  • Gordon JA, Tancredi DN, Binder WD, Wilkerson WM, Shaffer DW. Assessment of a clinical performance evaluation tool for use in a simulator-based testing environment: A pilot study. Acad Med 2003; 78(Suppl 10):S45–S47
  • Guilford JP. Psychometric methods. 2nd ed. New York: McGraw-Hill; 1954
  • Hodgetts TJ, Kenward G, Vlackonikolis I, Payne S, Castle N, Crouch R, Ineson N, Shaikh L. Incidence, location and reasons for avoidable in-hospital cardiac arrest in a district general hospital. Resuscitation 2002; 54:115–123
  • MacDowall J. The assessment and treatment of the acutely ill patient – The role of the patient simulator as a teaching tool in the undergraduate programme. Med Teach 2006; 28(4):326–329
  • Maran NJ, Glavin RJ. Low- to high-fidelity simulation – A continuum of medical education? Med Educ 2003; 37(Suppl 1):22–28
  • McQuillan P, Pilkington S, Allan A, Taylor B, Short A, Morgan G, Nielson M, Barrett D, Smith G. Confidential inquiry into quality of care before admission to intensive care. BMJ 1998; 316:1853–1858
  • Miller GE. The assessment of clinical skills/competence/performance. Acad Med 1990; 65(9):S63–S67
  • Morgan PJ, Cleave-Hogg D. A Canadian simulation experience: Faculty and student opinions of a performance evaluation study. Br J Anaesth 2000a; 85(5):779–781
  • Morgan PJ, Cleave-Hogg D. Evaluation of medical students’ performance using the anaesthesia simulator. Med Educ 2000b; 34(1):42–45
  • Morgan PJ, Cleave-Hogg D, Guest GB. A comparison of global ratings and checklist scores from an undergraduate assessment using an anesthesia simulator. Acad Med 2001a; 76(10):1053–1055
  • Morgan PJ, Cleave-Hogg DM, Guest CB, Herold J. Validity and reliability of undergraduate performance assessments in an anesthesia simulator. Can J Anaesth 2001b; 48(3):225–233
  • Morgan PJ, Cleave-Hogg D, DeSousa S, Tarshis J. High-fidelity patient simulation: Validation of performance checklists. Br J Anaesth 2004; 92(3):388–392
  • Murray D, Boulet J, Ziv A, Woodhouse J, Kras J, McAllister J. An acute care skills evaluation for graduating medical students: A pilot study using clinical simulation. Med Educ 2002; 36(9):833–841
  • Mushquash C, O’Connor BP. SPSS and SAS programs for generalizability theory analyses. Behav Res Methods 2006; 38(3):542–547
  • Perkins GD, Barret H, Bullock I, Gabbott DA, Nolan JP, Mitchell S, Short A, Smith CM, Smith GB, Todd S, Bion JF. The Acute Care Undergraduate TEaching (ACUTE) initiative: Consensus development of core competencies in acute care for undergraduates in the United Kingdom. Intensive Care Med 2005; 31(12):1627–1633
  • Schuwirth LWT, Van der Vleuten CPM. The use of clinical simulations in assessment. Med Educ 2003; 37(Suppl 1):65–71
  • Weller J, Robinson B, Larsen P, Caldwell C. Simulation-based training to improve acute care skills in medical undergraduates. N Z Med J 2004; 117(1204):U1119
