‘Hawks’ and ‘doves’: effect of feedback on grades awarded by supervisors of student selected components

Michael J. Murphy, Rohini De A. Seneviratne, Olga J. Remers & Margery H. Davis
Pages e484-e488 | Published online: 30 Oct 2009

Abstract

Background: Supervisors of some student selected components (SSCs) may appear to give higher grades than others. It is not known if feedback can influence the behaviour of supervisors in the grades they award. We have introduced feedback letters in our institution.

Aims: (1) To assess the feasibility of objectively identifying SSCs where grades awarded are consistently higher or lower than the average; (2) To assess the effect of feedback on the grades awarded by supervisors of SSCs.

Methods: The breakdown of SSC grades was examined over four consecutive years, before and after feedback letters were introduced in 2005. The grades awarded globally, and in five individual SSCs, were compared using the χ2 goodness-of-fit test.

Results: (1) Individual SSCs were identified which awarded grades that were consistently different from the average. (2) Overall grades awarded in 2003/04 and 2004/05 (before feedback) were similar (χ2= 0.37, df = 2, p = 0.83). Likewise, overall grades awarded in 2005/06 and 2006/07 (after feedback) were similar (χ2= 1.72, df = 2, p = 0.42). Comparison of 2003/04 with 2005/06 (χ2= 16.0, df = 2, p < 0.001), and 2006/07 (χ2= 26.6, df = 2, p < 0.001), and of 2004/05 with 2005/06 (χ2= 13.5, df = 2, p = 0.001), and 2006/07 (χ2= 23.7, df = 2, p < 0.001), revealed highly significant differences. The grades awarded after feedback were higher than the grades awarded before feedback.

Conclusions: The χ2 goodness-of-fit test may be used to identify individual SSCs where the grades awarded are different from the average, although the interpretation of the results thus obtained is fraught with difficulty. Our data also suggest that it is possible to influence assessors in the grades they award.

Introduction

It is well recognised that different examiners, assessing the same students, may vary in the grades they award (Pitts et al. 1999; Daelmans et al. 2005; Beckman et al. 2006). This phenomenon is most readily observed by staff involved in the central review of results. The ability to set robust pass marks, particularly in summative assessments of clinical competence such as objective structured clinical examinations (OSCEs), is clearly important, and is made more difficult by gender interactions and by the lack of training of assessors (Pell et al. 2008). Even selecting candidates for medical school is challenging; several studies have examined the ability of assessors reliably to evaluate non-cognitive attributes of prospective medical students (Ziv et al. 2008; Donnon et al. 2009). Other studies have focused on inter-rater reliability in a variety of clinical and educational contexts (Metheny 1991; Morgan et al. 2001; Olson et al. 2003; Burnett et al. 2008; Anderson et al. 2008). In most but not all (Metheny 1991) of these studies, inter-rater reliability has been high, or at least acceptable, although this may reflect publication bias, and the experience and training of assessors. Indeed, the importance of training and experience of assessors has rightly been emphasised (Pell et al. 2008; Johnson 2008). Untrained or less experienced assessors tend to give higher marks (Chenot et al. 2007; Pell et al. 2008).

The existence of inter-rater variation raises two related questions. First, is it possible objectively to identify assessors who award grades that appear to be consistently higher or lower than the average (‘doves’ and ‘hawks’)? Second, is it possible to influence the behaviour of assessors in the grades they award? The answers to these questions have implications for the standardisation of assessment. This issue is particularly problematic for student selected components (SSCs) (modules selected by students from a menu of options provided by staff), because of the heterogeneous nature of most SSC programmes. Traditional approaches to standard-setting, e.g. the widely used modified Angoff method, require panels of expert judges (Friedman Ben-David 2000), and it is difficult to justify the application of such resource-intensive methods to individual SSCs, many of which are undertaken by comparatively small numbers of students.

In Dundee, we have since 2005 provided feedback to supervisors of SSCs in the form of a simple comparison of grades awarded in their SSC with average grades awarded in the programme as a whole. In the current study, we took advantage of the opportunity presented by this intervention to seek answers to the two questions posed above. We recognised a priori that the heterogeneity within our SSC programme would make it difficult to provide a clear answer to the first question. We nevertheless wished to investigate the feasibility of examining this issue in relation to SSCs.

Background

SSCs completed by medical students in Phase 2 of the Dundee curriculum form the basis of the current study. We have previously described this programme in detail (Murphy et al. 2008). Briefly, Phase 2 consisted of the second and third years of a five-year curriculum. Across the two years, sixteen weeks were devoted to SSCs (four-week blocks in January and May of each year). The total number of SSCs in the Phase 2 programme ranged from 65 to 74 across the four years of the study. All SSCs were either two or four weeks long. The SSCs covered a varied range of topics: some covered core topics in more depth; others covered medical topics related to the core; yet others covered topics less directly related to medicine. A sample list of SSCs for 2005/06 can be found at the following web link: http://www.dundee.ac.uk/meded/frames/SSCResearch.html.

Methods

The period of study consisted of the two academic years before feedback letters were introduced (2003/04 and 2004/05), and the two years after (2005/06 and 2006/07). SSCs were included if they met the following criteria:

  1. They had been running for at least three years prior to the period of study (to minimise the potential effect of inexperience of the programme on grades awarded);

  2. They ran throughout the entire period of study;

  3. There were no major changes to the staffing or design of the SSC during the period of study;

  4. They were usually run for at least twenty students (to reduce sampling bias).

A letter was sent to the named supervisor of each SSC at the end of the 2005/06 and 2006/07 academic years. An example is shown in Figure 1. In every case, the named supervisor was the assessor of the module.

Figure 1. Feedback letter sent to supervisors of student selected components.

The overall breakdowns of grades awarded in the Phase 2 SSC programme for each of the four chronological years studied (the four ‘global datasets’) were compared using the χ2 goodness-of-fit test. For each chronological year, the grades awarded for five individual SSCs were compared with the overall breakdown of grades for that year, also using the χ2 goodness-of-fit test. The probabilities that individual χ2 statistics were obtained by chance were expressed as p values.
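As an illustration of this step, the following is a minimal sketch (not the authors' code) of how the grade distribution of a single SSC could be compared with the programme-wide distribution using the χ2 goodness-of-fit test in Python. The counts and proportions are hypothetical, and the three merged grade categories mirror the df = 2 reported in the Results.

    # Minimal sketch with hypothetical data; not the authors' analysis code.
    from scipy.stats import chisquare

    # Hypothetical counts for one SSC across three merged grade categories
    ssc_counts = [4, 10, 6]

    # Hypothetical programme-wide proportions for the same three categories
    global_props = [0.10, 0.55, 0.35]

    # Expected counts if this SSC followed the programme-wide pattern
    n = sum(ssc_counts)
    expected = [p * n for p in global_props]

    chi2_stat, p_value = chisquare(f_obs=ssc_counts, f_exp=expected)
    print(f"chi2 = {chi2_stat:.2f}, df = {len(ssc_counts) - 1}, p = {p_value:.3f}")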

The topics covered by the five SSCs were as follows: surgical studies; perspectives on medical advances; gastrointestinal endoscopy; pathology and investigation of violence; forensic science. Each of these SSCs is referred to by a number (1 to 5). The identity of individual SSCs was preserved throughout; thus ‘SSC1’ is the same SSC in each of the four chronological years covered by the study.

Results

Global comparisons: Table 1 shows the overall grades awarded in the Phase 2 SSC programme over the four chronological years of the study (the four ‘global datasets’). In the two years immediately preceding the introduction of feedback letters (2003/04 and 2004/05) the grades were similar (χ2= 0.37, df = 2, p = 0.83). Likewise, overall grades awarded in the first two years after feedback letters were introduced (2005/06 and 2006/07) were similar (χ2= 1.72, df = 2, p = 0.42). However, separate comparisons of 2003/04 with 2005/06 (χ2= 16.0, df = 2, p < 0.001), and 2006/07 (χ2= 26.6, df = 2, p < 0.001), revealed highly significant differences; this applied also to separate comparisons of 2004/05 with 2005/06 (χ2= 13.5, df = 2, p = 0.001), and 2006/07 (χ2= 23.7, df = 2, p < 0.001). In summary, the grades awarded by supervisors post feedback were significantly higher than the grades awarded pre feedback.

Table 1.  Comparison of grades awarded in the Phase 2 programme over four consecutive years
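The paper reports df = 2 for each year-on-year comparison of the three merged grade categories, but does not state whether this was computed as a contingency test or as a goodness-of-fit test against the earlier year's proportions. The sketch below, using entirely hypothetical counts, treats one such comparison as a 2 × 3 contingency table, which yields the same degrees of freedom.

    # Minimal sketch with hypothetical data; the exact procedure used by the
    # authors (contingency vs goodness-of-fit) is not specified in the paper.
    from scipy.stats import chi2_contingency

    # Rows: two academic years; columns: three merged grade categories
    table = [
        [30, 180, 120],   # e.g. a pre-feedback year
        [20, 160, 170],   # e.g. a post-feedback year
    ]

    chi2_stat, p_value, df, expected = chi2_contingency(table)
    print(f"chi2 = {chi2_stat:.2f}, df = {df}, p = {p_value:.3f}")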

Individual comparisons: Comparison of grades awarded for individual SSCs with the relevant global datasets is shown in Table 2. Individual SSCs were identified which awarded grades that were consistently different from the average. For example, the grades awarded in SSC3 were significantly lower than average in each year (although the χ2 test was not performed for SSC3 for the 2003/04 dataset because one of the cell counts was zero). For other SSCs a consistent pattern was not observed. For example, for SSC4 the grades awarded in 2003/04 were lower than average, in 2004/05 higher than average, in 2005/06 not significantly different from average, and in 2006/07 higher than average.

Table 2.  Breakdown of grades awarded over four years in Phase 2 SSC programme and in selected SSCs

Discussion

In this report we have sought answers to two questions. First, is it possible objectively to identify assessors or modules whose grades are consistently higher or lower than the average? Second, is it possible to influence the behaviour of assessors in the grades they award?

We compared the grades awarded by selected individual SSCs with those awarded in the programme as a whole, over four consecutive years. In each case, we used the χ2 statistic as a summary measure of deviation from the average. We sought to identify SSCs where the grades awarded were consistently higher or lower than the average. Although the grades awarded by some SSCs, e.g. SSC3, appeared to be lower than average, in other cases a clear consistent pattern was not observed over the four years of the study. However, these data must be interpreted with caution. The observation of higher or lower grades than average does not per se imply the application of a different standard of assessment from the average. Precisely because of the heterogeneity of SSCs, there are multiple alternative explanations; indeed we highlighted this point in the feedback letters. The purpose of the analysis was simply to identify SSCs for which the application of a different standard of assessment is one possible explanation among several.

The grades awarded in the SSC programme as a whole (the ‘global datasets’) were clearly different in the two years after the feedback letters were introduced compared with the two years before. In the absence of any major changes to individual courses, or to methods of assessment, during the period of the study, it seems likely that the observed differences between the grades awarded pre and post feedback reflect a response to the feedback letters. The SSC supervisors, as a group, awarded higher grades post feedback than they had previously. This finding was not anticipated, and raises the question of what might reasonably have been expected. The letters were sent as part of a programme of good educational governance, with the explicit aim of making the assessment process more transparent, but not with the explicit aim of altering the behaviour of the recipients. It might have been anticipated that the behaviour of assessors whose grades were obviously different from the mean would have been affected in such a way as to bring them closer to the mean. We did not find any convincing evidence of this, although we only examined a small cohort of SSCs. Moreover, the SSCs included in the analysis were selected on the basis of their size, in order to reduce sampling bias in the calculation of individual χ2 statistics. The majority of SSCs were run for smaller numbers of students, and it is possible that smaller SSCs behave differently from larger ones with respect to these analyses.

Limitations of the analysis

This study illustrates some of the difficulties in using the χ2 goodness-of-fit test to compare grades awarded by different SSCs. The χ2 statistics shown in Table 2, and the associated p values, must be interpreted with caution. First, the probability that the observed χ2 statistic has arisen by chance reflects the size of the SSC as well as the pattern of grades awarded. This applies in particular to SSC4, by far the largest SSC studied. Second, the pattern observed was not always consistent; for example, the grades awarded by SSC4 were lower, then higher, than average in the two years before feedback was introduced. Third, the requirements of the χ2 goodness-of-fit test meant substantial ‘merging’ of grade categories, resulting in the fairly crude categorisation shown in Table 2. Fourth, the application of Yates’ correction may have ‘over-compensated’ for the small cell counts in some instances, and masked true differences. A larger, and longer, study will be required in order to establish consistency or otherwise in the patterns of grades awarded by these SSCs. This analysis merely provides the framework for further studies.
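To make the points about category merging and Yates' correction concrete, the sketch below computes a continuity-corrected goodness-of-fit statistic by hand for a small, hypothetical SSC. It is an illustration under assumed numbers, not the authors' analysis, and is written out explicitly because scipy.stats.chisquare does not apply Yates' correction.

    # Minimal sketch with hypothetical data; Yates' correction applied by hand.
    from scipy.stats import chi2

    observed = [2, 12, 6]                      # small SSC, three merged categories
    global_props = [0.10, 0.55, 0.35]          # hypothetical programme-wide proportions
    expected = [p * sum(observed) for p in global_props]

    # Yates' continuity correction subtracts 0.5 from each absolute deviation
    # (floored at zero), shrinking the statistic when cell counts are small.
    stat = sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e
               for o, e in zip(observed, expected))
    p_value = chi2.sf(stat, df=len(observed) - 1)
    print(f"corrected chi2 = {stat:.2f}, p = {p_value:.3f}")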

We cannot exclude the possibility that smaller SSCs behave differently from larger ones with respect to the analyses we have performed. Further studies are needed to examine the grades awarded by supervisors (assessors) of small SSCs compared with the grades awarded by supervisors of large SSCs. It is entirely plausible to hypothesise that these grades might be different; the number of students undertaking an SSC is likely to have a material impact on the interaction between individual students and supervisors (and other teaching staff), and hence the supervisors’ assessment.

Implications for research and practice

Our findings provide the basis for further studies in relation to both of the research questions posed. In relation to the first question (is it possible objectively to identify assessors who award grades that appear to be consistently higher or lower than the average, i.e. ‘doves’ and ‘hawks’?), longer and larger studies are required to establish the consistency or otherwise of assessors’ behaviour, i.e. to identify true hawks and doves. Given the acknowledged difficulty imposed by the heterogeneous nature of SSC programmes, it may be more profitable to examine this behaviour in other forums of assessment, e.g. OSCEs, or standard-setting panels. In relation to the second (is it possible to influence the behaviour of assessors in the grades they award?), further research should aim to establish interventions (feedback) with explicit aims and to examine whether the behaviour of assessors can be influenced in a predictable fashion. On a wider note, additional research might focus on the grades awarded by different academic departments or divisions within medical schools.

Conclusions

Our findings suggest that the χ2 statistic may in principle be used to identify individual assessors or SSCs where the grades awarded are different from the average, although the interpretation of the results thus obtained is fraught with difficulty. Our finding that the grades awarded by supervisors post feedback were significantly higher than the grades awarded pre feedback suggests that it is possible to influence assessors in the grades they award.

Acknowledgements

The authors wish to acknowledge the contribution of staff and students at the University of Dundee Medical School without whose co-operation this study would not have been possible.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

Additional information

Notes on contributors

Michael J. Murphy

Michael J Murphy, BA(Mod), MB BCh BAO, FRCP Edin, FRCPath is Senior Lecturer in Biochemical Medicine, and SSC Convenor, at the University of Dundee, Scotland, UK.

Rohini De A. Seneviratne

Rohini De Alwis Seneviratne, MD, MBBS, MMEd, FCCP(SL) is Professor in Community Medicine, Faculty of Medicine, University of Colombo, Sri Lanka.

Olga J. Remers

Olga J Remers, BSc, MSc is Assessment Administrative Assistant (SSCs) at the University of Dundee Medical School.

Margery H. Davis

Margery H Davis, MD, MB ChB, FRCP is Director of the Centre for Medical Education and Professor of Medical Education, University of Dundee.

References

  • Anderson K, Peterson R, Tonkin A, Cleary E. The assessment of student reasoning in the context of a clinically oriented PBL program. Med Teach 2008; 30: 787–794
  • Beckman TJ, Cook DA, Mandrekar JN. Factor instability of clinical teaching assessment scores among general internists and cardiologists. Med Educ 2006; 40: 1209–1216
  • Burnett E, Phillips G, Ker JS. From theory to practice in learning about healthcare associated infections: Reliable assessment of final year medical students' ability to reflect. Med Teach 2008; 30: e157–e160
  • Chenot JF, Simmenroth-Nayda A, Koch A, Fischer T, Scherer M, Emmert B, Stanske B, Kochen MM, Himmel W. Can student tutors act as examiners in an objective structured clinical examination? Med Educ 2007; 41: 1032–1038
  • Daelmans HEM, van der Hem-Stokroos HH, Hoogenboom RJI, Scherpbier AJJA, Stehouwer CDA, van der Vleuten CPM. Global clinical performance rating, reliability and validity in an undergraduate clerkship. Neth J Med 2005; 63: 279–284
  • Donnon T, Oddone-Paolucci E, Violato C. A predictive validity study of medical judgment vignettes to assess students' noncognitive attributes: A 3-year prospective longitudinal study. Med Teach 2009; 31: e148–e155
  • Friedman Ben-David M. AMEE Medical Education Guide 18: Standard setting in student assessment. Med Teach 2000; 22: 120–130
  • Johnson M. Exploring assessor consistency in a health and social care qualification using a sociocultural perspective. J Voc Educ Train 2008; 60: 173–187
  • Metheny WP. Limitations of physician ratings in the assessment of student clinical performance in an obstetrics and gynecology clerkship. Obstet Gynecol 1991; 78: 136–141
  • Morgan PJ, Cleave-Hogg D, Guest CB. A comparison of global ratings and checklist scores from an undergraduate assessment using an anesthesia simulator. Acad Med 2001; 76: 1053–1055
  • Murphy MJ, Seneviratne RdeA, McAleer SP, Remers OJ, Davis MH. Do students learn what teachers think they teach? Med Teach 2008; 30: e175–e179
  • Olson L, Schieve AD, Ruit KG, Vari RC. Measuring inter-rater reliability of the sequenced performance inventory and reflective assessment of learning (SPIRAL). Acad Med 2003; 78: 844–850
  • Pell G, Homer MS, Roberts TE. Assessor training: Its effects on criterion-based assessment in a medical context. Int J Res & Meth Educ 2008; 31: 143–154
  • Pitts J, Coles C, Thomas P. Educational portfolios in the assessment of general practice trainers: Reliability of assessors. Med Educ 1999; 33: 515–520
  • Ziv A, Rubin O, Moshinsky A, Gafni N, Kotler M, Dagan Y, Lichtenberg D, Mekori YA, Mittelman M. MOR: A simulation-based assessment centre for evaluating the personal and interpersonal qualities of medical school candidates. Med Educ 2008; 42: 991–998
