‘Hawks’ and ‘doves’: effect of feedback on grades awarded by supervisors of student selected components

Michael J. Murphy, Rohini De A. Seneviratne, Olga J. Remers & Margery H. Davis
Pages e484-e488 | Published online: 30 Oct 2009

Abstract

Background: Supervisors of some student selected components (SSCs) may appear to give higher grades than others. It is not known if feedback can influence the behaviour of supervisors in the grades they award. We have introduced feedback letters in our institution.

Aims: (1) To assess the feasibility of objectively identifying SSCs where grades awarded are consistently higher or lower than the average; (2) To assess the effect of feedback on the grades awarded by supervisors of SSCs.

Methods: The breakdown of SSC grades was examined over four consecutive years, before and after feedback letters were introduced in 2005. The grades awarded globally, and in five individual SSCs, were compared using the χ2 goodness-of-fit test.

Results: (1) Individual SSCs were identified which awarded grades that were consistently different from the average. (2) Overall grades awarded in 2003/04 and 2004/05 (before feedback) were similar (χ2= 0.37, df = 2, p = 0.83). Likewise, overall grades awarded in 2005/06 and 2006/07 (after feedback) were similar (χ2= 1.72, df = 2, p = 0.42). Comparison of 2003/04 with 2005/06 (χ2= 16.0, df = 2, p < 0.001), and 2006/07 (χ2= 26.6, df = 2, p < 0.001), and of 2004/05 with 2005/06 (χ2= 13.5, df = 2, p = 0.001), and 2006/07 (χ2= 23.7, df = 2, p < 0.001), revealed highly significant differences. The grades awarded after feedback were higher than the grades awarded before feedback.

Conclusions: The χ2 goodness-of-fit test may be used to identify individual SSCs where the grades awarded are different from the average, although the interpretation of the results thus obtained is fraught with difficulty. Our data also suggest that it is possible to influence assessors in the grades they award.

Introduction

It is well recognised that different examiners, assessing the same students, may vary in the grades they award (Pitts et al. 1999; Daelmans et al. 2005; Beckman et al. 2006). This phenomenon is most readily observed by staff involved in the central review of results. The ability to set robust pass marks, particularly in summative assessments of clinical competence such as objective structured clinical examinations (OSCEs), is clearly important, and is made more difficult by gender interactions and by the lack of training of assessors (Pell et al. 2008). Even selecting candidates for medical school is challenging; several studies have examined the ability of assessors reliably to evaluate non-cognitive attributes of prospective medical students (Ziv et al. 2008; Donnon et al. 2009). Other studies have focused on inter-rater reliability in a variety of clinical and educational contexts (Metheny 1991; Morgan et al. 2001; Olson et al. 2003; Burnett et al. 2008; Anderson et al. 2008). In most but not all (Metheny 1991) of these studies, inter-rater reliability has been high, or at least acceptable, although this may reflect publication bias, and the experience and training of assessors. Indeed, the importance of training and experience of assessors has rightly been emphasised (Pell et al. 2008; Johnson 2008). Untrained or less experienced assessors tend to give higher marks (Chenot et al. 2007; Pell et al. 2008).

The existence of inter-rater variation raises two related questions. First, is it possible objectively to identify assessors who award grades that appear to be consistently higher or lower than the average (‘doves’ and ‘hawks’)? Second, is it possible to influence the behaviour of assessors in the grades they award? The answers to these questions have implications for the standardisation of assessment. This issue is particularly problematic for student selected components (SSCs) (modules selected by students from a menu of options provided by staff), because of the heterogeneous nature of most SSC programmes. Traditional approaches to standard-setting, e.g. the widely used modified Angoff method, require panels of expert judges (Friedman Ben-David 2000), and it is difficult to justify the application of such resource-intensive methods to individual SSCs, many of which are undertaken by comparatively small numbers of students.

In Dundee, we have since 2005 provided feedback to supervisors of SSCs in the form of a simple comparison of grades awarded in their SSC with average grades awarded in the programme as a whole. In the current study, we took advantage of the opportunity presented by this intervention to seek answers to the two questions posed above. We recognised a priori that the heterogeneity within our SSC programme would make it difficult to provide a clear answer to the first question. We nevertheless wished to investigate the feasibility of examining this issue in relation to SSCs.

Background

SSCs completed by medical students in Phase 2 of the Dundee curriculum form the basis of the current study. We have previously described this programme in detail (Murphy et al. 2008). Briefly, Phase 2 consisted of the second and third years of a five-year curriculum. Across the two years, sixteen weeks were devoted to SSCs (four-week blocks in January and May of each year). The total number of SSCs in the Phase 2 programme ranged from 65 to 74 across the four years of the study. All SSCs were either two or four weeks long. The SSCs covered a varied range of topics: some covered core topics in more depth; others covered medical topics related to the core; yet others covered topics less directly related to medicine. A sample list of SSCs for 2005/06 can be found at the following web link: http://www.dundee.ac.uk/meded/frames/SSCResearch.html.

Methods

The period of study consisted of the two academic years before feedback letters were introduced (2003/04 and 2004/05), and the two years after (2005/06 and 2006/07). SSCs were included if they met the following criteria:

  1. They had been running for at least three years prior to the period of study (to minimise the potential effect of inexperience of the programme on grades awarded);

  2. They ran throughout the entire period of study;

  3. There were no major changes to the staffing or design of the SSC during the period of study;

  4. They were usually run for at least twenty students (to reduce sampling bias).

A letter was sent to the named supervisor of each SSC at the end of the 2005/06 and 2006/07 academic years. An example is shown in Figure 1. In every case, the named supervisor was the assessor of the module.

Figure 1. Feedback letter sent to supervisors of student selected components.

The overall breakdowns of grades awarded in the Phase 2 SSC programme for each of the four chronological years studied (the four ‘global datasets’) were compared using the χ2 goodness-of-fit test. For each chronological year, the grades awarded for five individual SSCs were compared with the overall breakdown of grades for that year, also using the χ2 goodness-of-fit test. The probabilities that individual χ2 statistics were obtained by chance were expressed as p values.
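As an illustration of this step, the following is a minimal sketch (not the authors' code) of how the grade distribution of a single SSC could be compared with the programme-wide distribution using the χ2 goodness-of-fit test in Python. The counts and proportions are hypothetical, and the three merged grade categories mirror the df = 2 reported in the Results.

    # Minimal sketch with hypothetical data; not the authors' analysis code.
    from scipy.stats import chisquare

    # Hypothetical counts for one SSC across three merged grade categories
    ssc_counts = [4, 10, 6]

    # Hypothetical programme-wide proportions for the same three categories
    global_props = [0.10, 0.55, 0.35]

    # Expected counts if this SSC followed the programme-wide pattern
    n = sum(ssc_counts)
    expected = [p * n for p in global_props]

    chi2_stat, p_value = chisquare(f_obs=ssc_counts, f_exp=expected)
    print(f"chi2 = {chi2_stat:.2f}, df = {len(ssc_counts) - 1}, p = {p_value:.3f}")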

The topics covered by the five SSCs were as follows: surgical studies; perspectives on medical advances; gastrointestinal endoscopy; pathology and investigation of violence; forensic science. Each of these SSCs is referred to by a number (1 to 5). The identity of individual SSCs was preserved throughout; thus ‘SSC1’ is the same SSC in each of the four chronological years covered by the study.

Results

Global comparisons: Table 1 shows the overall grades awarded in the Phase 2 SSC programme over the four chronological years of the study (the four ‘global datasets’). In the two years immediately preceding the introduction of feedback letters (2003/04 and 2004/05) the grades were similar (χ2= 0.37, df = 2, p = 0.83). Likewise, overall grades awarded in the first two years after feedback letters were introduced (2005/06 and 2006/07) were similar (χ2= 1.72, df = 2, p = 0.42). However, separate comparisons of 2003/04 with 2005/06 (χ2= 16.0, df = 2, p < 0.001), and 2006/07 (χ2= 26.6, df = 2, p < 0.001), revealed highly significant differences; this applied also to separate comparisons of 2004/05 with 2005/06 (χ2= 13.5, df = 2, p = 0.001), and 2006/07 (χ2= 23.7, df = 2, p < 0.001). In summary, the grades awarded by supervisors post feedback were significantly higher than the grades awarded pre feedback.

Table 1.  Comparison of grades awarded in the Phase 2 programme over four consecutive years
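The paper reports df = 2 for each year-on-year comparison of the three merged grade categories, but does not state whether this was computed as a contingency test or as a goodness-of-fit test against the earlier year's proportions. The sketch below, using entirely hypothetical counts, treats one such comparison as a 2 × 3 contingency table, which yields the same degrees of freedom.

    # Minimal sketch with hypothetical data; the exact procedure used by the
    # authors (contingency vs goodness-of-fit) is not specified in the paper.
    from scipy.stats import chi2_contingency

    # Rows: two academic years; columns: three merged grade categories
    table = [
        [30, 180, 120],   # e.g. a pre-feedback year
        [20, 160, 170],   # e.g. a post-feedback year
    ]

    chi2_stat, p_value, df, expected = chi2_contingency(table)
    print(f"chi2 = {chi2_stat:.2f}, df = {df}, p = {p_value:.3f}")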

Individual comparisons: Comparison of grades awarded for individual SSCs with the relevant global datasets is shown in Table 2. Individual SSCs were identified which awarded grades that were consistently different from the average. For example, the grades awarded in SSC3 were significantly lower than average in each year (although the χ2 test was not performed for SSC3 for the 2003/04 dataset because one of the cell counts was zero). For other SSCs a consistent pattern was not observed. For example, for SSC4 the grades awarded in 2003/04 were lower than average, in 2004/05 higher than average, in 2005/06 not significantly different from average, and in 2006/07 higher than average.

Table 2.  Breakdown of grades awarded over four years in Phase 2 SSC programme and in selected SSCs

Discussion

In this report we have sought answers to two questions. First, is it possible objectively to identify assessors or modules whose grades are consistently higher or lower than the average? Second, is it possible to influence the behaviour of assessors in the grades they award?

We compared the grades awarded by selected individual SSCs with those awarded in the programme as a whole, over four consecutive years. In each case, we used the χ2 statistic as a summary measure of deviation from the average. We sought to identify SSCs where the grades awarded were consistently higher or lower than the average. Although the grades awarded by some SSCs, e.g. SSC3, appeared to be lower than average, in other cases a clear consistent pattern was not observed over the four years of the study. However, these data must be interpreted with caution. The observation of higher or lower grades than average does not per se imply the application of a different standard of assessment from the average. Precisely because of the heterogeneity of SSCs, there are multiple alternative explanations; indeed we highlighted this point in the feedback letters. The purpose of the analysis was simply to identify SSCs for which the application of a different standard of assessment is one possible explanation among several.

The grades awarded in the SSC programme as a whole (the ‘global datasets’) were clearly different in the two years after the feedback letters were introduced compared with the two years before. In the absence of any major changes to individual courses, or to methods of assessment, during the period of the study, it seems likely that the observed differences between the grades awarded pre and post feedback reflect a response to the feedback letters. The SSC supervisors, as a group, awarded higher grades post feedback than they had previously. This finding was not anticipated, and raises the question of what might reasonably have been expected. The letters were sent as part of a programme of good educational governance, with the explicit aim of making the assessment process more transparent, but not with the explicit aim of altering the behaviour of the recipients. It might have been anticipated that the behaviour of assessors whose grades were obviously different from the mean would have been affected in such a way as to bring them closer to the mean. We did not find any convincing evidence of this, although we only examined a small cohort of SSCs. Moreover, the SSCs included in the analysis were selected on the basis of their size, in order to reduce sampling bias in the calculation of individual χ2 statistics. The majority of SSCs were run for smaller numbers of students, and it is possible that smaller SSCs behave differently from larger ones with respect to these analyses.

Limitations of the analysis

This study illustrates some of the difficulties in using the χ2 goodness-of-fit test to compare grades awarded by different SSCs. The χ2 statistics shown in Table 2, and the associated p values, must be interpreted with caution. First, the probability that the observed χ2 statistic has arisen by chance reflects the size of the SSC as well as the pattern of grades awarded. This applies in particular to SSC4, by far the largest SSC studied. Second, the pattern observed was not always consistent; for example, the grades awarded by SSC4 were lower, then higher, than average in the two years before feedback was introduced. Third, the requirements of the χ2 goodness-of-fit test meant substantial ‘merging’ of grade categories, resulting in the fairly crude categorisation shown in Table 2. Fourth, the application of Yates’ correction may have ‘over-compensated’ for the small cell counts in some instances, and masked true differences. A larger, and longer, study will be required in order to establish consistency or otherwise in the patterns of grades awarded by these SSCs. This analysis merely provides the framework for further studies.
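To make the points about category merging and Yates' correction concrete, the sketch below computes a continuity-corrected goodness-of-fit statistic by hand for a small, hypothetical SSC. It is an illustration under assumed numbers, not the authors' analysis, and is written out explicitly because scipy.stats.chisquare does not apply Yates' correction.

    # Minimal sketch with hypothetical data; Yates' correction applied by hand.
    from scipy.stats import chi2

    observed = [2, 12, 6]                      # small SSC, three merged categories
    global_props = [0.10, 0.55, 0.35]          # hypothetical programme-wide proportions
    expected = [p * sum(observed) for p in global_props]

    # Yates' continuity correction subtracts 0.5 from each absolute deviation
    # (floored at zero), shrinking the statistic when cell counts are small.
    stat = sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e
               for o, e in zip(observed, expected))
    p_value = chi2.sf(stat, df=len(observed) - 1)
    print(f"corrected chi2 = {stat:.2f}, p = {p_value:.3f}")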

We cannot exclude the possibility that smaller SSCs behave differently from larger ones with respect to the analyses we have performed. Further studies are needed to examine the grades awarded by supervisors (assessors) of small SSCs compared with the grades awarded by supervisors of large SSCs. It is entirely plausible to hypothesise that these grades might be different; the number of students undertaking an SSC is likely to have a material impact on the interaction between individual students and supervisors (and other teaching staff), and hence the supervisors’ assessment.

Implications for research and practice

Our findings provide the basis for further studies in relation to both of the research questions posed. In relation to the first question (is it possible objectively to identify assessors who award grades that appear to be consistently higher or lower than the average, i.e. ‘doves’ and ‘hawks’?), longer and larger studies are required to establish the consistency or otherwise of assessors’ behaviour, i.e. to identify true hawks and doves. Given the acknowledged difficulty imposed by the heterogeneous nature of SSC programmes, it may be more profitable to examine this behaviour in other forums of assessment, e.g. OSCEs, or standard-setting panels. In relation to the second (is it possible to influence the behaviour of assessors in the grades they award?), further research should aim to establish interventions (feedback) with explicit aims and to examine whether the behaviour of assessors can be influenced in a predictable fashion. On a wider note, additional research might focus on the grades awarded by different academic departments or divisions within medical schools.

Conclusions

Our findings suggest that the χ2 statistic may in principle be used to identify individual assessors or SSCs where the grades awarded are different from the average, although the interpretation of the results thus obtained is fraught with difficulty. Our finding that the grades awarded by supervisors post feedback were significantly higher than the grades awarded pre feedback suggests that it is possible to influence assessors in the grades they award.

Acknowledgements

The authors wish to acknowledge the contribution of staff and students at the University of Dundee Medical School without whose co-operation this study would not have been possible.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

Additional information

Notes on contributors

Michael J. Murphy

Michael J Murphy, BA(Mod), MB BCh BAO, FRCP Edin, FRCPath is Senior Lecturer in Biochemical Medicine, and SSC Convenor, at the University of Dundee, Scotland, UK.

Rohini De A. Seneviratne

Rohini De Alwis Seneviratne, MD, MBBS, MMEd, FCCP(SL) is Professor in Community Medicine, Faculty of Medicine, University of Colombo, Sri Lanka.

Olga J. Remers

Olga J Remers, BSc, MSc is Assessment Administrative Assistant (SSCs) at the University of Dundee Medical School.

Margery H. Davis

Margery H Davis, MD, MB ChB, FRCP is Director of the Centre for Medical Education and Professor of Medical Education, University of Dundee.

References

  • Anderson K, Peterson R, Tonkin A, Cleary E. The assessment of student reasoning in the context of a clinically oriented PBL program. Med Teach 2008; 30: 787–794
  • Beckman TJ, Cook DA, Mandrekar JN. Factor instability of clinical teaching assessment scores among general internists and cardiologists. Med Educ 2006; 40: 1209–1216
  • Burnett E, Phillips G, Ker JS. From theory to practice in learning about healthcare associated infections: Reliable assessment of final year medical students' ability to reflect. Med Teach 2008; 30: e157–e160
  • Chenot JF, Simmenroth-Nayda A, Koch A, Fischer T, Scherer M, Emmert B, Stanske B, Kochen MM, Himmel W. Can student tutors act as examiners in an objective structured clinical examination? Med Educ 2007; 41: 1032–1038
  • Daelmans HEM, van der Hem-Stokroos HH, Hoogenboom RJI, Scherpbier AJJA, Stehouwer CDA, van der Vleuten CPM. Global clinical performance rating, reliability and validity in an undergraduate clerkship. Neth J Med 2005; 63: 279–284
  • Donnon T, Oddone-Paolucci E, Violato C. A predictive validity study of medical judgment vignettes to assess students' noncognitive attributes: A 3-year prospective longitudinal study. Med Teach 2009; 31: e148–e155
  • Friedman Ben-David M. AMEE Medical Education Guide 18: Standard setting in student assessment. Med Teach 2000; 22: 120–130
  • Johnson M. Exploring assessor consistency in a health and social care qualification using a sociocultural perspective. J Voc Educ Train 2008; 60: 173–187
  • Metheny WP. Limitations of physician ratings in the assessment of student clinical performance in an obstetrics and gynecology clerkship. Obstet Gynecol 1991; 78: 136–141
  • Morgan PJ, Cleave-Hogg D, Guest CB. A comparison of global ratings and checklist scores from an undergraduate assessment using an anesthesia simulator. Acad Med 2001; 76: 1053–1055
  • Murphy MJ, Seneviratne RdeA, McAleer SP, Remers OJ, Davis MH. Do students learn what teachers think they teach? Med Teach 2008; 30: e175–e179
  • Olson L, Schieve AD, Ruit KG, Vari RC. Measuring inter-rater reliability of the sequenced performance inventory and reflective assessment of learning (SPIRAL). Acad Med 2003; 78: 844–850
  • Pell G, Homer MS, Roberts TE. Assessor training: Its effects on criterion-based assessment in a medical context. Int J Res & Meth Educ 2008; 31: 143–154
  • Pitts J, Coles C, Thomas P. Educational portfolios in the assessment of general practice trainers: Reliability of assessors. Med Educ 1999; 33: 515–520
  • Ziv A, Rubin O, Moshinsky A, Gafni N, Kotler M, Dagan Y, Lichtenberg D, Mekori YA, Mittelman M. MOR: A simulation-based assessment centre for evaluating the personal and interpersonal qualities of medical school candidates. Med Educ 2008; 42: 991–998
