
The reliability of setting grade boundaries using comparative judgement

Pages 352-376 | Received 23 Apr 2014, Accepted 06 Mar 2015, Published online: 31 Mar 2015
 

Abstract

In recent years the use of expert judgement to set and maintain examination standards has been increasingly criticised in favour of approaches based on statistical modelling. This paper reviews existing research on this controversy and attempts to unify the evidence within a framework where expertise is utilised in the form of comparative judgement. Initially, the paper introduces a mathematical model for the way in which comparative judgement may operate. Data from existing studies of comparative judgement are then used to estimate suitable parameters for this model. Having derived a working mathematical model for the operation of expert judgement, the paper will then demonstrate that this model provides results that are broadly consistent with existing research, including both studies that are critical and studies that are supportive of the use of expert judgement. The model will then be applied to examine the required scale and design of expert-driven approaches to standard maintaining in order for these to have any chance of being as reliable as the currently favoured, purely statistical approaches. The paper also discusses the minimum conditions required for this approach to designing expert-driven approaches to be appropriate.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1. In fact, comparative judgement may involve consideration of a wider body of a student’s work beyond a single examination script. However, for the purposes of this report and for the sake of easy description, we shall refer to student work as ‘a script’.

2. For example, for a Biology examination, both examination marks and holistic judgement are supposed to yield an idea of how good candidates are at Biology. The exact definition of ‘good at Biology’ may differ between the two but both are interested in the same broad concept.

3. Another convenient property is that a Gumbel distributed variable can easily be simulated by taking minus the natural logarithm of minus the natural logarithm of a uniformly distributed variable.
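The simulation recipe in note 3 can be sketched in a few lines of Python. This is only an illustration of the inverse-transform step described there; the function name `sample_gumbel` and the choice of seed are assumptions made for the example, not part of the paper.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel variate as -ln(-ln(U)), with U uniform on (0, 1)."""
    u = rng.random()
    # random() can return exactly 0.0, which would make ln(u) undefined; redraw.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    draws = [sample_gumbel(rng) for _ in range(100_000)]
    # The standard Gumbel distribution has mean equal to the
    # Euler-Mascheroni constant (about 0.5772).
    print(sum(draws) / len(draws))
```

With a large number of draws the sample mean should sit close to 0.5772, which gives a quick check that the transformation has been applied the right way round.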

4. This is an ad hoc procedure based on the idea that the total variance in estimated holistic quality, given the marks awarded to a script, must decompose into the sum of the variance due to measurement error and the variance in underlying holistic quality given the number of awarded marks. More formal methods of addressing measurement error would have been possible. However, since our ultimate aim is simply to derive a good enough working model and, as seen later in the paper, this aim is achieved, such methods were not considered.

5. Note that the same εi must be applied each time a script is viewed by any judge. Different Eik may apply across different judges but if the same judge views the same script multiple times (i.e. in different packs) then the same Eik should be applied.

6. In fact observations are nested within (a small number of) judges – a fact that is likely to significantly increase the sampling error.

7. 1.96 times the square root of 0.7 × (1–0.7)/75 or 0.7 × (1–0.7)/25 expressed in terms of percentage points.
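The arithmetic in note 7 can be reproduced as follows. This is a minimal sketch of the normal-approximation margin of error stated in the note; the function name `margin_of_error_pp` is an assumption made for illustration.

```python
import math


def margin_of_error_pp(p, n, z=1.96):
    """Half-width of a 95% interval for a proportion p from n observations,
    expressed in percentage points: 100 * z * sqrt(p * (1 - p) / n)."""
    return 100 * z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for n in (75, 25):
        # For p = 0.7 this gives about 10.4 points (n = 75) and 18.0 points (n = 25).
        print(n, round(margin_of_error_pp(0.7, n), 1))
```

As note 6 points out, observations are nested within a small number of judges, so these figures are likely to understate the true sampling error.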

8. This may relate to subjective marking in History reducing the strength of the association between awarded marks and holistic quality.

9. That is, excluding English grade C.

10. Note that since the studies involving English GCSE involved eight judges and the studies involving Physics GCE involved ten judges these perhaps should have been simulated separately. However, for the purpose of brevity, just one set of simulations was used. The simulations strike a balance between the situation in English and the situation in Physics by imagining that nine judges were involved in the exercise.

11. The 5th and 95th percentiles are the values chosen so that 9,000 out of 10,000 simulations yielded figures between them. Only 5% of simulations yielded figures lower than the 5th percentile and only 5% of simulations yielded figures higher than the 95th percentile.
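The definition in note 11 can be illustrated with a short sketch. The 10,000 values below are random stand-ins generated purely for the example, not the paper's simulation results, and the helper name `percentile_band` is likewise an assumption.

```python
import random
import statistics


def percentile_band(values):
    """Return the 5th and 95th percentiles of a list of simulated figures."""
    # n=20 splits the data into 20 equal-probability groups, so the first
    # cut point is the 5th percentile and the last is the 95th.
    cuts = statistics.quantiles(values, n=20)
    return cuts[0], cuts[-1]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical placeholder data standing in for 10,000 simulation outputs.
    simulated = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    low, high = percentile_band(simulated)
    inside = sum(low <= v <= high for v in simulated)
    print(low, high, inside)  # inside should be about 9,000
```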

12. The OCR A level with the largest entry.

13. If insufficient benchmark scripts were available with marks at precisely the grade boundary it might be necessary to include benchmark scripts a few marks either side of the boundary.

14. That is, the same examiner may (possibly) view the same benchmark script more than once.

15. That is, any individual judge will only view a script from the new test once, although different judges may (possibly) examine the same script from the new test.

16. In fact they can be seen as being similar to the problem of vertical scaling where it is well known that statistical procedures can also struggle to provide a solution. For example, Kolen and Brennan (2004) summarised research on the issue by noting that ‘the results for vertical scaling depend heavily on the examinee groups, on the data collection design, and on the statistical procedures used’ (476).
