Abstract
In recent years the use of expert judgement to set and maintain examination standards has been increasingly criticised in favour of approaches based on statistical modelling. This paper reviews existing research on this controversy and attempts to unify the evidence within a framework where expertise is utilised in the form of comparative judgement. The paper first introduces a mathematical model of how comparative judgement may operate. Data from existing studies of comparative judgement are then used to estimate suitable parameters for this model. Having derived a working mathematical model of expert judgement, the paper demonstrates that this model provides results broadly consistent with existing research, including both studies that are critical of the use of expert judgement and studies that support it. The model is then applied to examine the scale and design that expert-driven approaches to standard maintaining would require in order to have any chance of being as reliable as the currently favoured, purely statistical approaches. The paper also discusses the minimum conditions under which this approach to designing expert-driven methods is appropriate.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1. In fact, comparative judgement may involve consideration of a wider body of a student’s work beyond a single examination script. However, for the purposes of this report and for ease of description, we shall refer to student work as ‘a script’.
2. For example, for a Biology examination, both examination marks and holistic judgement are supposed to yield an idea of how good candidates are at Biology. The exact definition of ‘good at Biology’ may differ between the two but both are interested in the same broad concept.
3. Another convenient property is that a Gumbel distributed variable can easily be simulated by taking minus the natural logarithm of minus the natural logarithm of a uniformly distributed variable.
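The inverse-CDF transform described in this note can be sketched in a few lines of Python. The function below is an illustration of the stated property, not code from the paper: applying −ln(−ln(U)) to a uniform variable U yields a standard Gumbel variable, whose mean should be close to the Euler–Mascheroni constant (≈0.5772) and whose variance should be close to π²/6 (≈1.645).

```python
import math
import random

def gumbel_sample(rng: random.Random) -> float:
    """Draw from the standard Gumbel distribution by taking minus the
    natural logarithm of minus the natural logarithm of a uniform variable."""
    u = rng.random()
    return -math.log(-math.log(u))

# Sanity check: moments of a large sample should approach the theoretical
# mean (Euler-Mascheroni constant) and variance (pi^2 / 6) of the
# standard Gumbel distribution.
rng = random.Random(42)
sample = [gumbel_sample(rng) for _ in range(200_000)]
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
```

This convenience is what makes the Gumbel assumption attractive for simulation studies: no specialised statistical library is needed to generate the judgement errors.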
4. An ad hoc procedure based on the idea that the total variance in estimated holistic quality, given the marks awarded to a script, must decompose into the sum of the variance due to measurement error and the variance in underlying holistic quality given the number of awarded marks. More formal methods of addressing measurement error would have been possible but, since our ultimate aim is simply to derive a good enough working model and, as seen later in the paper, this aim is achieved, such methods were not considered.
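The decomposition underlying this ad hoc procedure can be written out as a minimal numerical sketch. The figures below are hypothetical placeholders, not values from the paper; the point is only the rearrangement of the variance identity.

```python
# Assumed decomposition from the note:
#   Var(estimated quality | mark) = measurement-error variance
#                                   + Var(underlying quality | mark)
# Rearranging gives the quantity of interest. All numbers are hypothetical.

total_var_given_mark = 1.30  # hypothetical Var(estimated holistic quality | mark)
error_var = 0.45             # hypothetical variance due to measurement error

underlying_var_given_mark = total_var_given_mark - error_var
```

In practice the error variance would come from an external reliability estimate, so any bias in that estimate flows directly into the derived underlying variance; this is why the authors describe the procedure as ad hoc.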
5. Note that the same εi must be applied each time a script is viewed by any judge. Different Eik may apply across different judges but if the same judge views the same script multiple times (i.e. in different packs) then the same Eik should be applied.
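One way to honour this requirement in a simulation is to cache each error term the first time it is drawn and reuse the cached value thereafter. The class below is an assumed implementation sketch, not the paper's code: the script-level error is shared across all judges, while the judge-by-script error is drawn once per (script, judge) pair and reused whenever that judge views the script again.

```python
import random

class ErrorStore:
    """Caches simulation error terms so repeated viewings reuse the
    same draws, as required by the note (illustrative sketch only)."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.script_errors = {}  # script id -> error shared by all judges
        self.judge_errors = {}   # (script id, judge id) -> per-judge error

    def script_error(self, script):
        # Same value every time this script is viewed, by any judge.
        if script not in self.script_errors:
            self.script_errors[script] = self.rng.gauss(0.0, 1.0)
        return self.script_errors[script]

    def judge_error(self, script, judge):
        # Drawn once per (script, judge) pair; reused if the same judge
        # sees the same script again in a different pack.
        key = (script, judge)
        if key not in self.judge_errors:
            self.judge_errors[key] = self.rng.gauss(0.0, 1.0)
        return self.judge_errors[key]

store = ErrorStore(seed=1)
```

Keying the cache on (script, judge) rather than on each viewing event is what distinguishes this design from drawing a fresh error per comparison.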
6. In fact observations are nested within (a small number of) judges – a fact that is likely to significantly increase the sampling error.
7. 1.96 times the square root of 0.7 × (1–0.7)/75 or 0.7 × (1–0.7)/25 expressed in terms of percentage points.
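The calculation in this note is the usual normal-approximation half-width of a 95% confidence interval for a proportion. Written out in Python (an illustration of the formula as stated, with the note's values p = 0.7 and n = 75 or 25):

```python
import math

def ci_half_width_pp(p: float, n: int) -> float:
    """Half-width of a 95% normal-approximation confidence interval for a
    proportion p with sample size n, in percentage points."""
    return 1.96 * math.sqrt(p * (1 - p) / n) * 100

hw_75 = ci_half_width_pp(0.7, 75)  # roughly 10.4 percentage points
hw_25 = ci_half_width_pp(0.7, 25)  # roughly 18.0 percentage points
```

The tripling of the sample size from 25 to 75 narrows the interval by a factor of √3, which is why the two figures differ.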
8. This may relate to subjective marking in History reducing the strength of the association between awarded marks and holistic quality.
9. That is, excluding English grade C.
10. Note that since the studies involving English GCSE involved eight judges and the studies involving Physics GCE involved ten judges, these perhaps should have been simulated separately. However, for the purpose of brevity, just one set of simulations was used. The simulations strike a balance between the situation in English and the situation in Physics by imagining that nine judges were involved in the exercise.
11. The 5th and 95th percentiles are the values chosen so that 9,000 out of 10,000 simulations yielded figures between these two values. Only 5% of simulations yielded figures lower than the 5th percentile and only 5% of simulations yielded figures higher than the 95th percentile.
12. The OCR A level with the largest entry.
13. If insufficient benchmark scripts were available with marks at precisely the grade boundary it might be necessary to include benchmark scripts a few marks either side of the boundary.
14. That is, the same examiner may (possibly) view the same benchmark script more than once.
15. That is, any individual judge will only view a script from the new test once, although different judges may (possibly) examine the same script from the new test.
16. In fact they can be seen as being similar to the problem of vertical scaling where it is well known that statistical procedures can also struggle to provide a solution. For example, Kolen and Brennan (2004) summarised research on the issue by noting that ‘the results for vertical scaling depend heavily on the examinee groups, on the data collection design, and on the statistical procedures used’ (476).