Abstract
In May 2008, Ofqual established a two-year programme of research to investigate the nature and extent of (un)reliability within the qualifications, examinations and assessments that it regulated. It was particularly concerned to improve understanding of, and confidence in, this technically complex and politically sensitive phenomenon. The following article presents an account of this programme, from the perspective of one of its initiators, the author. It describes: the context prior to the programme, where little information on (un)reliability was routinely available to the public; the rationale for the programme, in terms of the tension between improving public understanding and the concomitant threat of decreasing public confidence; and ways in which aspects of the programme were constructed through media reports. It concludes with lessons learned from running the programme and with an extended discussion of the challenge of talking about reliability and error.
Acknowledgements
This paper was produced with the support of my employer, Cambridge Assessment, although the views expressed are entirely my own. I would like to thank Andrew Boyle, Isabel Nisbet, Dennis Opposs and John Gardner for very helpful comments on earlier drafts.
Notes
1. While the term reliability has a narrow technical meaning which is generally accepted, there is no generally accepted term to express the overall error that has been described here as measurement inaccuracy. In Newton (2005a), I distinguished ‘measurement inaccuracy’ (i.e. error as the difference between correct and incorrect assessment results) from ‘human error’ (i.e. error as the violation of an assessment procedure). I described both as aspects of ‘assessment error’. There is no clear relationship between measurement inaccuracy and human error, since measurement inaccuracy can (and will) arise in the absence of human error, and human error may not actually result in measurement inaccuracy.
2. This requires the translation of reliability statistics into classification accuracy statistics, with the proviso that these are likely to be underestimates.