
Fair high-stakes assessment in the long shadow of Covid-19

Isabel Nisbet & Stuart Shaw
Pages 518-533 | Received 15 Nov 2021, Accepted 11 Apr 2022, Published online: 19 Apr 2022

ABSTRACT

Fairness in assessment has become increasingly topical and controversial in recent years. Assessment theoreticians are writing more about fairness and assessment practitioners have developed processes and good practice to minimise unfairness. There is also increased scrutiny by students, parents and the wider public – not only of the fairness of assessments themselves and their outcomes, but of their use, notably for selection for college or university. This is in a context of continued awareness of inequalities in society and their impact on education and assessment. And on top of all these questions has been the impact – and the continuing long shadow – of Covid. Can there be fair assessment in such an unfair world?

We consider three types of challenge to fair assessment:

• Theoretical challenges

• Challenges from thinking about social justice

• Challenges from the way that statistics were used to award assessment outcomes in 2020 (particularly in England)

Introduction

In recent years, fairness in testing and assessment has been debated increasingly in many countries and across different contexts. This discussion has been particularly targeted at assessments (and qualifications) which are ‘high stakes’, by which we mean that the outcomes matter greatly to test-takers and may determine aspects of their future lives. These include entry tests for selective schools, national examinations taken at the end of secondary education (or high school) and qualifications required for entry to a profession. In the United Kingdom, the debate was sharpened by reactions to modifications made to the system for awarding grades in 2020 and 2021 when, because of Covid-19, it was judged to be impossible to hold the traditional terminal written examinations.

This article starts with a brief conceptual account of fairness, distinguishing several senses of ‘fair’ that are too often confused and identifying some senses that are wrongly overlooked. Two underlying concepts – equality (in relevant respects) and deserved outcome – constitute philosophical bases for an understanding of fairness. Next, we provide a brief account of the current received view of fairness held by the international assessment community. We then consider three contemporary challenges to fair high-stakes assessment in light of the Covid experience:

  1. Theoretical challenges (including challenges from assessment theory)

  2. Challenges from thinking about social justice

  3. Challenges from the way that statistics were used to award assessment outcomes in 2020, particularly in England.

Our main conclusion is that the concept of high stakes assessment fairness developed by the assessment community has been too narrow, derived from reflection on other (internal) assessment values, notably validity. We see fairness as essentially a wider concept, derived from views about society, and we argue that it should be applied – from outside – to assessment. We see fair assessment as a ‘connected’ concept, applied to assessments as events in time in a social context.

In this discussion we are not depicting fairness and unfairness as binary concepts, where something is either fair or unfair (Nisbet & Shaw, Citation2020, p. 154). Fairness should be perceived as being located on a continuum, in the same way as we think of the concept of ‘health’. It would be wholly unrealistic to expect any assessment to be ‘absolutely’ fair. It would, however, be reasonable to expect an assessment to be made fairer, and to look for progress along the continuum of fairness.Footnote1

There are two limitations to the scope of this article which we wish to mention at the outset. The first is that the assessments discussed here are largely educational assessments, and many of the examples cited refer to examinations and tests taken towards the end of school education. We refer in passing to assessments for professional accreditation, but we do not do justice here to fairness issues in vocational and professional contexts, or to selection for employment and promotion in work contexts. The second limitation is geographical: the authors are both based in the UK, and our thinking has largely been influenced by developments there, as well as by discussions and publications from the USA. Almost all the sources cited are in the English language.

This article offers an updated and extended view of assessment fairness, including some reflections on the Covid experience. Fair assessment is a fundamental value for assessment professionals, but also for students, teachers, parents and the wider public. For assessments to command public and professional confidence they must be seen to be fair. The issues discussed here are therefore vitally important for the future of assessment.

Background: the debate about fair assessment

The theoretical profile of fairness has grown within the assessment community and among its academic commentators. Arguably fairness now shares the podium with the fundamental values of validity and reliability (Messick, Citation1989; Worrell, Citation2016). The 2014 edition of the Standards for Educational and Psychological TestingFootnote2 has a leading chapter on fairness (American Educational Research Association, American Psychological Association and National Council on Measurement in Education [AERA, APA, & NCME], Citation2014) and there appears to be a developing theoretical consensus on what constitutes fairness in assessment (though the consensus is not without its detractors; see Nisbet & Shaw, Citation2019).

But the fairness debate is not confined to assessment ‘insiders’: students and their parents often complain about unfair national or state examinations, and there are ongoing scandals about cheating by teachers marking tests. The use of entry tests or examination outcomes as a basis for selection for post-school education has been challenged as unfair, notably in parts of the USA, where increasingly, for reasons of fairness, universities and colleges are prioritising student selection based on a review of a portfolio of information which may include student statements, letters of recommendation, transcripts and admissions tests (Maruyama, Citation2012; Wiley et al., Citation2010). There is also active debate in the UK about fair selection for university.Footnote3 And these exchanges are taking place in social, political and cultural contexts which are seen by many as unfair, raising the question of whether there can be such a thing as a fair test in such an unfair world.

Education lies at the heart of much contemporary discussion of (in)justice and (in)equality in society, and those debates have important implications for educational assessment. There are early indications from research that systemic inequalities in many countries have been exacerbated by the experience of the Covid pandemic, notably by the differential losses of learning experienced by students (Dorn et al., Citation2020; EPI/Nuffield, Citation2020). Heightened awareness of – and concern about – these persisting inequalities is an important part of the context for our discussion in this article of challenges to fairness from thinking about social justice.

Concerns about inequality and undeserved assessment outcomes came to the fore in the reactions to some alternative approaches taken to high-stakes assessments in 2020 and 2021. The third of the challenges to fair assessment which we consider in this article is based on the experience in the UK of developing proposals to make greater use of statistics in awarding grades for national qualifications, when the traditional written examinations were deemed impossible because of Covid. Initial plans in 2020 were to award grades based on teachers’ recommendations but modified by the application of a statistical model, based on past outcomes. However, the grades awarded by this approach were widely perceived as unfair, and this use of ‘algorithms’ was eventually abandoned. Broadly similar arguments were voiced in all four UK jurisdictions, but we have primarily referred to discussions in England.

Philosophical bases for fairness – equality and desert

We have argued elsewhere that philosophical accounts of fairness have largely been based on two fundamental concepts – equality and desert. We have distinguished a range of senses of ‘fair’ that can be applied to assessment: of these, perhaps the most important are a relational sense (treating like cases alike) – which is a form of equality – and a retributive sense, where a fair outcome is an appropriate reward or penalty for what has gone before, and hence is deserved (Nisbet & Shaw, Citation2019, Citation2020; Shaw & Nisbet, Citation2021). The relational sense of fairness lies at the heart of the received view of assessment fairness. However, the retributive sense must not be ignored. The concept of desert is often associated with individuals (each candidate gets the grade he or she deserves), rather than with groups, and is linked to the sense of ‘fair’ in which a fair outcome is what those affected can legitimately expect.

The balance between the concepts of equality and desert can be understood using two different perspectives. Focusing on the test itself, a fair assessment accurately measures the relevant knowledge, understanding or skill of the test taker, and discussion of fairness in this sense often uses the language of ‘accuracy’ and ‘reliability’. A broader perspective sees the assessment in context, with a fair assessment outcome seen as ‘deserved’ because of the effort that the student has made or because it matches some other evidence of the student’s ability.

Received view of fair assessment

The received view of fairness by assessment professionals and academics, reflected in authoritative documents such as the Standards for Educational and Psychological Testing (hereafter Standards), sees fairness as an absence of unfairness, with unfairness shown by construct-irrelevant variance in assessment outcomes. Messick (Citation1998) characterised fairness as comparable validity for relevant, identifiable populations of interest. It is implicit, though often not explicit, in this approach that the relevant populations (to whom the assessment should be fair) are sub-groups of test-takers.

The received view overly focusses on groups of candidates (as opposed to individuals). In addition, the consensus focusses almost entirely on relational fairness and arguably does not do justice to the retributive senses of ‘fair’ or the importance of legitimate expectations. Also, discussions of relational fairness often fail to ask why it matters. There are certainly contexts where it is undeniably morally important to treat candidates equally. An example often cited is highly competitive entry to selective university courses. However, the importance of relational (un)fairness is less persuasive if there is no competitive access to limited goods – for example, if the main purpose of assessments is to provide rich information about individual students to inform the next stage of their education.

Theoretical challenges

‘Conditional fairness’

Prompted by an increasing awareness and understanding of more diverse populations of test-takers, Mislevy et al. (Citation2013) and Mislevy (Citation2018) have advanced a ‘conditional’ sense of fairness which they apply to educational measurement. ‘Conditional fairness’ is expressed through a cognitive diagnosis model which combines an Item Response Theory model with a binary skills model. They challenge the notion that fair assessment requires eliciting performances under identical surface procedures (the exact same assessment design, question paper development, evaluation criteria, administration and marking). In Mislevy’s view, identical assessment tasks may not always provide the same information about the knowledge and skills of all candidates.
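
For readers less familiar with this machinery, the sketch below gives a minimal, purely illustrative rendering of one common cognitive diagnosis formulation – a DINA-type binary skills model with a simple IRT-style link for skill mastery. It is offered only to make concrete the idea of modelling responses conditionally on what each test-taker brings to the task; it is not a description of Mislevy et al.’s specific model, and all parameter names and values are invented for illustration.

```python
from math import exp

def p_skill_mastery(theta, a, b):
    """IRT-style (2PL logistic) link: the probability that a test-taker with
    overall proficiency theta has mastered a given skill. Illustrative only."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def p_correct(alphas, q_row, slip, guess):
    """DINA-type item response: a candidate who has mastered every skill the
    item requires answers correctly unless they 'slip'; a candidate missing a
    required skill can only 'guess'."""
    has_required = all(a == 1 for a, q in zip(alphas, q_row) if q == 1)
    return (1 - slip) if has_required else guess

# Toy example: an item requiring the first and third of three binary skills.
alphas = [1, 0, 1]   # skills this candidate has mastered
q_row = [1, 0, 1]    # skills the item requires (a Q-matrix row)
print(p_skill_mastery(theta=0.5, a=1.2, b=0.0))       # ~0.65
print(p_correct(alphas, q_row, slip=0.1, guess=0.2))  # 0.9
```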

By way of an alternative approach, Mislevy (Citation2018) proposes ‘conditional fairness’ as a reasoned foundation for adapting to the interest and previous knowledge of the test-taker. The ‘universal design’ of assessment tasks may make the tasks less comparable across individual test-takers, but has the ultimate objective of generating evidence that is in fact more comparable. Mislevy argues that performance of a task must be understood taking into account the person, background, tasks, settings and situations. There is, therefore, a need to locate the ‘assessment argument’ (the tasks, evidence and inferences drawn from the assessment outcomes) in context. Variations are not (statistically) random differences, but should be taken into account from the outset in the assessment argument, with the theoretical structure for ‘universal design’ accommodating differences from the outset, not just ‘retro-fitting’ a standard design where needed.

We welcome the theoretical challenge of Mislevy’s notion of ‘conditional fairness’, and his insight that when judgements of fairness are at stake, the assessment argument (tasks – evidence – inferences) must be seen in context. The detailed revised structure that Mislevy has developed for the assessment argument is an important addition to theoretical thinking about fair assessment. We agree that in many contexts the principles of universal design can support fairer assessment and that varied tasks, informed by universal design, can lead to equivalent outcomes. The insights and good practices which constitute universal design and which serve to maximise accessibility for all test-takers (regardless of construct-irrelevant characteristics) are very necessary and useful, although universal design does not obviate the need to conduct a raft of carefully conceived validity studies.

However, from the perspective of fairness, there are contexts in which Mislevy’s argument can be extended further. Is equivalence always required for fairness? Assessment arrangements which have had to be adapted because of Covid have led to outcomes which are arguably not equivalent for students in different settings and with different experience of loss of learning and personal hardship. Nevertheless, the adapted assessment arrangements may be fair in the sense that they have given as many students as possible an opportunity to show what they can do. And, beyond the special circumstances of Covid, there are arguably many purposes for assessment where equivalence of consideration for all test takers is not the highest priority. For example, in countries where the norm is for students to proceed from high school to the local university or college, it may be more important to obtain relevant information about the learning of each student than to apply equivalent criteria to each and every one of them.

Views of validity

A further theoretical challenge is to relate our thinking about fairness to developing views of validity. To what extent is fairness a condition for validity? Is fairness a subset of validity? Can a test be valid but not fair – or unfair but valid?

Whilst these are all perfectly legitimate questions to ask, the answers will depend on how we conceptualise the two concepts, and perhaps more importantly, on how broadly (or narrowly) we define each of them (Kane, Citation2010; Newton & Shaw, Citation2014; Nisbet & Shaw, Citation2019). We have argued elsewhere that fairness is a necessary but not sufficient condition for validity (Nisbet & Shaw, Citation2020). This would mean that in order to be valid, a test must be fair. This view would see fairness as an essential requirement for validity – not just one of a series of optionsFootnote4 – but would hold that validity is more than fairness, and that a fair test may not be valid. This argument is contingent on how validity is conceptualised. At any rate, validity and fairness are closely interrelated, and both are essential to public confidence in tests and their outcomes (even if the language of fairness is more familiar in public discourse than is that of validity). We need to provide a language in which such conflicting requirements of assessment can be expressed and analysed.

The notion of validity has evolved gradually from disparate and contested origins to a point where there is now strong professional consensus over a precise, technical meaning. In contrast, references by assessment theorists to fairness (appearing much later in the educational and psychological testing narrative) have grown out of a technical formulation of validity almost as a ‘by-product’. Validity refers to how test scores are both understood (in light of the constructs the tests purport to elicit) and used (for making decisions). The close perceived relationship between fairness and validity is exemplified in the definition of fairness provided in the 2014 edition of the Standards: ‘The validity of test score interpretations for intended use(s) for individuals from all relevant subgroups’ (AERA, APA, NCME, Citation2014, p. 219). Fairness (as instantiated and developed through later editions of the Standards and other such documents) could be described as an offspring of validity. Indeed, mainstream assessment theoreticians now understand fairness as an ‘issue’ of validity.

In contrast, we see fairness as essentially a wider concept, applied to many aspects of society and human activities, and understood and used by people from many backgrounds. When fair assessment is discussed, we suggest that fairness should be seen as a concept which has come into assessment discourse from outside – from the wider world, where assessments take place.

This direction of thought, in the context of the pandemic experience which has required changes and adaptations to assessments, may also shed new light on our thinking about validity itself. Three possible approaches to validity can be distinguished. They are:

  (a) An inside approach to validity, focussing on the test instrument, that is, on the coherence of the structure and content of the assessment for its intended, specific purpose(s)

  (b) An inside/outside (relational) approach to validity, looking at the relation between the test and some real-life contextual factors both before the test (for example, opportunity to learn) and after the test (validity of use)

  (c) An approach to validity from outside, founded on a consideration of an assessment as a real-world event in a real context and deriving conclusions about the assessment. Using this approach, a test can be valid in some contexts and invalid in others.

We offer these three approaches for further analysis and consideration. The arguments in the rest of this article support a move from (a) towards (b) and (c).

The challenges from thinking about social justice

The second set of challenges to thinking about fairness in assessment comes from consideration of social justice. In the English language ‘justice’ is normally considered a wider concept than ‘fairness’. In the context of assessment, talk of ‘social justice’ widens the focus of consideration to the context in which assessments are taken – including events before and after the assessment itself and social factors which may influence the assessment outcomes. We have written elsewhere commending such a ‘situated’ view of assessment (Nisbet & Shaw, Citation2020, p. 155), and it is consistent with the wider view of validity which we have advocated in the previous section, as well as with the theoretical insights of Mislevy’s account of ‘conditional fairness’. The situated view considers assessments in the contexts in which they are taken and the outcomes are used. In the situated view, ‘an assessment is considered as an event at a point in place and time, taking into account what else we know about what happened before and after, who the candidates were, and so on’ (Nisbet & Shaw, Citation2020, pp. 106–7).

A contemporary challenge from social justice is the manifest injustice of continuing inequality of assessment outcomes. As Putnam has vividly illustrated, different kinds of disadvantage are increasingly ‘clustering’ (Putnam, Citation2015) and the widening gap between multiple social advantage and disadvantage is clearly reflected in differential educational achievement. Those concerns have increased because of the differential impact of the Covid pandemic within and between countries across the world.

One cause of social injustice, referring to events before an assessment is taken, is differential opportunity to learn. If some students have not been taught the content being assessed, or have been taught it badly, they will be disadvantaged in the assessment, compared to other students who have been taught the relevant content well. This seems a clear-cut example of relational unfairness. It is also unfair in the retributive sense as candidates who have not had the opportunity to learn the content being assessed and who do badly in the test will not deserve their low mark.Footnote5

In a study for McKinsey, Dorn et al. (Citation2020) have charted the differential loss of learning by US students as a result of Covid, using a range of epidemiological scenarios. In all of the scenarios, low-income, black and Hispanic students are likely to have lost more learning than better-off and white students and these are ‘not likely to be temporary shocks easily erased in the next academic year’ (Dorn et al., Citation2020, p. 6). In England, the Education Policy Institute reports that ‘prior to the pandemic, disadvantaged pupils were already 18 months of learning behind their more affluent peers by the age of 16. The pandemic has now exacerbated this education gap, undoing a significant amount of the progress made in closing it over the last two decades’ (Education Policy Institute, Citation2021; see also House of Commons, Citation2021). Across the world, events of 2020/21 have sharpened concerns about educational inequality.

It seems clear that such inequalities of educational input are unjust. But it is more difficult to pursue the argument that equality of educational input – giving all students the same amount and quality of teaching and learning opportunities – would necessarily always be just, even if it were achievable. It would be open to at least two objections. The first is that such a policy could encourage ‘levelling down’ – reducing the level of opportunities for the advantaged for the sake of matching the lower level available for the disadvantaged. The second objection is that there may be a conflict with the freedom of parents who value education and have the resources to put more into their children’s education than other parents can or may wish to (Satz, Citation2007).

In our perception the perceived conflict with freedom, described by Satz as a ‘deep tension’, is more acutely felt in the US than in Europe. This may reflect the high salience of liberty in US public discourse, including consciousness of a tension between Federal and State authority. However, the potential tension is there for all countries, and the contemporary challenge to fair assessment is clear: What approach to education – and educational assessment – would reconcile the demands of equality and freedom and be more socially just than the current situation?

The response of Elizabeth Anderson (and others) to this challenge has been to advocate an ‘adequacy’ standard of education for all (Anderson, Citation1999, Citation2007; Anderson & White, Citation2019; Satz, Citation2007). All have an entitlement to education to the standard of ‘adequacy’ but variation above that level is permitted. The ‘adequacy’ standard is defined in relation to Anderson’s ideas of ‘democratic equality’, where all are able to play a full part in democratic society. And Satz (Citation2007) claimed that in the US system this required ‘everyone with the potential [to have] the skills needed for college’ (p. 638).

We do not have space here to do justice to the richness of this approach, which we have discussed at greater length elsewhere (Nisbet & Shaw, Citation2020, pp. 130−132). In summary, there are several potential objections, the most obvious being that in allowing scope for rich parents to top up their children’s education above the ‘adequacy’ standard, it appears to allow an uncomfortable amount of inequality to persist. More specifically, there is a risk that if all students are judged to be ready for entrance to college, the inequality may simply be postponed to a later stage and be reflected in differential college completion rates.

The ‘adequacy’ movement raises some particular questions regarding fair assessment. How is college-readiness to be assessed? Anderson advocates assessment which ranges beyond traditional academic subjects, but arguably even a wider assessment, including, for example, community activities and sport, may particularly benefit children from richer families (Putnam, Citation2015). In addition, the adequacy champions do not answer the question of what constitutes fair assessment of those who do not have the potential to go to college. Even in a theoretical world where all students had the same quality of teaching and support to develop their potential, there will arguably be some who cannot be made ‘college-ready’. Social justice demands a view on how they can be educated to contribute fully to society and the role of assessment in supporting such education. Perhaps the standard required for ‘adequacy’ cannot be the same for all students.

Finally, what are the implications for social justice – and for fair assessment – of the differential loss of learning by students as a result of Covid? One possible response, we suggest, is to take a more flexible view of the contributions of different stages of education to the development of students’ knowledge and skills. Some students may not have had the opportunity to become ‘college ready’ by the time they leave school, because of lost teaching time and lack of resources and support to learn at home. Colleges and universities may have to take account of that in their expectations of entrants and provide flexible opportunities for students to learn content at college that would normally be covered at school. And arrangements at national or state level for summative assessment need to be flexible enough to be used at different stages, rather than being confined to snapshots of students’ knowledge and skills taken at pre-determined points in their lives.

The challenge from the way that statistics were used to award assessment outcomes in 2020

The third challenge to fair assessment comes from proposals in the UK, and particularly in England, to make greater use of statistics in the awarding of exam grades in 2020. There is nothing new in using statistics – collections of numbers – to inform decisions on the outcomes of national, international, or state examinations. But the debate about their use in the context of Covid has raised strong feelings about unfairness, and hence has furnished a further challenge to thinking about fair assessment.

Mike Cresswell has described the process of awarding grades in graded national examinations as involving both statistics and expert judgment (Cresswell, Citation2000), and while there have been differences of view about the comparative balance between the two factors, many have agreed with Cresswell that there is a need for quantitative information – statistics – to help apply a standard to qualitatively different attainments at different times and in different subjects (see p. 73). In England the regulator, Ofqual, sets out a process for establishing grade boundaries in national examinations through applying examiner judgment (on the quality of the candidates’ papers) informed by statistical information about past trends (Ofqual, Citation2016, p. 6).

In 2020 and 2021 several countries and international education organisations, including the four jurisdictions of the United Kingdom, cancelled examinations because of Covid. In the UK, governments tasked the regulators with developing processes for awarding grades that maintained standards from previous years and could be applied in the same way across the relevant country. Although the approaches were not identical across the four parts of the UK, there was broad similarity in proposing that grades estimated by teachers should be modified by applying a statistical model based on historical information about schools and individuals and rank orderings by teachers for each subject. Possible models were consulted on and extensively tested, but when the selected model was applied there were loud cries of ‘unfair’ from students – and their parents – for whom application of the statistical model meant that they received a lower grade than their teacher had recommended. Public anger was palpable: distraught students protested outside government buildings and they and their parents flooded the media. One by one the UK countries abandoned the use of the statistical models, and famously the UK Prime Minister remarked to some students ‘I am afraid your grades were almost derailed by a mutant algorithm and I know how stressful that must have been’.Footnote6

There can be discussion about which statistical models should be described as ‘algorithms’. We have been guided by the (UK) Office for Statistics Regulation (OSR), which reported on statistical aspects of the public disquiet about exams:

Modelled relationships between variables can be used to estimate the unknown value of one (or more) of those variables (for example, an exam grade) from known values for other variables (e.g., past performance). When cast in the form of a sequence of calculation instructions, this process is called an algorithm for estimating the unknown value (Office for Statistics Regulation, Citation2021, p. 19).

OSR have observed that there is some overlap in the use of ‘statistical model’ and ‘algorithm’, but conclude that in the exams context the term ‘algorithm’ can be applied (ibid, p. 18). The use of that term in the context of exams may have heightened the public disquiet about its use, but, as we argue later in this section, the best response to that is to learn from wider good practice in the use of algorithms, rather than to try to avoid using the term.
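
To make this definition concrete, the following is a deliberately simplified sketch of an ‘algorithm’ in the OSR’s sense: a short sequence of calculation instructions that turns known values (a centre’s historical grade distribution and a teacher’s rank order of candidates) into estimates of the unknown values (each candidate’s grade). It illustrates the general idea only; it is not the model developed by Ofqual or any other UK regulator, and the grade shares and names used are invented.

```python
def award_grades(rank_order, historical_distribution):
    """Toy grade-awarding 'algorithm': allocate grades so that this year's
    distribution for the centre matches its historical one, filling grades
    from the top of the teacher's rank order downwards. Illustrative only."""
    n = len(rank_order)
    awarded = {}
    i = 0
    for grade, share in historical_distribution:  # listed best grade first
        n_grade = round(share * n)
        for candidate in rank_order[i:i + n_grade]:
            awarded[candidate] = grade
        i += n_grade
    for candidate in rank_order[i:]:  # any rounding remainder gets the lowest grade
        awarded[candidate] = historical_distribution[-1][0]
    return awarded

# A centre that historically awarded 30% A, 50% B and 20% C, with ten
# candidates ranked by their teacher from strongest to weakest.
rank_order = [f"candidate_{i}" for i in range(1, 11)]
history = [("A", 0.3), ("B", 0.5), ("C", 0.2)]
print(award_grades(rank_order, history))
```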

In Norway, the Data Protection Authority challenged the International Baccalaureate Organisation on the grounds that their use of a statistical model to award outcomes to individual students was in breach of the fairness requirement set out in European data protection legislation ‘as the awarding of grades should be based on [individual students’] demonstrable academic achievements and not on historical data relating to the academic achievements of other students in the past’.Footnote7 The case was eventually closed, apparently because of issues concerning competency.Footnote8 But the challenge remained, namely that the application of a statistical model to award marks or grades to individual students was unfair.

And in the UK, when thoughts turned to planning for estimating grades in 2021, it was believed that the public would not stomach any use of an ‘algorithm’. Indeed, in January 2021, the Secretary of State for Education (in England) announced to Parliament ‘This year we will put our trust in teachers rather than algorithms’.Footnote9

The Ada Lovelace Institute defines ‘algorithm’ as ‘a series of steps through which particular inputs can be turned into outputs’ (Citation2021, p. 10) and ‘algorithmic system’ as ‘a system that uses one or more algorithms, usually as part of computer software, to produce outputs that can be used for making decisions’ (Citation2021, p. 10). They offer a functional definition of ‘algorithmic system’ as ‘a system that uses automated reasoning to aid or replace a decision-making process that would otherwise be performed by humans’ (Ada Lovelace Institute, Citation2021, p. 4, italics added). The use of algorithmic systems to support decision-making in public services – for example, in urban planning, welfare services and law enforcement – is increasingly common (Ada Lovelace Institute, Citation2021). The ability of Artificial Intelligence to use and combine unprecedentedly large amounts of information should enable decisions to be based on more evidence than ever before, which many would see as a public good. However, controversy about the use of algorithms is familiar – see, for example, the discussion of the use of algorithmic systems in policing in Moses and Chan (Citation2018). And what became known in the UK as ‘the exams fiasco’ in 2020 caused the statistics regulator to worry that assessment had given algorithms a bad name, and would make it more difficult for their use to command public confidence in future (Office for Statistics Regulation, Citation2021).

We shall confine our discussion here to the challenge to fair educational assessment from the attempted use of algorithmic systems in 2020 in England. As we have remarked, there was nothing new in using statistics to inform grades (although the Office for Statistics Regulation doubted whether the public knew this (Office for Statistics Regulation, Citation2021, p. 10)). The real difference in 2020 was the lightness of evidence supporting the human side of the coalition between statistics and expert judgment. There were no scripts for examiners to read, and the only experts available to judge individual students were their teachers. Their judgment was seen as possibly unfair, because they might apply different standards from those used by other teachers and different from those applied nationally in the past. That led to greater weight being given to the statistical model.

In considering the challenges to fairness from this experience, it is important to note that the statistical models developed were designed to promote relational fairness – in the words of Ofqual (Citation2020), fairness ‘across the cohort as a whole’ (p. 9). There was an intention, as far as possible, to apply the same standards to candidates in 2020 as in previous years and to schools across the country. And in the consultations undertaken, the systems used were generally approvedFootnote10 as achieving that, as far as was possible.

In our view there are three challenges to fair assessment arising from the use of statistics in the UK in 2020. The first is based on our account of the philosophical bases of fairness resting in the concepts of equality and desert. Those who commissioned and developed the algorithms were explicitly seeking to achieve relational fairness (a kind of equality), but the importance of individual desert for fairness was overlooked. The same members of the public who might respond favourably in a focus group discussion about a system aiming to apply the same standards to all schools might subsequently complain of unfairness if their own son or daughter had the grade recommended by their teacher changed by ‘the algorithm’ without reference to the work that the individual student had done. When the proposed grades were announced, the perspective on fairness adopted by students and their parents – and by teachers, on behalf of their students – emphasised individual desert rather than relational fairness across groups. The first challenge is, therefore, that systems using statistics to determine – or inform – assessment outcomes must consider both relational fairness (equality) and retributive fairness (desert). It is very dangerous for assessment professionals to ignore either aspect.

The second challenge is linked to the first: that fair assessment requires consideration of individuals as well as groups. We have criticised the prevailing assumption in the world of assessment that fairness applies only, or mainly, to groups rather than individuals (Nisbet & Shaw, Citation2020, pp. 20–23). Our theoretical objections to this view were backed up by public opinion in 2020. The qualifications regulators in the UK clearly appreciated that their statistical models could not guarantee to give every student the grade that their work deserved. Their approach was to be as (relationally) fair as possible to identifiable groups in the model used, and to rely on the appeals process to resolve any residual unfairness to individuals. This could be – and was – criticised as stressful and difficult for the aggrieved individuals, and arguably it was not possible for a manageable appeals process to cater for all the concerns voiced by individuals. In the words of the Independent Review of the process in Wales in summer 2020, ‘there was a failure to put in place a fair and workable appeals process in 2020 that would deal with the known inabilities of the statistical processes to give a fair outcome to every learner’ (Welsh Government, Citation2020, p. 19). We accept that in the absence of exams, particularly in a system relying heavily on terminal assessment, it was going to be difficult to satisfy the ‘individual desert’ wing of fairness. But at the very least that should have been identified at the outset as a significant limitation. The Office for Statistics Regulation commented:

[A] high level of confidence was placed in the ability of statistical models to predict a single grade for an individual on each course whilst also maintaining national standards and not disadvantaging any groups. The limitations of statistical models were not fully communicated. (Office for Statistics Regulation, Citation2021, p. 9)

The third challenge from the use of statistics is a challenge to the parochialism of much policy-making in educational assessment. As we have commented, the use of statistical models for real-life public policy interventions is common in many sectors, and many countries and international organisations have codes of practice and guidance for their use (in the UK, for example, see, Office for Statistics Regulation, Citation2018). Educational policy-makers should refer to, and apply, guidance and principles governing the use of algorithms in other public services. We particularly commend the guidance by the Centre for Data Ethics and Innovation which includes the principle that those who use algorithmic decision-making tools should ‘consider … carefully whether individuals will be fairly treated by the decision-making process that the tool forms part of’ (Centre for Data Ethics and Innovation, Citation2020, p. 10).

Finally, there is a fairness problem underlying the disputes about use of algorithmic systems to determine examination grades. It can be illustrated by a simplified example: Let us assume that a teacher judges that ten of her students have produced work of a ‘grade A standard’ and have a strong probability – say 80% – of achieving an ‘A’ grade in the exam. All the students are told this and they expect to gain an ‘A’ grade. However, in ‘normal’ times, 2 out of the ten would not perform well on the day of the exam and would get a ‘B’. In the absence of an exam, the algorithm awards an ‘A’ to the top eight in the group (according to the teacher’s rank ordering) and a ‘B’ to the bottom two. But the students who are awarded a ‘B’ – and their parents – think it is very unfair that their teacher’s estimation of the value of their work has been altered by an algorithm. The question arising from this example, and which we invite the reader to consider, is which is the fair grade to award to the bottom two. The ‘normal’ exam system would have awarded them a ‘B’. If readers judge that an ‘A’ would be a fairer grade (in a retributive sense) as it reflects a human judgment on the individual student’s work, then the challenge from statistics may be to raise questions about the fairness of a system reliant on terminal exams in ‘normal’ times.
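
The contrast at the heart of this example can be made concrete with a small simulation – a sketch of the simplified scenario above only, not of any real awarding model. Under a ‘normal’ exam, which two of the ten receive a ‘B’ is partly a matter of chance on the day; under the simplified algorithm, the ‘B’s fall deterministically on the bottom two in the teacher’s rank order.

```python
import random

def exam_outcomes(n_students=10, p_a=0.8, seed=None):
    """Simulate a 'normal' exam: each grade-A-standard student independently
    has probability p_a of performing to an 'A' grade on the day."""
    rng = random.Random(seed)
    return ["A" if rng.random() < p_a else "B" for _ in range(n_students)]

def algorithm_outcomes(rank_order, n_a=8):
    """The simplified 'algorithm' of the example: an 'A' for the top n_a
    students in the teacher's rank order, a 'B' for the rest."""
    return {student: ("A" if i < n_a else "B")
            for i, student in enumerate(rank_order)}

rank_order = [f"student_{i}" for i in range(1, 11)]  # teacher's ranking, best first
print(exam_outcomes(seed=1))            # which students get a 'B' varies by run
print(algorithm_outcomes(rank_order))   # the bottom two always get a 'B'
```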

Conclusions

Here and elsewhere (Nisbet & Shaw, Citation2019, Citation2020; Shaw & Nisbet, Citation2021) we have distinguished different meanings of ‘fair’, as applied to assessment, but the philosophical roots are in concepts of equality (of a kind) and desert. Both are essential for fair assessment and it is a major mistake to ignore one in favour of the other. The extraordinary experience of living in a pandemic has heightened the challenges to fair assessment discussed here.

The theoretical challenges which we considered included Mislevy’s account of ‘conditional fairness’. We believe that this could be carried further, to the extent of questioning whether equivalence is always required for fairness. We also discussed challenges from developing views of validity and we suggested a wider view of validity than is reflected in the literature to date. In our view the account of fairness developed by the assessment theorists has been too narrow, largely derived from validity. In contrast, our account of fairness, as applied to assessment, sees it as a wider concept, based on views about society and human activities, brought in from the outside world. Fair assessment should be seen as a connected concept, applying societal concepts to assessment events in time and in a context.

The second group of challenges derived from concerns about injustice and inequality in society. We discussed Elizabeth Anderson’s account of ‘adequacy’ as a fair educational standard to be applied to all. The problems with differential opportunity to learn, exacerbated by the pandemic, mean that colleges and universities may have to be flexible in their expectations of entrants and allow opportunities for students to catch up on learning that would normally be covered at school.

Finally, we discussed the challenge from the use of statistics to determine assessment outcomes, without the normal partnership with human judgment on individual scripts. Human beings and statistics need each otherFootnote11 and educational policy-makers can learn from guidance and principles governing the use of statistics in other contexts. It is essential that systems for using statistics to determine – or inform – assessment outcomes consider both relational fairness (equality) and retributive fairness (desert). The much-criticised ‘algorithms’ proposed for use in the UK countries in 2020, and subsequently abandoned, neglected individual fairness.

The debates about fair assessment differ across countries in their tone and focus, but a common element is acceptance that fairness – in all its aspects – must be considered at every stage of the design, delivery and marking of high-stakes assessments. At the time of writing, the debate continues about the fairest way to approach the grading of exams in 2023 and beyond, with the English regulator favouring a ‘staged’ move back to the grades awarded in 2019 and before, which were less generous than those awarded by schools in 2020 and 2021. They have justified this approach in terms of fairness to students balanced with maintaining confidence in grades awarded. ‘Grades will be based on how students have performed in exams’ – in other words, they will be retributively fair to individuals – ‘they will be meaningful and can be trusted by universities, colleges and employers’ (Ofqual, Citation2021).

In contrast, the Covid-induced experience of having to use alternative sources of evidence for assessment and different modes of assessment has prompted wider reflection by assessment professionals and academics about lessons for the future. An example in England is the work of the Independent Assessment Commission, using the strapline ‘Equitable(sic), Reliable Assessment’ (Independent Assessment Commission, Citation2021). The language of fairness is being used in these exchanges, though there is still much debate to be had before a consensus can be reached, even in one part of one country, on what constitutes fair high-stakes assessment in the future. In this article we have sought to use an analysis based on fairness to shine a light on principles and good practice that can be applied over the longer term. The challenges which are discussed here, together with the experience of assessing during and after a pandemic, should extend and update thinking about assessment fairness in theory and practice. We hope that this article will contribute to fairer assessment after Covid than before.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Isabel Nisbet

Isabel Nisbet’s academic background is in philosophy. Her career has been mainly in government and regulation: in 2008 she led the establishment of Ofqual, the independent regulator of examinations and qualifications in England, and she was its first CEO. From 2011 to 2014 Isabel represented Cambridge International Examinations in SE Asia, based in Singapore. She is currently an Affiliated Lecturer at the Faculty of Education, University of Cambridge.

Stuart Shaw

Stuart Shaw worked for Cambridge Assessment for more than twenty years where he was particularly interested in demonstrating how Cambridge Assessment sought to meet the demands of validity in its assessments. He has a wide range of publications in English second language assessment and educational assessment research. Assessment books include: Examining Writing: Research and practice in assessing second language writing (Shaw & Weir, 2007); The IELTS Writing Assessment Revision Project: towards a revised rating scale (Shaw & Falvey, 2008), Validity in Educational and Psychological Assessment (Newton & Shaw, 2014), Language Rich: Insights from Multilingual Schools (Shaw, Imam & Hughes, 2015), and Is assessment fair? (Nisbet & Shaw, 2020). He is currently an Affiliated Lecturer at the Faculty of Education, University of Cambridge.

Notes

1. According to the fifth edition of the Standards for Educational and Psychological Testing, ‘Absolute fairness to every examinee is impossible to attain, if for no other reasons than the facts that tests have imperfect reliability and that validity in any particular context is a matter of degree. But neither is any alternative selection or evaluation mechanism perfectly fair.’ (AERA, APA, & NCME, 1999, p. 73).

2. Jointly sponsored by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education.

3. For example, in a public lecture on 28 October 2021 by the Vice Chancellor of the University of Hertfordshire, available at https://www.youtube.com/watch?v=lGPfRBN2nTk

4. We acknowledge, however, that valid tests could fail to meet some of the wider senses of ‘fair’ – for example, they might not meet ‘legitimate expectations’ derived from undertakings given about the content of the test, or the outcomes might be unequal because of historic injustices affecting test-takers.

5. Arguably, there is also a link to construct validity, but we are confining the discussion at this point to considerations of fairness.

6. BBC News, 26 August 2020, accessed on 31 October 2021 at A-levels and GCSEs: Boris Johnson blames ‘mutant algorithm’ for exam fiasco – BBC News

7. See The Norwegian data protection authority threatens to impose legal measures against the international baccalaureate organisation | EuroCloud Europe

8. see BY THE NORWEGIAN DATA PROTECTION AUTHORITY: it closes the case IB – PRIVACY365 | EUROPE

9. Hansard, House of Commons, 6 January 2021, col 764

10. With the exception of objections on several fronts from the Royal Statistical Society (Citation2020).

11. In developing this line of argument we have drawn on discussion at a seminar on 2 March 2021, in a series entitled ‘Politics, Policy and the Algorithm’, organised by Open Data Manchester (https://opendatamanchester.org.uk).

References

  • Ada Lovelace Institute, AI Now Institute and Open Government Partnership. (2021, August 24). Algorithmic accountability for the public sector: Learning from the first wave of policy implementation. Retrieved October 31, 2021,from https://www.adalovelaceinstitute.org/report/algorithmic-accountability-public-sector/
  • American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for educational and psychological testing.
  • Anderson, E. (1999). What is the point of equality? Ethics, 109(2), 287–337. https://doi.org/10.1086/233897
  • Anderson, E. (2007). Fair opportunity in education: A democratic equality perspective. Ethics, 117(4), 595–622. https://doi.org/10.1086/518806
  • Anderson, E., & White, J. (2019). Elizabeth Anderson interviewed by John White. Journal of Philosophy of Education, 53(1), 5–20. https://doi.org/10.1111/1467-9752.12336
  • Centre for Data Ethics and Innovation. (2020). Independent Report, Review into bias in algorithmic decision-making. Published 27 November 2020. Accessed on 29 October 2021 at Review into bias in algorithmic decision-making - GOV.UK www.gov.uk
  • Cresswell, M. (2000). The role of public examinations in defining and monitoring national standards. Proceedings of the British Academy, 102, 69–120.
  • Dorn, E., Hancock, B., Sarakatsannis, J., & Viruleg, E. (2020). COVID-19 and learning loss–disparities grow and students need help. McKinsey & Co. https://www.mckinsey.com/industries/public-and-social-sector/our-insights/covid-19-andlearning-loss-disparities-grow-and-students-need-he
  • Education Policy Institute. (2021, October 21). Website introduction. In J. Andrews, T. Archer, W. Crenna-Jennings, N. Perera, & L. Sibieta (Eds.), Education recovery and resilience in England: Phase two report (pp. 1). Retrieved November 8, 2021, from https://epi.org.uk/publications-and-research/education-recovery-and-resilience-in-england-phase-two-report-october-2021/
  • EPI/Nuffield. (2020, October). Analysis: School attendance rates across the UK since full reopening. Education Policy Institute. https://epi.org.uk/publications-and-research/school-attendance-and-lost-schooling-across-england-since-full-reopening/
  • European Data Protection Board. (2020). Guidelines 4/2019 on article 25 data protection by design and by default version 2.0 Adopted on 20 October 2020. Retrieved October 31, 2021, from https://edpb.europa.eu/sites/default/files/files/file1/edpb_guidelines_201904_dataprotection_by_design_and_by_default_v2.0_en.pdf
  • House of Commons. (2021). Committee of Public Accounts. COVID-19: Support for children’s education. HC 240. Published on 26 May 2021. Accessed on 10 November 2021 at COVID-19: Education (parliament.uk)
  • Independent Assessment Commission. (2021). The future of assessment and qualifications in England: Interim report. Accessed on 24 December 2021 at Findings | New Era Assessment
  • Kane, M. (2010). Validity and fairness. Language Testing, 27(2), 177–182. https://doi.org/10.1177/0265532209349467
  • Maruyama, G. (2012). Assessing college readiness: Should we be satisfied with ACT or other threshold scores? Educational Researcher, 41(7), 252–261. https://doi.org/10.3102/0013189X12455095
  • Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–100). American Council on Education.
  • Messick, S. (1998, August). Consequences of test interpretation and use: The fusion of validity and values in psychological assessment [Paper presentation]. The International Congress of Applied Psychology, San Francisco, CA. ETS Research Report, RR-98-48. Princeton, NJ: Educational Testing Service.
  • Mislevy, R. J. (2018). Sociocognitive foundations of educational measurement. Routledge.
  • Mislevy, R. J., Haertel, G., Cheng, B. H., Ructtinger, L., DeBarger, A., Murray, E., Rose, D., Gravel, J., Colker, A. M., Rutstein, D., & Vendlinski, T. (2013). A “conditional” sense of fairness in assessment. Educational Research and Evaluation, 19(2–3), 121–140. https://doi.org/10.1080/13803611.2013.767614
  • Moses, L. B., & Chan, J. (2018). Algorithmic prediction in policing: Assumptions, evaluation, and accountability. Policing and Society, 28(7), 806–822. https://doi.org/10.1080/10439463.2016.1253695
  • Newton, P. E., & Shaw, S. D. (2014). Validity in educational and psychological assessments. Sage.
  • Nisbet, I., & Shaw, S. D. (2019). Fair assessment viewed through the lenses of measurement theory. Assessment in Education: Principles, Policy & Practice, 26(5), 612–629. https://doi.org/10.1080/0969594X.2019.1586643
  • Nisbet, I., & Shaw, S. D. (2020). Is assessment fair? SAGE Publishing.
  • Office for Statistics Regulation. (2018). Code of Practice for Statistics: Ensuring official statistics serve the public. Accessed on 31 October 2021 at Code-of-Practice-for-Statistics.pdf (statisticsauthority.gov.uk).
  • Office for Statistics Regulation. (2021). Ensuring statistical models command public confidence: Learning lessons from the approach to developing models for awarding grades in the UK in 2020, March 2021. Accessed on 29 October 2021 at Ensuring statistical models command public confidence – Office for Statistics Regulation (statisticsauthority.gov.uk)
  • Ofqual. (2016). Requirements on setting GCSE (9 to 1) grade boundaries: Consultation on Conditions, November 2016 Ofqual/16/6128. Accessed on 31 October 2021 at Requirements on setting GCSE (9 to 1) grade boundaries (publishing.service.gov.uk)
  • Ofqual. (2020). Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: Interim report. Retrieved October 31, 2021, from https://www.gov.uk/government/publications/awarding-gcse-as-a-levels-in-summer-2020-interim-report
  • Ofqual. (2021). Ofqual and Dr Jo Saxton: Ofqual’s approach to grading exams and assessments in summer 2022 and autumn 2021, September 2021. Accessed on 23 December 2021 at Ofqual’s approach to grading exams and assessments in summer 2022 and autumn 2021 - GOV.UK www.gov.uk/ofqual
  • Putnam, R. D. (2015). Our kids: The American dream in crisis. Simon & Schuster.
  • Royal Statistical Society. (2020). Letter to the Director General for Regulation, Office for Statistics Regulation. 14 August 2020. Accessed on 31 October 2021 at 14-08-2020-Letter-Deborah-Ashby-Sharon-Witherspoon-to-OSR.pdf (rss.org.uk)
  • Satz, D. (2007). Equality, adequacy and education for citizenship. Ethics, 117(4), 623–648. https://doi.org/10.1086/518805
  • Shaw, S. D., & Nisbet, I. (2021). Attitudes to fair assessment in the light of COVID-19. Research Matters: A Cambridge Assessment Publication, 31, 6–21. cambridgeassessment.org.uk/Images/research-matters-31-editorial.pdf
  • Welsh Government. (2020). Independent review of the summer 2020 arrangements to award grades, and considerations for summer 2021: final report December 2020. Accessed on 31 October 2021 at Independent review of the summer 2020 arrangements to award grades, and considerations for summer 2021: final report December 2020 | GOV.WALES
  • Wiley, A., Wyatt, J. N., & Camara, W. J. (2010). The development of an index of college readiness. The College Board.
  • Worrell, F. (2016). Commentary on perspectives in fair assessment. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 283–293). Routledge.