
Moderation of non-exam assessments: a novel approach using comparative judgement

Lucy Chambers, Sylvia Vitello & Carmen Vidal Rodeiro
Pages 32-55 | Received 23 Jun 2022, Accepted 26 Jan 2024, Published online: 15 Feb 2024

ABSTRACT

In England, some secondary-level qualifications comprise non-exam assessments which need to undergo moderation before grading. Currently, moderation is conducted at centre (school) level. This raises challenges for maintaining the standard across centres. Recent technological advances enable novel moderation methods that are no longer bound by centre. This study used simulation to investigate the feasibility of using comparative judgement (CJ) for moderating non-exam assessments. Our study explored the effects of CJ design parameters on the CJ estimates of script quality and how to assign moderator marks after the CJ procedure. The findings showed that certain design parameters had substantial effects on reliability and suggested minimum values for CJ protocols. The method used for assigning moderator marks maintained the rank order of scripts within centres and calibrated the centres to a common standard. Using CJ for moderation could transform current assessment practices, taking advantage of technological developments and ensuring reliability and fairness.

Introduction

In England, the majority of qualifications that are completed as part of secondary school education (i.e. sat at ages 16–18) are assessed via exams, which are set and marked by an awarding body. Some qualifications, however, use non-exam assessments (also known as internal assessments), which are marked by teachers. These non-exam assessments (NEAs) are used to assess different skills from those assessed by exams and often comprise portfolios, performances, or practical demonstrations of skills (see Vitello & Williamson, 2017).

The final marks for NEAs are assigned in a two-stage process, which is the same for all awarding bodies. First, all the candidate work is marked by teachers within the candidates’ centre (Note 1). Second, a moderation process is conducted by the awarding body using a submitted sample of this marked candidate work. The purpose of moderation ‘is to bring the marking of internally-assessed components in all participating centres to an agreed standard’ (JCQ, 2019a, p. 13). Globally, there are various approaches to moderation. Billington (2009) describes two main types of moderation: statistical moderation and social moderation. In England, one of the social methods of moderation, inspection (also called expert judgement moderation), is used. Expert judges, known as moderators, check samples of student work to ensure that schools have applied the marking criteria correctly. Moderators are usually teachers who have received training in moderation procedures from the awarding body they are moderating for. Depending on the qualification and awarding body, moderators either attend a centre to do this or receive the materials and carry out the moderation remotely.

The current moderation process is conducted for each centre separately. The process starts with a small sub-sample of candidates’ work (henceforth referred to as ‘scripts’) being viewed and moderated (typically six scripts). The size of the full sample submitted for moderation varies with the centre entry size, up to 15 scripts for smaller centres (<100 candidates) and up to a maximum of 25 scripts for centres with larger entries (>100 candidates). If the moderator finds problems with the centre’s marking, or they need more evidence, then this sample is extended as per awarding body guidance, until the moderator is either able to make a judgement or refers the centre to a supervisor.

Moderation focuses on two aspects of the assessment results submitted by centres: 1) the rank order of candidates and 2) the marks assigned by teachers. Its purpose is to make sure that the marks reflect the order of candidate ability within each centre and are in line with the standard across centres, ensuring valid and reliable results. Thus, two key tasks of the moderator are to check whether the centre’s rank order of the candidates’ scripts within the sample is correct, and to ascertain whether the centre’s marks for these scripts are acceptable or whether adjustments are necessary. If the rank order is deemed correct then the moderator submits their marks for the moderation sample to the awarding body; if the moderator determines that the rank order is not correct, the centre is directly referred to a supervisor.

The awarding body compares the moderator’s marks with the centre’s marks. If these marks differ beyond a predetermined amount (known as the tolerance level), then adjustments are made to the marks of all candidates in the centre (whether their script was moderated or not) in order to align them to the same standard. In England, a regression algorithm is used to calculate these adjustments based on the relationship between the marks given by the centre and those of the moderator (see Gill, 2015). Research looking at all NEAs for one exam series showed that less than one quarter (22.5%) of centres had their marks adjusted (Gill, 2015).

As current practice by awarding bodies requires moderation to be conducted at centre level, moderators are able to build up a holistic view of a centre’s approach to marking. However, as each centre is only viewed by a single moderator, this does raise challenges with regard to ensuring the same standard is applied across centres. Current practices to help ensure this involve standardisation training for moderators and monitoring by senior moderators and awarding bodies. In the absence of data on moderator consistency, we can use marking of extended response data as a proxy – for example, results from the marking of history essays show average marker intercorrelations of between 0.52 and 0.54 (Ofqual, 2018).

In recent years, technological advances have allowed electronic submissions of candidates’ work. This opens the door for novel ways of moderating that can move towards a scenario in which candidates’ work is distributed across multiple moderators without being bound by centre, ensuring that the marking standard is consistently applied across centres. The current study explored the use of Comparative Judgement (CJ) as a method for achieving this.

CJ is a technique whereby a series of paired judgements (typically made by multiple judges) is used to generate a measurement scale (Bramley, 2007; Pollitt, 2012a, 2012b). For example, examiners are given pairs of student scripts and judge which script in each pair is the better one. Analysis of these judgements generates an overall rank order of the scripts, and a scale of script quality. There are reported benefits of using CJ compared to traditional marking, in terms of improved reliability, validity and efficiency (see Tarricone & Newhouse, 2016; Wheadon et al., 2020).

The use of CJ has been explored in various assessment contexts: e.g. for comparability of exams (Bramley, 2007; Jones et al., 2016), standard maintaining (Benton et al., 2022; Curcin et al., 2019), and as an alternative to traditional marking (Heldsinger & Humphry, 2010; Pollitt, 2012a; Pollitt & Crisp, 2004; Steedle & Ferrara, 2016; Walland, 2022; Wheadon et al., 2020). Its use as an alternative to inspection moderation of NEAs would be a novel application.

As one of the key outcomes of CJ analysis is to create an overall rank of the scripts, CJ seems excellently placed to accomplish one of the main tasks of moderators: to determine whether the rank order of the scripts is correct. The other task of moderators, determining the acceptability of a centre’s marks, is a little more complex using CJ. One possible approach is that moderator marks could be assigned as a result of the CJ analysis: the CJ analysis produces a measure of script quality for each script (CJ estimate), which could then be converted into moderator marks on the NEA mark scale. This would deviate from the current process in that these moderator marks would not be directly assigned by a single human judge but instead would be based on combining judgements from multiple judges.

The use of CJ has the potential to streamline the moderation process. For example, the implementation of CJ on digital platforms means that judgements can be made online, by multiple moderators and in parallel. Removing the limitations of time and location also widens the potential pool of moderators. In addition, the moderator would no longer need to submit marks; instead, the moderator marks produced by the CJ analysis could be fed into the current algorithm to make centre adjustments. It now seems a good time to investigate whether CJ could offer a feasible, and potentially more efficient, alternative to the current moderation process used in England.

Research questions

In this article we describe a simulation study designed to investigate the feasibility of CJ as an alternative to the current moderation process. In particular, we wanted to establish the CJ design parameters that would allow us to obtain reliable results in this context, namely how many judgements per script would be needed and what the optimum moderation sample size would be. In a meta-analysis of 49 CJ studies, Verhavert et al. (2019) reported that the level of reliability was affected by the number of judgements per script: reliability levels of 0.70 required between 10 and 14 judgements, and a reliability of 0.90 required 26 to 37 judgements. Identifying a minimum number of judgements needed in this moderation context would help us establish whether the use of CJ for moderation was practically and financially feasible. Currently, the moderation sample varies in size with centre entry, and may be extended if more information is needed to make a judgement; identifying a minimum sample size for use across all centre entry sizes would negate the need for this variation.

Simulation, rather than an experimental study, was chosen because it allowed us to explore the method efficiently. In particular, it permitted us to manipulate various design parameters in order to find minimum values without the need for numerous and expensive empirical studies. The results could then feed into experimental trials to explore practicalities of using CJ for moderation.

This study focused solely on the moderators’ role in the moderation process, i.e. producing moderator marks. Investigating subsequent centre-wide adjustments of marks was beyond the scope of this research.

The overarching aim of the research was to establish whether CJ is a feasible method for moderating NEA. In order to address this aim, the research focussed on the following two questions:

  • What is the minimum number of judgements per script needed to generate ‘usable’ moderator marks?

  • What is the minimum size of moderation sample needed to generate ‘usable’ moderator marks?

A ‘usable’ mark is defined as one that has sufficient reliability that it could be awarded to a candidate. There is no single metric to define this; instead, we explored a number of different ones.

Method

Data

The CJ procedure was conducted on simulated datasets of candidates’ centre marks (i.e. the marks that would be awarded by the candidates’ teachers), as these are the focus of moderation procedures. The parameters for these datasets were guided by a reference set of real assessment data from two high-entry NEAs delivered by one of the main awarding bodies in England. In total, nine datasets were simulated using SAS (SAS Institute, 2017), varying in the number of centres, the total number of candidates, and the size of the moderation samples. In all other respects, the datasets were the same. Table 1 presents an overview of these datasets.

Table 1. Composition of the simulated datasets.

All the datasets were simulated in the same way. First, we created datasets that contained either 300, 90 or 30 centres, to assess whether the method was viable for assessments with both large and small entries. Then we added candidates to each centre. We varied the number of candidates across the centres because, under current moderation protocols, centre entry size determines the size of the moderation sample. We ensured each dataset contained an equal number of small (1–15 candidates), medium (16–100 candidates), and large (101–200 candidates) centres. The number of candidates for any individual centre was randomly (Note 2) chosen from a uniform distribution over the range corresponding to its centre entry size.
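
To make this set-up concrete, the sketch below mirrors the step in R (the study itself used SAS); the band boundaries are taken from the text above, while the object names and the seed are ours.

```r
# Sketch of the centre/candidate structure described above (the study used SAS;
# this R version and its object names are illustrative only).
set.seed(1)                                  # fixed seed so the simulation can be replicated

n_centres <- 300                             # 300, 90 or 30 depending on the dataset
bands <- data.frame(band = c("small", "medium", "large"),
                    min  = c(1, 16, 101),
                    max  = c(15, 100, 200),
                    stringsAsFactors = FALSE)

# equal numbers of small, medium and large centres
centre_band <- rep(bands$band, each = n_centres / 3)

# entry size drawn uniformly from the range for the centre's band
centres <- data.frame(
  centre = seq_len(n_centres),
  band   = centre_band,
  n_cand = sapply(centre_band, function(b) {
    r <- bands[bands$band == b, ]
    sample(r$min:r$max, 1)
  }),
  stringsAsFactors = FALSE
)

# one row per candidate
candidates <- data.frame(
  centre    = rep(centres$centre, centres$n_cand),
  candidate = sequence(centres$n_cand)
)
head(candidates)
```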

Then we simulated the centre marks for these candidates, using a two-step approach. The first step entailed generating a ‘true’ mark for each simulated candidate, which represented the true quality of the candidate’s script. Based on the NEA reference set, the true marks were integer values simulated from a normal distribution with a mean of 30 and standard deviation (SD) of 10, constrained to the 1–60 mark range. This mark range was the same as the reference NEAs’ range, and the mean was set to 30 as this was the middle of the mark range and only slightly below the mean in the reference set. A normal distribution was chosen for two reasons. First, the normal distribution has previously been used to simulate exam marks for comparative judgement studies (e.g. Bramley, 2015). Second, although actual centre marks were skewed, with large proportions of marks at the top of the mark range, moderator marks were less skewed. Therefore, it was likely that the NEA true marks would follow a close-to-normal distribution.
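
A minimal sketch of this first step is given below, assuming a simple round-and-clamp approach to keep the marks integer and within range (the original SAS code may have handled the constraint differently).

```r
# 'True' marks: integers from N(30, 10), constrained to the 1-60 mark range.
# Rounding and clamping is our assumption about how the constraint was applied.
simulate_true_marks <- function(n, mu = 30, sigma = 10, min_mark = 1, max_mark = 60) {
  pmin(max_mark, pmax(min_mark, round(rnorm(n, mean = mu, sd = sigma))))
}

set.seed(2)
true_marks <- simulate_true_marks(10000)
summary(true_marks)
```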

The second step entailed adding varying degrees of ‘error’ to these true marks to produce the candidates’ centre marks (which, like the true marks, were constrained to the 1–60 mark range). We varied the degrees of error such that the datasets contained centres with six types of marking accuracy: accurate, little lenient, strong lenient, little severe, strong severe, and erratic. The logic of introducing these centre types was to determine whether different degrees of marking accuracy could be detected and removed by CJ moderation; if so, then the CJ moderation could be deemed to have been successful. For the accurate, lenient and severe centres, all their candidates’ centre marks were set to be the same as, higher than, or lower than their true marks, respectively. In each erratic centre, the candidates’ centre marks varied both below and above their true marks. Table 2 shows the range of deviation in marks for each category. The mark differences from the true marks were randomly chosen from a uniform distribution over the range corresponding to each marking accuracy group. The percentages of centres in each marking accuracy group were broadly based on the distribution of the reference set.

Table 2. Categories of centre marking accuracy and the corresponding degrees of added error.
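
A sketch of the error-adding step is shown below. The deviation ranges in the code are placeholders chosen purely for illustration; the actual ranges used in the study are those reported in Table 2.

```r
# Sketch of the error-adding step. The deviation ranges below are PLACEHOLDERS,
# not the values in Table 2, which are not reproduced here.
add_centre_error <- function(true_marks, accuracy) {
  n <- length(true_marks)
  dev <- switch(accuracy,
    accurate       = rep(0, n),
    little_lenient = sample(1:3, n, replace = TRUE),    # placeholder range
    strong_lenient = sample(4:8, n, replace = TRUE),    # placeholder range
    little_severe  = -sample(1:3, n, replace = TRUE),   # placeholder range
    strong_severe  = -sample(4:8, n, replace = TRUE),   # placeholder range
    erratic        = sample(-8:8, n, replace = TRUE)    # placeholder range
  )
  pmin(60, pmax(1, true_marks + dev))                   # keep centre marks within 1-60
}

set.seed(3)
add_centre_error(c(20, 30, 45), "strong_severe")
```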

After the marks were simulated, we moved on to selecting the moderation sample for each centre – that is, the subset of candidates to undergo the (simulated) CJ procedure. We took two different approaches to determining the size of the moderation sample, which we applied to different datasets: 1) varying the size of the moderation sample across centres and 2) using the same sample size for all centres. For the varied approach, we determined the moderation sample size using the maximum (‘current max’) under current practice (for secondary qualifications in England), which varies by centre entry size. That meant sampling all candidates from the small centres (i.e. 1 to 15), 15 candidates from the medium centres and 20 candidates from the large centres. In contrast, for the fixed approach, we set the moderation sample to either 6 or 10 scripts for all centres within the same dataset. We chose these values because a sample size of 6 is the minimum number of candidates currently moderated for medium/large centres, while 10 is halfway between this minimum and the current max for medium centres. If centres had fewer than the fixed number, then all of their candidates went into the moderation sample.

All moderation samples, regardless of size, had two common characteristics, in line with current practice. First, they included at least one candidate who had the lowest mark within the centre’s whole cohort of candidates and one candidate with the highest mark. Second, the samples contained a representative range of marks. This was achieved by dividing the total mark range of the NEA into four equal groups (1–15, 16–30, 31–45 and 46–60) and then selecting a number of candidates from each group proportional to the size of the group.
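
The sketch below shows one way such a selection could be implemented; the quota rounding and tie handling are our simplified reading of the description rather than the awarding body’s exact procedure.

```r
# Sketch of moderation-sample selection for one centre, following the two rules
# above (our reading; quota rounding and tie handling are simplifications).
select_moderation_sample <- function(centre_marks, sample_size = 10) {
  n <- length(centre_marks)
  if (n <= sample_size) return(seq_len(n))                   # small centres: take everyone

  chosen <- unique(c(which.min(centre_marks), which.max(centre_marks)))  # lowest and highest

  bands <- cut(centre_marks, breaks = c(0, 15, 30, 45, 60))  # the four mark groups
  remaining <- setdiff(seq_len(n), chosen)
  # how many more to take from each group, proportional to group size
  quota <- round(table(bands) / n * (sample_size - length(chosen)))

  for (b in names(quota)) {
    pool <- remaining[bands[remaining] == b]
    take <- min(as.integer(quota[b]), length(pool))
    if (take > 0) chosen <- c(chosen, pool[sample.int(length(pool), take)])
  }
  sort(chosen)                                               # indices of the sampled candidates
}

set.seed(4)
marks <- pmin(60, pmax(1, round(rnorm(80, 30, 10))))
select_moderation_sample(marks, sample_size = 10)
```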

Comparative judgement

We simulated the CJ procedure for a moderation exercise in which moderators judge pairs of different candidates’ NEA scripts with regard to their quality. If CJ were conducted with human moderators, the procedure would produce a binary outcome (a win or a loss) for each script in each paired comparison. In this study we simulated the CJ using SAS (SAS Institute, 2017), following the method described in Bramley (2015). We simulated CJ outcomes (wins and losses) based on the probability of each outcome derived from the candidates’ (simulated) true marks (as these reflect the true quality of the candidates’ scripts), using Equation (1):

(1)   p(A > B) = e^(βA − βB) / (1 + e^(βA − βB))

where p(A > B) is the probability of candidate A’s work being judged to be better than candidate B’s work by the moderator, βA is the true estimate of the quality of candidate A’s script and βB is the true estimate of the quality of candidate B’s script. Both βA and βB are on a logit scale.

The candidates’ logit estimates of the true quality of their scripts (βA and βB) were derived by rescaling the true marks onto a logit scale, where the mean mark of the assessment (30) was set to zero and the standard deviation was set to 1.7. A standard deviation of 1.7 was chosen following the same rationale as in Bramley (2015): ‘This corresponds to a probability of ~ 0.84 for a script 1 SD above the mean winning a comparison with the average script, and puts the 5th and 95th percentiles at ±3.33 logits. Choosing a higher or lower value than 1.7 would therefore correspond either to assuming a higher “true” spread of scripts, or higher discrimination for the judges’ (p. 7).

Then, to simulate the binary outcomes of comparative judgement, the probability of a win from Equation (1) was compared with a number randomly selected from a uniform distribution in the range 0–1. If the probability of candidate A winning their comparative judgement against candidate B was higher than the random number, then candidate A was coded as having won the comparison – this simulated a moderator judging candidate A’s script to be better than candidate B’s script. If the probability was lower than the random number, then candidate A was coded as having lost the comparison against candidate B.
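
A minimal sketch of this judgement step (in R rather than the study’s SAS), assuming the rescaling is a straightforward linear transformation of the true marks to mean 0 and SD 1.7:

```r
# Sketch of one simulated judgement: rescale true marks to logits (mean 0, SD 1.7),
# apply Equation (1) for the win probability, and compare it with a uniform draw.
set.seed(5)
true_marks <- pmin(60, pmax(1, round(rnorm(5000, mean = 30, sd = 10))))
beta <- (true_marks - mean(true_marks)) / sd(true_marks) * 1.7   # logit-scale quality

simulate_judgement <- function(beta_a, beta_b) {
  p_a_wins <- exp(beta_a - beta_b) / (1 + exp(beta_a - beta_b))  # Equation (1)
  as.integer(runif(1) < p_a_wins)                                # 1 = A wins, 0 = A loses
}

simulate_judgement(beta[1], beta[2])
```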

We used a ‘round-robin’ pairing algorithm (Bramley, 2015) to generate the pairs of scripts for comparison (the pairs were selected at random and without repeated pairings), and we conducted several simulations with different numbers of judgements being made per script. We set the minimum number of judgements per script to 10 because it is rare for empirical CJ studies to use fewer than 10 judgements, with most using between 12 and 20 judgements (Bramley, 2015). The numbers of judgements reported in this article are 10, 20, 30, 40, 50, 100 and 200, as the results did not change substantially with more judgements.
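
The sketch below illustrates a simple round-based random pairing of this kind; it is our simplified reading of the allocation (an even number of scripts, with redraws on repeated pairings) rather than the exact algorithm in Bramley (2015).

```r
# Each round gives every script one new judgement; a round is redrawn if any of
# its pairs has already been used. Assumes an even number of scripts and a number
# of judgements per script that is small relative to the number of scripts.
make_pairs <- function(n_scripts, judgements_per_script) {
  used  <- character(0)
  pairs <- NULL
  for (round in seq_len(judgements_per_script)) {
    repeat {
      ord  <- sample(n_scripts)                          # random permutation of script ids
      cand <- cbind(ord[c(TRUE, FALSE)], ord[c(FALSE, TRUE)])
      keys <- apply(cand, 1, function(p) paste(sort(p), collapse = "-"))
      if (!any(keys %in% used)) break                    # redraw the round on any repeat
    }
    used  <- c(used, keys)
    pairs <- rbind(pairs, cand)
  }
  pairs
}

set.seed(6)
head(make_pairs(n_scripts = 100, judgements_per_script = 10))
```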

Analyses

The CJ data was analysed separately for each simulation using the Extended Bradley-Terry Model in the R ‘sirt’ package (Robitzsch, 2018). This model produces a logit scale of script quality, calculating a value for each script on this scale (i.e. a CJ estimate of the script’s quality) based on the CJ judgement outcomes (i.e. the wins and losses).
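
As an illustration, and assuming the btm() interface of the ‘sirt’ package (a data frame with two identifier columns and a result column, where 1 means the first script won), fitting the model might look like the sketch below; the toy data are ours and the names of the output components can vary between package versions.

```r
# Sketch of fitting the Extended Bradley-Terry Model with sirt::btm().
# The toy judgement data and object names are ours; check str(mod) if the
# output component names differ in your version of the package.
library(sirt)

judgements <- data.frame(
  script_a = c("s1", "s1", "s2", "s3", "s2", "s4"),
  script_b = c("s2", "s3", "s3", "s4", "s4", "s1"),
  result   = c(1, 1, 0, 1, 1, 1),         # 1 = first script judged better, 0 = second
  stringsAsFactors = FALSE
)

mod <- btm(judgements)
mod$effects                               # per-script quality estimates (theta) and SEs
mod$mle.rel                               # reliability statistic reported by the package
```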

In order to evaluate the feasibility and success of CJ as a moderation method, two approaches were taken in analysing these CJ estimates of script quality. First, we inspected the statistical characteristics of the CJ estimates to evaluate measurement characteristics of (statistical) validity and reliability. Second, we explored how these CJ estimates could then be used to assign ‘moderator’ marks to the scripts, as this would be needed to evaluate the accuracy of centres’ marks.

Statistical characteristics of the CJ estimates

As a first step, the CJ estimates of script quality were inspected with regard to their statistical distributions and their statistical relationship to the true estimates of script quality. This gave us information about the (statistical) validity of these estimates. In particular, descriptive statistics for the CJ estimates were compared to the descriptive statistics of the logits of the true marks to determine how well the CJ estimates aligned with the true estimates. Next, the reliability of the CJ estimates was examined by calculating the Scale Separation Reliability (SSR) coefficient using Equation (2) below.

(2)   SSR = (Observed variance − Mean squared error) / Observed variance

Adding to our evaluations of validity, the SSR was compared to the true reliability of the CJ estimates. The true reliability was calculated as the square of the correlation between the CJ estimates and the true marks (Bramley, 2015).
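
For clarity, the two quantities can be computed as in the sketch below (toy data; the variable names are ours).

```r
# SSR (Equation 2) from the CJ estimates (theta) and their standard errors (se),
# plus the 'true' reliability as the squared correlation with the true marks.
ssr <- function(theta, se) {
  obs_var <- var(theta)                  # observed variance of the CJ estimates
  mse     <- mean(se^2)                  # mean squared error from the standard errors
  (obs_var - mse) / obs_var              # Equation (2)
}

true_reliability <- function(theta, true_marks) {
  cor(theta, true_marks)^2               # squared correlation with the true marks
}

# toy illustration only
set.seed(7)
true_marks <- round(rnorm(200, 30, 10))
theta      <- (true_marks - mean(true_marks)) / sd(true_marks) * 1.7 + rnorm(200, 0, 0.5)
se         <- rep(0.5, 200)
c(SSR = ssr(theta, se), true_rel = true_reliability(theta, true_marks))
```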

Assigning ‘moderator’ marks from the CJ estimates

When CJ is used for moderation, whether the judgements are made by human moderators or simulated (as in our study), the CJ estimates need to be converted onto the same mark scale as the centre marks so that they can be directly compared for differences. We explored the use of linear regression (Note 3) to assign these ‘moderator’ marks, deriving them from the relationship between the moderators’ CJ estimates and the centre marks. The general form of the regression model was as follows:

Y = β0 + β1X

where Y is the outcome variable (centre marks), X is the predictor (CJ estimate), β0 is the intercept and β1 is the slope (or regression coefficient).

Regression has two advantages for this purpose. First, regression models can generate predicted outcomes (marks) that are on a different scale from the predictors in the model (the CJ estimates are on a logit scale). Second, by using the centre marks as the outcome variable in the model, the predicted outcomes (the ‘moderator marks’) essentially become calibrations of the centre marks based on the moderators’ CJ estimates. This ultimately serves the objective of moderation, which is to bring the centre marks to a common standard.
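
A minimal sketch of this step is given below, assuming a single regression fitted across all moderated scripts (our reading of the description) and that predictions are rounded and clamped to the 1–60 mark range.

```r
# Sketch of assigning 'moderator' marks: regress centre marks on CJ estimates
# across all moderated scripts, then use the fitted line to predict a mark for
# each script. Rounding/clamping to 1-60 is our assumption; the toy data are ours.
set.seed(8)
moderated <- data.frame(
  centre      = rep(1:30, each = 10),
  cj_estimate = rnorm(300, mean = 0, sd = 1.7)             # stand-in CJ logit estimates
)
moderated$centre_mark <- pmin(60, pmax(1, round(30 + 5 * moderated$cj_estimate +
                                                  rnorm(300, 0, 4))))

fit <- lm(centre_mark ~ cj_estimate, data = moderated)     # Y = b0 + b1 * X
moderated$moderator_mark <- pmin(60, pmax(1, round(predict(fit))))
head(moderated)
```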

Once the regression models were run, we conducted a set of analyses to evaluate the characteristics of these ‘moderator’ marks. We, again, examined issues of (statistical) validity and reliability by comparing these moderator marks to the true marks. We also ran analyses comparing the moderator marks to the centre marks to determine the extent and type of mark differences, including how these varied by centre marking accuracy (e.g. accurate, severe, lenient). Throughout, we analysed how CJ design parameters (number of judgements and moderator sample size) affected the patterns of these results.

Results

The results section is divided into two parts. The first part focusses on the statistical characteristics of the CJ estimates while the second part presents the outcomes of the regression method used to assign moderator marks from the CJ estimates.

Statistical characteristics of the CJ estimates

This set of results shows the following statistical characteristics of the CJ estimates (logits) of script quality: their distributions, reliability and relationship with the true (simulated) logits and marks. The results additionally show how these characteristics varied as a function of the size of the dataset and the number of comparative judgements per script.

Statistical distributions of CJ estimates

Figure 1 presents the distributions of the CJ estimates for the largest dataset. The boxplots show the CJ estimates (left panel) and the standard errors of these estimates (right panel). With smaller numbers of judgements per script, the CJ estimates were far more spread out than the true logit values, having wider interquartile ranges. The estimates converged to the true distribution as the number of judgements per script increased; the interquartile range for the 200-judgement simulation was very similar to that of the true data.

Figure 1. Distributions of logit estimates (left panel) and standard errors of the logits (right panel) as a function of the number of judgements per script, shown for largest dataset.


The boxplots also highlight that there were many outliers amongst the CJ estimates, especially when the CJ estimates were based on smaller numbers of judgements. When more judgements were used there were fewer outliers. The same patterns were found for both the CJ estimates and the standard errors.

It is important to highlight that this pattern was very similar for datasets of all sizes (see Figures A1 and A2 in the Appendix for boxplots of the medium and smallest datasets). This similarity indicates that it was the absolute number of judgements per script that affected the logit estimates, rather than the number of judgements relative to the size of the dataset.

Statistical reliability of the CJ estimates

Figure 2 shows the scale separation reliability (SSR) coefficients for all simulations, as a function of the number of judgements per script. SSR coefficients ranged between 0.77 and 0.99. The SSR increased with larger numbers of judgements. When 20 or more judgements were used, all of the SSR coefficients were above 0.85. There was more variation in SSR amongst datasets when 10 and 20 judgements were used. In contrast, with 40, 50, 100 and 200 judgements, the SSR showed little variation amongst the datasets. There was little evidence of the SSR varying systematically with the size of the dataset. That is, for each number of judgements tested, the results for all the datasets clustered tightly around the same level of reliability; there was no obvious separation of the results into sub-groups within these clusters.

Figure 2. Scale separation reliability coefficients of all CJ simulations, as function of the number of comparative judgements per script.


Figure 3 shows the relationship between the SSR and the true reliability for all datasets. When data points lie on the diagonal line, the SSR is the same as the true reliability for those datasets. When data points are above (below) the diagonal line, the SSR is higher (lower) than the true reliability. The SSR values converged more consistently and more closely to the true reliability when higher numbers of judgements were used. With 40, 50, 100 or 200 judgements, the SSR values were almost identical to the true reliability, to the extent that every SSR was within 0.05 of the true reliability. With the three smaller numbers of judgements (10, 20 and 30), all of the SSR coefficients were higher than the true reliability, suggesting that fewer judgements may overestimate the reliability of the CJ estimates. Findings of reliability inflation when low numbers of judgements are carried out have been reported in other CJ studies (Bramley, 2015). One possible explanation for this (following the suggestion made by Bramley and Vitello (2018) to explain inflated reliability for adaptive CJ) is that random pair allocation does not systematically pair scripts. Therefore, when using small numbers of judgements, scripts may not be compared to enough scripts of similar and different levels of difficulty, which reduces the opportunities for judges to encounter (and record) ‘disconfirming evidence’ of the quality of that script (Bramley & Vitello, 2018, p. 45).

Figure 3. Relationship between SSR and ‘true’ reliability of CJ simulations, as function of the number of comparative judgements per script.


Generating moderator marks from the CJ estimates using regression

This section describes how the moderator marks were calculated after the CJ analyses. Based on the results from the previous section, this analysis only considered datasets with 300 centres (as there was little variation between the different-sized datasets), and only considered numbers of judgements per script up to 40 (as the reliability estimates did not increase notably above this). Analyses were carried out for the three different moderation sample sizes to compare the effects of this design parameter: the total number of scripts that would be requested under the current protocol (current max: up to 20); six scripts (small fixed sample size); and 10 scripts (moderate fixed sample size).

Linear regression was used to generate the moderator marks based on the relationship between the CJ estimates (predictor variable) and the centre marks (outcome variable). The regression method calculates the best-fitting line for the observed data by minimising the sum of the squares of the vertical deviations from each data point to the line. As a result, moderator marks for scripts can turn out to be different from the centre marks, as the aim of this method is to align the centre marks to a common session standard, as reflected in the moderators’ CJ estimates.

To evaluate the success of the regression method, we first assessed the statistical validity of the moderator marks estimated from the regression model by comparing these marks to the true marks we had simulated. Table 3 below shows the correlations between the two sets of marks for different sizes of moderation sample and different numbers of judgements per script. The strength of the relationships shown in Table 3 also allowed us to recommend a minimum number of moderated scripts and a minimum number of judgements per script that could be viable for practical use.

Table 3. Moderator marks vs. true marks correlations.

Table 3 shows that a fixed moderation sample of 10 scripts produced similar correlations to the ‘current max’ sample, particularly when the number of judgements was higher than 20. The smaller sample of six scripts per centre produced the worst results, independently of the number of judgements per script. This suggests that using a moderation sample of 10 scripts is a viable alternative to the current max and preferable to the smaller sample of six. Furthermore, the correlations in Table 3 highlight the impact of the number of judgements. For each moderation sample size, the correlations increased as the number of judgements increased. The correlations were high (over 0.70) for all moderation sample sizes when 20 or more judgements were carried out, which highlights the need for at least 20 judgements per script to obtain closer relationships between the moderator marks and the true marks.

Scatterplots showing the relationships between the moderator marks and the true marks are presented in Figures A3 to A5 in the Appendix.

Next, we turn our attention to the relationship between the moderator marks and the centre marks, as this would be the focus of current moderation procedures, which, unlike simulations, do not have access to candidates’ ‘true’ marks. First, we show the levels of similarity in the rank orders of scripts between the moderator and centre marks, and then we show the numerical differences between these marks.

Evaluating the rank order of scripts within each centre is a particularly important feature of the existing moderation process, as the aim is to maintain the rank order unless there are significant marking issues. Thus, we evaluated the extent to which the CJ procedure (including the regression method) affected the rank order of scripts by calculating within-centre correlations between centre marks before and after the CJ analyses (i.e. original centre marks vs. moderator marks). We compared how these correlations varied by CJ design parameters, in order to identify which parameters would be needed to ensure that genuine differences in the rank order of scripts could be revealed. This is because statistical properties of data can create spurious effects on rank orders, which need to be minimised where possible to maximise the validity of results. We also assessed how these correlations were affected by other features, such as centre marking accuracy type (e.g. accurate, lenient, severe); we expected erratic centres to be more likely to see changes in their rank orders.
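
A sketch of this within-centre check on toy data (any data frame with centre, centre-mark and moderator-mark columns would do):

```r
# Within-centre rank-order check: correlate centre marks with moderator marks
# separately for each centre. Toy data for illustration only.
set.seed(9)
dat <- data.frame(
  centre      = rep(1:5, each = 10),
  centre_mark = round(runif(50, 1, 60))
)
dat$moderator_mark <- pmin(60, pmax(1, dat$centre_mark + sample(-3:3, 50, replace = TRUE)))

within_centre_cor <- sapply(split(dat, dat$centre),
                            function(d) cor(d$centre_mark, d$moderator_mark))
summary(within_centre_cor)
```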

The results of these analyses are shown in Figure 4. There is one chart for each moderation sample size (n = current max, n = 10 or n = 6). Within each chart, the distributions of correlations are shown separately for each group of centres according to their type of marking accuracy and whether the results were based on 10, 20, 30 or 40 judgements per script. The correlations were stronger (i.e. closer to one) for higher numbers of judgements per script, independently of the size of the moderation sample and the level of marking accuracy in the centre. This is not surprising based on the results we have presented so far. In particular, having 40 judgements per script produced very high correlations of 0.8 and above. As having 40 judgements per script would be time consuming and costly, a lower number may be viable. Twenty judgements per script appears to be a reasonable minimum, as it also produced high correlations of around 0.8, and these were much closer to a correlation of 1.0 than those for 10 judgements per script. Note that we found some low and negative correlations among the results, but these corresponded to very small centres, where changes in the rank order are more likely to occur.

Figure 4. Correlations between centre marks before and after CJ (centre level) shown for each moderation sample size (n).


The analyses also showed some differences between the moderation samples. For the fixed moderation sample of six scripts per centre (i.e. n = 6 in Figure 4), the average correlations in each group of centres (e.g. accurate, erratic, little lenient, etc.) were typically lower than for larger moderation samples. The ranges of these correlations were also wider and there were more outliers (i.e. centres with much lower correlations). The results for the current max sample size did not appear very different from the results for a fixed sample of 10 scripts per centre; perhaps the ranges were a little narrower but, on average, the correlations were very similar. This would therefore support using a fixed moderation sample of 10 scripts per centre in a CJ moderation protocol.

Thus, from the point of view of maintaining the rank order of the scripts within a centre, it seems as if this method of assigning marks to scripts using the CJ estimates could work.

We now turn to the numerical differences in marks. It was important to determine how much the moderator marks differed relative to the centre marks as well as to the true marks. If the CJ procedure had identified and corrected the ‘error’ that we had added when simulating the centre marks, we would expect the moderator marks to be closer to the true marks than to the centre marks.

We show the results of these comparisons between the moderator marks and the centre marks for the moderation sample of 10 scripts per centre, given the earlier results, and break these results down by the level of marking accuracy of the centres (Table 4). For comparison, Table 4 also shows the average differences between centre marks and true marks for centres at each level of marking accuracy (that is, the error introduced in the simulation).

Table 4. Differences between centre marks before and after CJ (moderation sample n = 10).

Table 4 shows that the moderator marks in accurate centres were, on average, between 2.5 and 3 marks higher than the centre marks. While we might expect there to be no difference (as no ‘error’ was introduced when simulating centre marks for accurate centres), the presence of many lenient centres influenced the regression. In current practice, mark adjustments following moderation are only made to a centre’s marking if the adjustments fall outside a set amount or tolerance (Gill, 2015). The mark shift seen here is within the current tolerance for the reference set (4 marks). For little lenient centres, moderator marks were just one mark higher on average than centre marks, again within tolerance. For little severe centres, the differences were larger; the moderator marks were around 5 marks higher than the centre marks. The biggest differences made by the moderation were in strong severe centres, where estimated marks were, on average, around 10 marks higher.

We can use the values in Table 4 to calculate how much the moderator marks differ from the true marks, to understand the extent to which the CJ procedure has adjusted for the error introduced into the centre marks. In each case, we simply need to add the moderator-centre difference to the centre-true difference. For example, for little lenient centres using 20 judgements, the moderator-centre difference shows that moderation via CJ would add 1.18 marks on average to the centre marks, and the centre-true difference shows that the centre marks were already 2.00 marks over the true marks. Together, this means that the final mark in little lenient centres would be, on average, 3.18 marks (2.00 marks + 1.18 marks) above the true marks. Similarly, for little severe centres, and again in the case of 20 judgements per script, the moderation would add 4.22 marks on average, making the final marks in these centres, on average, around 2.17 marks (−2.05 marks + 4.22 marks) higher than the true marks. Overall, the results in Table 4 show that, although the moderation does not correct for all the ‘error’ introduced when simulating the centre marks, it has calibrated the centres to a common standard independently of the level of marking accuracy. It is important to note that we usually do not know true marks in real (non-simulated) assessment sessions. These results show that calibrating centres to each other using regression is a feasible way to achieve fairness between centres that have different, but unknown, levels of marking accuracy.

Table 4 also shows that, although the average difference between centre marks before and after CJ did not change much when the number of judgements per script increased, the standard deviation did, decreasing considerably with more judgements.

Finally, Figure 5 shows the average moderator marks for the centres with different levels of marking accuracy alongside the average centre marks (for a moderation sample of 10 scripts per centre and 20 judgements per script). Arrows highlight how the average centre marks shifted with the moderation process. We can see that the regression method calibrated the centres to a common session standard; the six different types of centre have very similar average moderator marks (approximately 32 marks), with strong lenient and strong severe centres incurring the biggest shifts from their centre marks.

Figure 5. Average marks before CJ (centre marks) and after CJ (moderator marks), broken down by level of accuracy of the centres.


Discussion

In this study we set out to investigate, using a simulation approach, whether CJ could be used for the moderation of NEAs. Our focus was the adjustment of marks by a moderator; subsequent centre-wide adjustment of marks was beyond the scope of this research. Centre-wide adjustment is currently achieved using linear regression (see Gill, 2015) and we suggest starting with this method in studies investigating centre-wide adjustment of marks.

We evaluated the feasibility of the CJ method in two ways: first, we looked at the reliability of the CJ estimates and how they compared to the true (simulated) data; second, we investigated how the CJ estimates could be used to assign marks to scripts. The findings are discussed below with reference to the specific research questions.

What is the minimum number of judgements needed to generate ‘usable’ moderation marks?

We found that the accuracy of the CJ estimates increased as the number of judgements increased, both in terms of the distribution of values, which converged to the true distribution, and in the reduction in the number of outliers. Interestingly, we found that it was the absolute number of judgements per script that affected these estimates, rather than the number of judgements relative to the size of the dataset. The reliability of these estimates (the SSR coefficient) also increased with higher numbers of judgements, with all coefficients above 0.85 when 20 or more judgements were used. This finding from a simulation exercise is in line with the meta-analysis performed by Verhavert et al. (2019), who also found that reliability increased with the number of comparisons and that a reliability of 0.90 needed 26–37 comparisons.

The relationship between SSR and true reliability also became stronger with increased numbers of judgements. When 40 or more judgements were made, the SSR values were very similar to the true reliability. Numbers of judgements of 20 or below, in particular, appeared to overestimate the reliability of the CJ estimates relative to the true reliability (see Bramley (2015) for a discussion of this issue).

These results suggest that an ideal number of judgements would be 40 or above. However, as having 40 judgements per script would be very time consuming and thus costly, the use of a minimum of 20 judgements could be acceptable. The case for a minimum of 20 judgements was further supported by the second set of analyses, which evaluated the CJ estimates with regard to how well they could be used to assign marks to the scripts. In these analyses, which used regression to assign moderator marks based on the CJ estimates, 20 judgements preserved the rank order of scripts within the centres relatively well, and identified and corrected, to a certain extent, the error that we introduced to the simulated true marks when generating the centre marks. In fact, correlations between centre marks and moderator marks were quite close to one for 20 judgements, whereas 10 judgements produced weaker correlations. This ties in with previous research studies, which have used between 12 and 20 judgements per script (Bramley, 2015).

What is the minimum size of moderation sample needed to generate ‘usable’ moderation marks?

Correlations between centre and moderator marks (at centre level) were typically lower and more variable for a moderation sample size of six scripts per centre than for larger moderation samples. The results for the current max sample size (i.e. all scripts from small centres, 15 from medium centres, 20 from large centres) did not appear very different to the results for a fixed sample of 10 scripts per centre. Together, these results therefore support a fixed moderation sample of 10 scripts per centre regardless of centre entry size. The use of this value would simplify the sample selection process for centres.

Is CJ for moderation a feasible method?

The method of assigning moderator marks using linear regression appeared to work well. The rank order of the scripts within centres was maintained, and ‘usable’ moderator marks could be achieved with a moderation sample of 10 scripts per centre and 20 judgements per script. The analyses also showed how the CJ procedure worked for different centre types (e.g. accurate, erratic, little lenient, etc.). The centres we had simulated as strong lenient or strong severe incurred the biggest mark adjustments after CJ. Erratic centres, as we would expect, had lower correlations between marks before and after CJ indicating that the rank order was not fully maintained.

The regression method calibrated centres to a common session standard and, as a result, even marks for accurate centres were adjusted. If we were going to use this method to moderate NEAs, the ‘standard’ might vary from year to year as the regression is solely based on CJ estimates of the scripts submitted in any one year. These two issues could be mitigated in several ways. First, a tolerance could be applied to the adjustments such that those within tolerance do not need to be made. This would mean that mark changes would be more likely to occur in strong lenient and/or strong severe centres than in centres that mark more accurately. Second, there could be an enhanced focus in centres on setting the standard to reduce the likelihood of adjustments. Third, scripts from a previous year could be included to anchor the standard from this year to that of the previous year. However, this would mean additional scripts would need to be included, thus increasing moderating time.

When interpreting the results of this study, it should be remembered that we cannot determine how long the CJ moderation task would take (e.g. the time each judge spends making each paired comparison) and thus whether using the numbers of judgements and scripts suggested would be practical in terms of time and cost. An experimental study would be needed to investigate this (see Vidal Rodeiro & Chambers, 2022, for such a study).

Limitations

This study used a simulation approach. Simulation allowed us to test the effects of a variety of different design parameters (e.g. centre entry size, moderation sample size and number of judgements per script) on the CJ estimates of script quality, and allowed us to select minimum values for practical use. However, because no empirical data and no human participants were used in the CJ procedure, it is possible that the effects of these parameters may differ in a real-life empirical study (Dorans, 2014; Feinberg & Rubright, 2016). For example, one potential difference between our simulation study and an empirical study is the statistical distribution of the candidate data (normal vs. skewed), which affects how many scripts of particular levels of difficulty are available and may be paired together. Another difference between simulation and empirical studies is how the comparative judgements are made. Human participants may discriminate between scripts in a way that is different from the simulation algorithm – see Leech and Chambers (2022) for a discussion of how judges make their decisions.

Nevertheless, investigating CJ design parameters via simulation is far more cost effective than establishing them experimentally and more informative than using an educated guess. These simulation values can then be used as a basis for experimental research, especially as previous studies have shown that CJ simulations can produce similar results to empirical CJ studies (Bramley, 2015; Bramley & Vitello, 2018).

In summary, from this study it would appear that the use of CJ for moderation does have the potential to enable us to capitalise on technological advances and thus shape a more efficient moderation process. CJ allows candidate work to be moderated in parallel, across multiple moderators, removing centre-based restrictions, ensuring a consistent marking standard and removing some of the administrative burden on moderators (who would no longer have to request additional samples or input marks). The next step is to use the results of this study, together with those of another study exploring the practical feasibility of the method (in terms of time taken, its use on larger bodies of work and judge confidence; see Vidal Rodeiro & Chambers, 2022), to carry out a full end-to-end pilot study of the method, including centre-wide adjustments.

In addition to potentially improving the current moderation process in England as described above, the study has wider implications. It suggests some useful design parameters that can be used in CJ in the context of moderating NEAs, it illustrates the usefulness of simulation in feasibility studies, and it adds to the body of work on CJ as a method of maintaining standards.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Lucy Chambers

Lucy Chambers is a Principal Research Officer in the Assessment Research and Development division at Cambridge University Press & Assessment. She leads the research strand on qualifications and assessment. Her current interests include the moderation of school-based assessment, investigating the validity of comparative judgement and research data management.

Sylvia Vitello

Sylvia Vitello is a Senior Researcher in the Assessment Research and Development division at Cambridge University Press & Assessment. She holds a PhD in Experimental Psychology from University College London. Her current research focuses on assessment design for general and vocational qualifications, alternative forms of assessment, and standard maintaining methods such as comparative judgement, as well as broader education topics such as the meaning and development of competence and equality, diversity and inclusion.

Carmen Vidal Rodeiro

Carmen Vidal Rodeiro is a Senior Researcher in the Assessment Research and Development division at Cambridge University Press & Assessment. She holds a PhD in Statistics from the University of Aberdeen. Her current research focuses on aspects of educational measurement and qualification reform, accessibility, standards monitoring, the validity of assessments as predictors of university and career success, the use of comparative judgement for the moderation of school-based assessments, and progression routes from vocational and general education.

Notes

1. Centre refers to the institution running the assessment, usually an individual school.

2. SAS uses seed values to select numbers ‘randomly’. Fixed seed values were used for all the datasets described in this article so that the results of the simulations could be replicated.

3. Linear regression is reported for simplicity; multilevel and non-parametric regression techniques were also explored and produced similar results.

4. Note that if the rank ordering of the scripts were exactly the same, then all data points would fall on the straight diagonal line of the graphs.

References

  • Benton, T., Gill, T., Hughes, S., & Leech, T. (2022). A summary of OCR’s pilots of the use of comparative judgement in setting grade boundaries. Research Matters: A Cambridge Assessment Publication, 33, 10–30. https://doi.org/10.17863/CAM.100424
  • Billington, L. (2009). Principles of moderation of internal assessment. AQA Centre for Education Research and Policy.
  • Bramley, T. (2007). Paired comparison methods. In Techniques for monitoring the comparability of examination standards (pp. 246–300). QCA.
  • Bramley, T. (2015). Investigating the reliability of adaptive comparative judgment. Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
  • Bramley, T., & Vitello, S. (2018). The effect of adaptivity on the reliability coefficient in adaptive comparative judgement. Assessment in Education Principles, Policy & Practice, 26(1), 43–58. https://doi.org/10.1080/0969594X.2017.1418734
  • Curcin, M., Howard, E., Sully, K., & Black, B. (2019). Improving awarding: 2018/2019 pilots. Ofqual.
  • Dorans, N. J. (2014). Simulate to understand models, not nature (Research Report No. RR-14-16). Educational Testing Service.
  • Feinberg, R. A., & Rubright, J. D. (2016). Conducting simulation studies in psychometrics. Educational Measurement Issues & Practice, 35(2), 36–49. https://doi.org/10.1111/emip.12111
  • Gill, T. (2015). The moderation of coursework and controlled assessment: A summary. Research Matters: A Cambridge Assessment Publication, 19, 26–31. https://doi.org/10.17863/CAM.100323
  • Heldsinger, S., & Humphry, S. (2010). Using the method of pairwise comparison to obtain reliable teacher assessments. The Australian Educational Researcher, 37(2), 1–19. https://doi.org/10.1007/BF03216919
  • JCQ. (2019a). Instructions for conducting coursework 2019-2020. Joint Council for Qualifications.
  • Jones, I., Wheadon, C., Humphries, S., & Inglis, M. (2016). Fifty years of A-level mathematics: Have standards changed? British Educational Research Journal, 42(4), 543–560. https://doi.org/10.1002/berj.3224
  • Leech, T., & Chambers, L. (2022). How do judges in comparative judgement exercises make their judgements? Research Matters, 33, 31–47. https://doi.org/10.17863/CAM.100426
  • Ofqual. (2018). Marking reliability studies 2017.
  • Pollitt, A. (2012a). Comparative judgement for assessment. International Journal of Technology and Design Education, 22(2), 157–170. https://doi.org/10.1007/s10798-011-9189-x
  • Pollitt, A. (2012b). The method of adaptive comparative judgement. Assessment in Education Principles, Policy & Practice, 19(3), 281–300. https://doi.org/10.1080/0969594X.2012.665354
  • Pollitt, A., & Crisp, V. (2004, September 15-18). Could comparative judgements of script quality replace traditional marking and improve the validity of exam questions? Paper presentation, British Educational Research Association Annual Conference, UMIST.
  • Robitzsch, A. (2018). Package ‘sirt’: Supplementary item response theory models. R package version 1.10-10.
  • SAS Institute. (2017). Base SAS 9.4 procedures guide: Statistical procedures.
  • Steedle, J. T., & Ferrara, S. (2016). Evaluating comparative judgment as an approach to essay scoring. Applied Measurement in Education, 29(3), 211–223. https://doi.org/10.1080/08957347.2016.1171769
  • Tarricone, P., & Newhouse, C. P. (2016). Using comparative judgement and online technologies in the assessment and measurement of creative performance and capability. International Journal of Educational Technology in Higher Education, 13(1), 1–11. https://doi.org/10.1186/s41239-016-0018-x
  • Verhavert, S., Bouwer, R., Donche, V., & De Maeyer, S. (2019). A meta-analysis on the reliability of comparative judgement. Assessment in Education Principles, Policy & Practice, 26(5), 541–562. https://doi.org/10.1080/0969594X.2019.1602027
  • Vidal Rodeiro, C., & Chambers, L. (2022). Moderation of non-exam assessments: Is comparative judgement a practical alternative? Research Matters, 33, 100–119. https://doi.org/10.17863/CAM.100428
  • Vitello, S., & Williamson, J. (2017). Formal definitions of assessment types. Reforms to qualifications: Factsheet 1. Cambridge Assessment.
  • Walland, E. (2022). Judges’ views on pairwise comparative judgement and rank ordering as alternatives to analytical essay marking. Research Matters: A Cambridge Assessment Publication, 33, 48–67. https://doi.org/10.17863/CAM.100427
  • Wheadon, C., Barmby, P., Christodoulou, D., & Henderson, B. (2020). A comparative judgement approach to the large-scale assessment of primary writing in England. Assessment in Education Principles, Policy & Practice, 27(1), 46–64. https://doi.org/10.1080/0969594X.2019.1700212

Appendix

Figure A1. Distributions of logit estimates (left panel) and standard errors of the logits (right panel) as a function of the number of judgements per script, shown for the medium dataset.


Figure A2. Distributions of logit estimates (left panel) and standard errors of the logits (right panel) as a function of the number of judgements per script, shown for smallest dataset.


Figure A3. Moderator marks vs. true marks ~ moderation sample size n = current max (Note 4).


Figure A4. Moderator marks vs. true marks ~ moderation sample size n = 10 (Note 4).


Figure A5. Moderator marks vs. true marks ~ moderation sample size n = 6 (Note 4).
