
Comparability of GCSE examinations in different subjects: an application of the Rasch model

Pages 609-636 | Published online: 15 Sep 2008
 

Abstract

The comparability of examinations in different subjects has been a controversial topic for many years and a number of criticisms have been made of statistical approaches to estimating the ‘difficulties’ of achieving particular grades in different subjects. This paper argues that if comparability is understood in terms of a linking construct then many of these problems are resolved. The Rasch model was applied to an analysis of data from over 600,000 candidates who took the General Certificate of Secondary Education (GCSE) examinations in England in 2004. Thirty‐four GCSE subjects were included in the final model, which estimated the relative difficulty of each grade in each subject. Other subjects failed to fit, as did the fail grade, U. Significant overall differences were found, with some subjects more than a grade harder than others, though the difficulty of a subject varied appreciably for different grades. The gaps between the highest grades were on average twice as big as those between the bottom grades. Differential item functioning (DIF) was found for male and female candidates in some subjects, though it was small in relation to variation across subjects. Implications of these findings for various uses of examination grades are discussed.

Notes

1. Subject Pairs Analysis refers to a group of methods that compare the grades achieved by all candidates who took a particular pair of subjects, often aggregating over all possible pairs, to produce an estimate of each subject’s relative difficulty. See Coe (Citation2007) for further explanation.
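To make the idea concrete, here is a minimal sketch of one simple variant, assuming a table of candidate grades keyed by subject; the function name, the numeric grade coding and the use of mean pairwise grade differences are illustrative assumptions, not the exact procedures reviewed in Coe (Citation2007).

```python
from itertools import combinations
from collections import defaultdict

def subject_pairs_difficulty(results):
    """results: dict mapping candidate -> {subject: numeric grade}
    (e.g. A* = 8, A = 7, ...). Returns a rough relative-difficulty
    score per subject: higher means candidates who took that subject
    alongside others tended to be graded lower in it."""
    diffs = defaultdict(list)
    for grades in results.values():
        for s1, s2 in combinations(sorted(grades), 2):
            gap = grades[s1] - grades[s2]
            diffs[s1].append(-gap)  # s1 looks harder if graded below s2
            diffs[s2].append(gap)
    return {s: sum(v) / len(v) for s, v in diffs.items()}

# Illustrative data: two candidates and their grades.
candidates = {
    "c1": {"Maths": 6, "French": 5},
    "c2": {"Maths": 7, "French": 5, "Art": 7},
}
print(subject_pairs_difficulty(candidates))
```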

2. The words ‘difficulty’ and ‘ability’ are used generally in discussing the Rasch model, even when their normal meanings are considerably stretched. For example, in the context of a Likert scale attitude item one may talk about the ‘difficulty’ of an item to mean its tendency to be disagreed with (i.e. how ‘hard’ it is to agree with). The use of these words may initially seem strange to anyone not familiar with the Rasch model. However, I have adopted this convention, partly in order to comply with standard practice, and partly because although the words ‘difficulty’ and ‘ability’ are not quite right for the interpretation intended, I am unable to think of better ones.

3. The odds ratio is the ratio of the odds of the two probabilities. In other words, if a person has probabilities p and q of success on two items, the odds are p/(1 − p) and q/(1 − q) respectively. Hence the odds ratio is [ p/(1 − p) ] / [ q/(1 − q) ]. The logit function is

logit(p) = ln[ p/(1 − p) ]

so the log of the odds ratio is the same as the difference in the two logits, logit(p) − logit(q).
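As a quick numerical check of this identity (the values are purely illustrative):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

p, q = 0.8, 0.5
odds_ratio = (p / (1 - p)) / (q / (1 - q))
print(math.log(odds_ratio))   # 1.3862...
print(logit(p) - logit(q))    # identical: 1.3862...
```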

4. The partial credit model used here is:

ln( Pnij / Pni(j−1) ) = Bn − Di − Fij

where

  • Pnij is the probability that person n encountering item i is observed in category j;

  • Bn is the ability measure of person n;

  • Di is the difficulty measure of item i, the point where the highest and lowest categories of the item are equally probable;

  • Fij is the ‘calibration’ measure for item i of category j relative to category j‐1, the point where categories j‐1 and j are equally probable relative to the measure of the item. (Linacre, Citation2005b)

In WINSTEPS the partial credit model is invoked by treating each item as a separate group, using the specification ISGROUPS=0.
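To show how category probabilities follow from this formulation, the sketch below computes them directly from Bn, Di and the Fij; the parameter values and function name are illustrative, not taken from the paper or from WINSTEPS.

```python
import numpy as np

def pcm_category_probs(B, D, F):
    """Probabilities of categories j = 0..m under the partial credit
    model, where B is person ability, D is item difficulty and F is
    the list of step calibrations F_i1..F_im. Built directly from
    ln(Pnij / Pni(j-1)) = B - D - F_ij."""
    steps = np.concatenate(([0.0], B - D - np.asarray(F, dtype=float)))
    log_num = np.cumsum(steps)   # cumulative log-odds for each category
    log_num -= log_num.max()     # stabilise before exponentiating
    num = np.exp(log_num)
    return num / num.sum()

# Illustrative item with three steps (four categories).
print(pcm_category_probs(B=0.5, D=0.0, F=[-1.0, 0.2, 1.3]))
```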

5. WINSTEPS estimates of reliability are analogous to, but generally underestimates of, internal consistency measures such as Cronbach’s alpha (Linacre, Citation2005b).
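Since the note compares these reliabilities with Cronbach's alpha, a minimal reference implementation of alpha may be useful; the data are invented for illustration.

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = persons, columns = items.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Invented scores: five persons on three items.
print(cronbach_alpha([[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4], [5, 4, 5]]))
```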

6. There are different ways this might have been done. One of the reviewers of this paper suggested rescaling the logit scale such that the sample of rescaled person ability measures had the same standard deviation as the set of their mean GCSE scores (i.e. the mean score, for each person, of all subjects they have taken, using A* = 8, A = 7, etc.). The justification for this would be that if we interpret the Rasch measure as an indication of person ability, then it makes sense to say that the difference in ability between a person who achieves an average of, say, C grades and another who achieves an average of B grades should equal ‘one grade’ on our recalibrated scale. However, if we emphasise the interpretation of the Rasch measure as an indication of grade difficulty, as I have done, then ‘one grade’ on a recalibrated scale should represent the average difference between the difficulty of adjacent grades, across all grades and all subjects. Fortunately, the two are very close, so it makes little practical difference: the former method produces an ‘average grade’ interval that is 92% of the size of the latter.
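A minimal sketch of the reviewer's suggested rescaling, assuming arrays of person measures and mean GCSE scores (all names and values are illustrative, not from the paper):

```python
import numpy as np

def rescale_to_grade_units(rasch_measures, mean_gcse_scores):
    """Rescale logit person measures so that their standard deviation
    equals that of the candidates' mean GCSE scores (A* = 8, A = 7, ...),
    so one unit of the rescaled scale corresponds to 'one grade'."""
    rasch = np.asarray(rasch_measures, dtype=float)
    scale = np.std(mean_gcse_scores, ddof=1) / np.std(rasch, ddof=1)
    return rasch * scale

# Illustrative values only.
measures = [-1.2, -0.3, 0.1, 0.8, 1.5]
mean_scores = [4.2, 5.0, 5.3, 6.1, 6.8]
print(rescale_to_grade_units(measures, mean_scores))
```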

7. The latest (at the time of writing) rationale for the way fail grades are to be treated for value‐added purposes at A level is in LSC (Citation2006). An account of the controversy around this issue can be found on BBC News online at http://news.bbc.co.uk/1/hi/education/5134612.stm. The Contextual Value Added (CVA) model used in England is explained at http://www.standards.dfes.gov.uk/performance/word/GuidetoCVA2006.2.doc?version=1
