Abstract
This article outlines an empirical investigation into equivalent forms reliability using a case study of a national curriculum reading test. Within the situation being studied, there has been a genuine attempt to create several equivalent forms, and so it is of interest to compare the actual relationship between these forms to the relationship that would be expected theoretically. As such, this article not only reports the estimated levels of reliability based on several equivalent forms but also compares these to the levels that would be expected from internal reliability estimates. Reliability is quantified both in terms of correlations and in terms of decision consistency. The results of the analysis show relatively good agreement between internal and equivalent forms estimates of reliability. The small discrepancies that do exist tend to involve the internal estimates of reliability being higher than the equivalent forms estimates. Differences between the estimates are likely to be caused by sources of measurement error that are not captured in internal estimates. Within the situation under investigation, these may include marking and occasion-related errors.
Notes
1. The actual terminology used in the 1904 Spearman paper is somewhat different to that used in modern reliability literature and in fact the term ‘reliability’ is not used at all. Furthermore, Spearman’s paper was concerned with quantitative measures of all kinds and not purely with educational assessment. However, the ideas underpinning the modern understanding of reliability are clearly present.
2. Technically EFR will apply to a pair of tests. However, our focus throughout this paper is the reliability of the pre-test, as in practice we are most interested in getting an idea of how reliably this form of the test would perform if we chose to use it in a high-stakes setting. We are less interested in the reliability of the other test forms that will be introduced later in this paper and as such will not discuss these in any detail.
3. In all of these studies internal reliability was calculated via coefficient alpha or KR21 (Kuder and Richardson 1937).
4. The remaining pupils instead took a writing test (the second half of a key stage 2 English test). This was done in order to gather data about the overall functioning of the English tests.
5. Statistics on teacher assessment levels are not included in this table as it is not a test form.
6. Both the pre-test and the anchor test were marked by the same group of markers although given the number of markers involved in such an exercise it is unlikely that a pupil’s pre-test would be marked by the same marker as their anchor test. All of those involved in marking these tests were also involved in the (much larger) national marking exercise for the live key stage 2 reading tests and could be assumed to be of a similar standard.
7. Newton (2009) published a correlation of 0.80 between scores on the pre-test of the 2006 key stage 2 reading test and the live 2005 key stage 2 reading test. This indicates that the level of external reliability found for our reading test is typical of the level that we would expect.
8. The exact equation is y = m + α(x − m) + ε, where y is the score on the equivalent form, x is the score on the current form, α is the reliability of the current form, m is the mean score on the current form, and ε is a normally distributed error term with a mean of zero and a standard deviation equal to the standard deviation of x multiplied by √(1 − α²).
9. This of course cannot be exactly true for the test in question, as test scores must be whole numbers. Within the technique presented here this is overcome simply by rounding predicted test scores to the nearest whole number.
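The simulation described in notes 8 and 9 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual code: the 50-mark maximum, the sample size, and the reliability value of 0.88 are assumed purely for demonstration. Because the regression slope equals α and the error standard deviation is σₓ√(1 − α²), the simulated form has (approximately) the same mean and standard deviation as the current form, and the correlation between the two forms is approximately α.

```python
import numpy as np

rng = np.random.default_rng(42)


def simulate_equivalent_form(x, alpha, rng):
    """Simulate equivalent-form scores via y = m + alpha*(x - m) + eps,
    where eps ~ N(0, sd(x) * sqrt(1 - alpha**2)), then round to whole marks
    (note 9: test scores must be whole numbers)."""
    m = x.mean()
    eps = rng.normal(0.0, x.std() * np.sqrt(1.0 - alpha**2), size=x.shape)
    y = m + alpha * (x - m) + eps
    return np.rint(y).astype(int)


# Illustrative "current form" scores on a hypothetical 50-mark test.
x = rng.normal(30, 8, size=10_000).clip(0, 50)
alpha = 0.88  # assumed internal reliability estimate (illustrative)

y = simulate_equivalent_form(x, alpha, rng)

# The correlation between the two forms should be close to alpha.
print(round(float(np.corrcoef(x, y)[0, 1]), 2))
```

In practice the rounding in note 9 slightly attenuates the simulated correlation, but for a test of this length the effect is negligible.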
10. Similar analysis for the pre-test of the 2006 key stage 2 reading test in Newton (2009) found a consistency level of 73%.