Abstract
We demonstrate how to assess the potential changes to a test's score scale necessitated by changes to the test specifications when a field study is not feasible. We used a licensure test, which is currently under revision, as an example. We created two research forms from an actual form of the test. One research form was developed with the current specifications and the other was developed with the redesigned (new) specifications in terms of the proportion (not number) of items in each category. We examined whether the current and redesigned tests measure the same construct and have the same level of reliability. Then we used subpopulation invariance indices to assess the equatability of the redesigned test to the current test using data sets from actual operational administrations of the current test. The results suggest that the change in test specifications might be great enough that the current and redesigned test scores could not be considered exchangeable across the score range. However, the score scales for the two tests appear to coincide in the cut-score region.
Notes
1. Three problematic items were not scored in this form.
2. We could not assemble those forms without any item overlap across them because of item shortages, particularly in categories I and IV.
3. Not all stakeholders use the same cut scores for the test.
4. We also calculated all deviance measures at the raw-score level. Although the patterns were similar, the magnitude of those measures was generally lower at the raw-score than at the scaled-score level. We can make these results available upon request.
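A common deviance measure of this kind is the root mean squared difference (RMSD) between subgroup equating functions and the total-group equating function at each score point. As a hypothetical sketch only (the function name, weighting scheme, and values below are illustrative and not the exact computation used in the study):

```python
def rmsd(total_equated, subgroup_equated, weights):
    """Root mean squared difference between subgroup equating results and
    the total-group result at a single score point.

    total_equated:    total-group equated score at this score point
    subgroup_equated: list of subgroup equated scores at the same point
    weights:          subgroup weights summing to 1
    """
    return sum(
        w * (e - total_equated) ** 2
        for e, w in zip(subgroup_equated, weights)
    ) ** 0.5

# Illustrative values: two equally weighted subgroups whose equated
# scores bracket the total-group value of 25.0.
print(round(rmsd(25.0, [24.5, 25.5], [0.5, 0.5]), 3))  # 0.5
```

Computing this quantity at every raw or scaled score point yields the profile of deviance across the score range referred to above.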
5. These estimates rely on computations similar to the Spearman-Brown prophecy formula, which is used to estimate the reliability of a shortened or lengthened test form. One of the major assumptions of this methodology is that the current and redesigned tests comprise multiple, identical building blocks (either items or groups of items). By employing these formulas, we can estimate the properties of a test with the same content areas, but in different proportions than the current test.
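The Spearman-Brown prophecy formula itself can be sketched as follows (the function name and the illustrative reliability of .80 are ours, not values from the study):

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened (or shortened)
    by the given factor, per the Spearman-Brown prophecy formula:
    rho' = k * rho / (1 + (k - 1) * rho)."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )

# Illustrative example: doubling a test whose reliability is .80.
print(round(spearman_brown(0.80, 2.0), 3))  # 0.889
```

The same identical-building-block assumption noted above underlies this formula: each added block must be parallel to the existing ones.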
6. We computed expected classification consistency using a simple model with two dichotomized, identically distributed, bivariate normally distributed random variables with correlation set equal to the test reliability of .89. These expected results are consistent with more complex models (e.g., see Keats, 1957; Livingston & Lewis, 1995).
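When both variables are dichotomized at their common median, this simple model has a closed-form answer via the bivariate normal orthant probability. The sketch below assumes a median cut for illustration; the study's actual cut score need not fall at the median, so the result is indicative only:

```python
import math

def consistency_at_median_cut(rho):
    """Expected classification consistency for two identically distributed,
    bivariate-normal scores dichotomized at their shared median.
    Closed form: P(agreement) = 1/2 + arcsin(rho) / pi."""
    return 0.5 + math.asin(rho) / math.pi

# With correlation equal to the reported reliability of .89:
print(round(consistency_at_median_cut(0.89), 3))  # 0.849
```

For cut scores away from the median, the agreement probability requires the general bivariate normal CDF rather than this closed form.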