Abstract
Using an empirically based simulation study, we show that commonly used methods of choosing an item calibration sample have significant impacts on achievement bias and system rankings. We examine whether recent PISA accommodations, especially those for lower-performing participants, can mitigate some of this bias. Our findings indicate that standard operational methods, while not ideal, recover underlying proficiency reasonably well and generally outperform methods that more completely include all participants. Translated onto the PISA scale, the choice of calibration sample can induce bias of up to 12.49 points, which is consequential given that standard errors are around three points. Although ranking correlations are at least .95, we note the policy implications of even slight ranking changes. Our findings also indicate that limited accommodations targeted at low-achieving educational systems do not outperform either of the other methods considered. We recommend further research exploring accommodations for heterogeneous populations.
ACKNOWLEDGMENTS
We would like to thank Dr. Avi Allalouf, as editor, three anonymous reviewers, and Mr. Justin Wild for valuable feedback and comments on this manuscript. Any remaining errors are our sole responsibility.
Notes
Given the complexity and ambiguity in conceptions of the “nation-state,” particularly for city-states, non-national systems, and territories with disputed or ambiguous political status, we refer to PISA participating units as educational systems. Examples of geographic areas with special status include Dubai (an emirate within the United Arab Emirates), Taiwan (a geographic area with disputed or ambiguous political status), Hong Kong and Macao (special administrative regions of China), and Singapore (a city-state).