ABSTRACT
Differential item functioning (DIF) detection procedures provide validity evidence for proposed interpretations of test scores, helping researchers and practitioners ensure that scores are free from potential bias and that individual items do not advantage one subgroup of examinees over another. In this study, we use the Rasch separate calibration t-test method to examine how different levels of contrast, at varying levels of statistical significance, affect the flagging of items across multiple examination administrations. We assert that if DIF is a stable trait of an item, it should be sample-independent and detected each time the item is administered. We therefore examine how consistently different alpha levels and critical values identify DIF in the same items across multiple administrations. Our results suggest that, under our most lenient criteria, approximately 40% of the items on any single administration may be flagged for DIF, but that this figure drops to 20% when items must be flagged across two administrations, and to 12% across three. Testing organizations can use the methods illustrated here to set their own thresholds for DIF, which may be useful when estimating the time and cost of having independent reviewers examine flagged items.
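For readers unfamiliar with the separate calibration approach, the standard formulation computes, for each item, the contrast between difficulty estimates obtained from independent calibrations of the reference and focal groups, divided by the pooled standard error: t = (d_focal - d_ref) / sqrt(se_ref^2 + se_focal^2). The sketch below is a minimal illustration of this logic, not the authors' implementation; the difficulty estimates, the 0.5-logit contrast threshold, the critical value of 1.96, and the cross-administration rule are all hypothetical values chosen for demonstration.

```python
import math

def separate_calibration_t(d_ref, se_ref, d_focal, se_focal):
    """Rasch separate-calibration t statistic for one item.

    d_ref, d_focal: item difficulty estimates (in logits) from
    independent calibrations of the reference and focal groups.
    se_ref, se_focal: the corresponding standard errors.
    """
    contrast = d_focal - d_ref  # DIF contrast in logits
    t = contrast / math.sqrt(se_ref**2 + se_focal**2)
    return contrast, t

def flag_dif(contrast, t, min_contrast=0.5, critical_t=1.96):
    """Flag an item when the absolute contrast exceeds the logit
    threshold AND |t| exceeds the two-tailed critical value.
    (Illustrative defaults; actual thresholds are study-specific.)"""
    return abs(contrast) >= min_contrast and abs(t) >= critical_t

def persistent_dif(flags_by_admin, k=2):
    """Treat DIF as a stable item trait only if the same item is
    flagged in at least k independent administrations."""
    return sum(flags_by_admin) >= k

# Hypothetical item calibrated separately for two subgroups:
contrast, t = separate_calibration_t(d_ref=0.12, se_ref=0.08,
                                     d_focal=0.71, se_focal=0.09)
print(flag_dif(contrast, t))  # True: contrast = 0.59 logits, t ~ 4.9

# The same item flagged in two of three administrations:
print(persistent_dif([True, False, True], k=2))  # True
```

Requiring an item to be flagged in multiple administrations, as in persistent_dif above, reflects the study's premise that sample-dependent flags in a single administration may be spurious.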
Acknowledgements
We would like to thank Bradley Brossman, Robert Furter, and Andrew Jones for their helpful comments on previous versions of this manuscript.
Note
Each author contributed equally to this study; authors are listed alphabetically.