Abstract
The accuracy of achievement test score inferences largely depends on the sensitivity of scores to instruction focused on tested objectives. Sensitivity requirements are particularly challenging for standards-based assessments because a variety of plausible instructional differences across classrooms must be detected. For this study, we developed a new method for capturing the alignment between how teachers bring standards to life in their classrooms and how the standards are defined on a test. Teachers were asked to report the degree to which they emphasized the state's academic standards, and to describe how they taught certain objectives from the standards. Two curriculum experts judged the alignment between how teachers brought the objectives to life in their classrooms and how the objectives were operationalized on the state test. Emphasis alone did not account for achievement differences among classrooms. The best predictors of classroom achievement were the match between how the standards were taught and tested, and the interaction between emphasis and match, indicating that test scores were sensitive to instruction of the standards, but in a narrow sense.
ACKNOWLEDGMENTS
Preparation of this article was supported by a grant from the Arizona Department of Education (Agreement No. 05-25-ED) to conduct research on the Arizona Instrument to Measure Standards (AIMS) tests.
Notes
1. We use the terms instructional sensitivity and instructional validity interchangeably throughout this article.
2. Before 2005, AIMS exams were administered in Grades 3, 5, 8, and high school. Sampled students were administered the Stanford Achievement Test, Ninth Edition, in reading and mathematics in fourth grade.
3. Besides masking the study purpose, our initial intent was to develop a factor indicating the degree of curricular dilution in each classroom from the five third-grade POs on the survey, and an indicator of teacher exaggeration from the five eighth-grade or high school objectives. After attempts to develop each scale with Rasch analysis failed owing to very low internal consistency and item misfit, we forwent further analysis of those survey items. In a sense, we interpreted the inability to scale the off-grade POs as an indication that teachers did, indeed, focus mostly on the POs they were expected to teach their students. However, the small number of items per factor likely contributed to the scaling problems as well.