Abstract:
Some form of a short interrupted time series (ITS) is often used to evaluate state and national programs. An ITS design with a single treatment group assumes that the pretest functional form can be validly estimated and extrapolated into the postintervention period where it provides a valid counterfactual. This assumption is problematic. Ambiguous preintervention functional forms are common, as are other factors affecting posttest means and slopes. Using No Child Left Behind as an example, we demonstrate how adding multiple design elements to the basic ITS structure serves to promote causal inference by limiting alternative interpretations. No added design element is perfect by itself, but we argue that they collectively provide a strong causal warrant when the predictions they engender are complex, the results “cohere” with the predictions, and no alternative can fit the same pattern of predictions even if it can fit some of them.
Notes
We exclude New York because it uses its own state proficiency scale, which is not based on the 0% to 100% proficiency scale that other states use. We also exclude Vermont because it has no state assessment data for the years examined.
We also ran all analyses using a random effects model and got essentially the same results with respect to both point estimates and statistical significance levels.
The models were not weighted by student population or by the inverse sampling variance of the NAEP estimates, both because states are the unit of analysis and because the standard deviations vary little when examined separately across states and years.
We use NAEP-provided grade- and subject-specific standard deviations from individual student test score data.
Percentile rank gains reflect the number of ranks a state would have risen relative to other states by virtue of its NCLB gains. Gains in percentile rank are based on the distribution of state rank in 2002.
We translate the study's obtained effect sizes to months of learning. Analyses of nationally normed tests by Hill et al. (2007) show that the average annual test score gain in effect size from fourth to fifth grade is roughly 0.40 standard deviation units for reading and 0.56 for math. A much smaller average gain of 0.22 is observed from eighth to ninth grade in math. So an obtained effect size of 0.20 SD in fourth-grade reading translates to 6 months’ worth of learning based on the benchmark effect size of 0.40 (i.e., 0.20/0.40 × 12 months). The same obtained effect size translates into many more months of learning in eighth grade because the benchmark effect sizes are smaller.
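The conversion described in this note can be sketched as a small function. This is an illustrative sketch only: the benchmark annual gains (0.40 for fourth-grade reading, 0.56 for fourth-grade math, 0.22 for eighth-to-ninth-grade math) are the values reported from Hill et al. (2007) in the note, and the 12-month year follows the note's own arithmetic; the function name and dictionary are hypothetical.

```python
# Illustrative sketch of the effect-size-to-months conversion in this note.
# Benchmark annual gains (in SD units) are those cited from Hill et al. (2007);
# the helper name and the BENCHMARKS mapping are hypothetical conveniences.

BENCHMARKS = {
    ("reading", 4): 0.40,  # fourth-to-fifth-grade reading
    ("math", 4): 0.56,     # fourth-to-fifth-grade math
    ("math", 8): 0.22,     # eighth-to-ninth-grade math
}

def effect_size_to_months(effect_size, subject, grade):
    """Convert an obtained effect size (SD units) to months of learning,
    using the benchmark annual gain for that subject and grade and a
    12-month year, as in the note's example: 0.20/0.40 x 12 = 6 months."""
    annual_gain = BENCHMARKS[(subject, grade)]
    return effect_size / annual_gain * 12

# The note's worked example: 0.20 SD in fourth-grade reading -> 6 months.
print(round(effect_size_to_months(0.20, "reading", 4), 1))  # 6.0

# The same 0.20 SD against the smaller eighth-grade math benchmark
# yields many more months (0.20/0.22 x 12, roughly 10.9).
print(round(effect_size_to_months(0.20, "math", 8), 1))
```

The comparison of the two calls shows why a fixed effect size implies more months of learning at higher grades: the denominator (the benchmark annual gain) shrinks while the obtained effect size stays the same.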