Methodological Studies

Using Data from Randomized Trials to Assess the Likely Generalizability of Educational Treatment-Effect Estimates from Regression Discontinuity Designs

Pages 488-517 | Received 10 Oct 2018, Accepted 14 May 2019, Published online: 24 Jan 2020

Abstract

This article assesses the likely generalizability of educational treatment-effect estimates from regression discontinuity designs (RDDs) when treatment assignment is based on academic pretest scores. Our assessment uses data on outcome and pretest measures from six educational experiments, ranging from preschool through high school, to estimate RDD generalization bias. We then compare those estimates (reported as standardized effect sizes) with the What Works Clearinghouse (WWC) standard for acceptable bias size (≤ 0.05σ) for two target populations, one spanning a half–standard deviation pretest-score range and another spanning a full–standard deviation pretest-score range. Our results meet this standard for all 18 study/outcome/pretest scenarios examined given the narrower target population, and for 15 scenarios given the broader target population. Fortunately, two of the three exceptions represent pronounced “ceiling effects” that can be identified empirically, making it possible to avoid unwarranted RDD generalizations, and the third exception is very close to the WWC standard.

Acknowledgments

The authors thank Michael Weiss, Pei Zhu, Kristin Porter, Himani Gupta, Daniel Cullinan, and Alec Gilfillan for their valuable input.

Notes

1 Some early RDD analyses (see Cook & Campbell, Citation1979 for a discussion) used what are now considered to be strong assumptions about the functional form of the relationship between mean outcomes and ratings to generalize treatment-effect estimates beyond an RDD cut-point.

2 For example, Tipton, Yeager, Iachan, and Schneider (Citation2019) note the frequent inability to replicate experimentally estimated treatment-effect differences for demographic subgroups.

3 By true rating we mean whatever an RDD rating measures systematically, regardless of what it is intended to measure. Hence, we focus on the reliability of an RDD rating, not its validity.

4 This approximation improves as the interval around the RDD cut-point decreases.

5 This result reflects the well-known phenomenon of attenuation bias or errors-in-variables bias in an estimated regression coefficient due to random error in the independent variable for that coefficient (e.g., Angrist & Pischke, Citation2015, pp. 240–241; Wooldridge, Citation2009, p. 320).
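As an illustration of this attenuation, the following short simulation (ours, not the authors'; the variable names and parameter values are assumptions chosen for illustration) regresses an outcome on a noisy version of a true rating and recovers a slope shrunk toward zero by roughly the reliability ratio.

```python
# Minimal sketch of attenuation (errors-in-variables) bias: random error in the
# independent variable shrinks its estimated coefficient toward zero by roughly
# the reliability ratio var(true) / (var(true) + var(error)).
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

true_rating = rng.normal(0.0, 1.0, n)                    # latent "true" rating
outcome = 2.0 + 0.5 * true_rating + rng.normal(0.0, 1.0, n)

error_sd = 0.5                                           # measurement error in the observed rating
observed_rating = true_rating + rng.normal(0.0, error_sd, n)

def ols_slope(x, y):
    """Slope from a simple bivariate OLS regression of y on x."""
    return np.polyfit(x, y, 1)[0]

reliability = 1.0 / (1.0 + error_sd ** 2)                # var(true) / var(observed)
print(f"slope on true rating:     {ols_slope(true_rating, outcome):.3f}")      # about 0.50
print(f"slope on observed rating: {ols_slope(observed_rating, outcome):.3f}")  # about 0.50 * reliability
print(f"reliability ratio:        {reliability:.3f}")                          # about 0.80
```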

6 Chaplin et al. (Citation2018) examine the internal validity of RDD treatment-effect estimates in practice through a meta-analysis of 15 within-study comparisons of RDD findings to their RCT counterparts. In doing so, the authors estimate the mean and variation of RDD bias across studies. They use the mean RDD bias as a summary measure of the internal validity of the method for the range of situations examined. They use variation in assessed RDD bias as a measure of the external validity of the estimated mean bias for assessing the likely internal validity of a specific RDD. Their focus on the external validity of a bias assessment of the RDD method differs from the present focus on the likely generalizability of RDD treatment-effect estimates.

7 The authors also note that a second empirical test of the validity of such nonexperimental estimators can be constructed by comparing a properly weighted nonexperimental estimate of the mean treatment effect at the RDD cut-point with its RDD counterpart. This weighting must match the distribution of covariate values at the RDD cut-point.

8 These covariates were not part of the rating variable used to assign students to the elite exam schools.

9 The authors demonstrate, for outcome/rating functions that are continuous and continuously differentiable immediately below and above an RDD cut-point, with a mean outcome under the treated condition of $E(Y_1)$, a mean outcome under the untreated condition of $E(Y_0)$, an observed rating of $R$, and a mean treatment effect of $\tau$ (all at the RDD cut-point), that $\frac{d\tau}{dR} = \frac{dE(Y_1)}{dR} - \frac{dE(Y_0)}{dR}$.

10 Consider the following local linear regression for a given bandwidth around an RDD cut-point with a uniform kernel: $Y_i = \alpha + \beta R_i + \gamma T_i + \delta T_i R_i + e_i$, where $R_i$ is the rating for subject $i$, $T_i$ equals one if subject $i$ was assigned to treatment and zero otherwise, and $e_i$ is a random error that is distributed independently and identically across subjects with a mean of zero. In this case, $\frac{d\tau}{dR} = \frac{dE(Y_1)}{dR} - \frac{dE(Y_0)}{dR} = (\beta + \delta) - \beta = \delta$.
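To make the mechanics concrete, here is a minimal sketch (ours, not the article's code) of estimating this specification on simulated data with a uniform kernel inside an assumed bandwidth around the cut-point; the data-generating process, bandwidth, and variable names are illustrative assumptions.

```python
# Sketch of a local linear RDD regression with a treatment-by-rating interaction.
# gamma estimates the treatment effect at the cut-point; delta estimates d(tau)/dR.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data (assumed data-generating process, for illustration only).
n = 5000
rating = rng.normal(0.0, 1.0, n)           # rating R_i, centered so the cut-point is 0
treat = (rating >= 0).astype(float)        # T_i = 1 if assigned to treatment
true_effect = 0.30 + 0.10 * rating         # treatment effect varies linearly with the rating
y = 0.50 + 0.40 * rating + true_effect * treat + rng.normal(0.0, 1.0, n)

# Uniform kernel: keep only observations within the chosen bandwidth of the cut-point.
bandwidth = 0.75
in_window = np.abs(rating) <= bandwidth
R, T, Y = rating[in_window], treat[in_window], y[in_window]

# Y_i = alpha + beta*R_i + gamma*T_i + delta*T_i*R_i + e_i
X = sm.add_constant(np.column_stack([R, T, T * R]))
fit = sm.OLS(Y, X).fit()
alpha, beta, gamma, delta = fit.params

print(f"gamma (treatment effect at the cut-point): {gamma:.3f}")   # about 0.30
print(f"delta (d tau / dR):                        {delta:.3f}")   # about 0.10
```

Under this assumed data-generating process, the interaction coefficient recovers how the treatment effect changes per unit of the rating, which is the quantity the note identifies with $\delta$.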

11 Online Appendix A presents supplementary tables for the present article, and Online Appendix B describes how we constructed our analysis samples and presents a detailed examination of sample attrition and treatment/control group pretest balance. Those results demonstrate that attrition rates for all studies are within or near the What Works Clearinghouse standard for “low attrition” (U.S. Department of Education, Institute of Education Sciences, Citation2017) and that treatment/control group pretest balance is excellent for all studies.

12 All test-score outcomes except the PPVT test of receptive vocabulary for the Head Start Impact Study could be standardized as a broad-based z score.

13 Randomized treatment assignment was blocked by a combination of site, student grade, and/or student cohort for the studies in our analysis.

14 Online Appendix Tables A.2 and A.3 report standard errors for our estimates of mean treatment effects for each of the three bins and each of the five bins in each scenario.

15 A ceiling effect can be identified for positive impacts but not for negative effects.

16 In our judgment, meeting the first criterion without meeting the second does not constitute consistent evidence of a non-monotonic relationship between treatment effects and pretest scores.

17 It is theoretically possible that, with no overall impact/pretest-score covariation and no variation in mean impacts across pretest-score bins, impact variation within bins could produce problematic RDD generalization bias. However, this would require: (1) a large and abrupt impact aberration within a bin, which seems unlikely, plus (2) an RDD cut-point that falls on this impact aberration, which also seems unlikely. The joint occurrence of these two unlikely conditions is thus very unlikely.

18 These percentiles are for the non-parametric sample. For more detail, Online Appendix Tables A.6 and A.7 report median pretest scores for each of the three bins and each of the five bins.

19 Two important exceptions to this general treatment assignment tendency are the Thistlethwaite and Campbell (Citation1960) study of the effects of National Merit Scholarships (which first introduced the regression discontinuity design) and the Angrist and Rokkanen (Citation2015) study of Boston’s elite exam schools discussed earlier.

20 The three characterizations of our bias predictions discussed here hold for the full range of possible pretest scores when treatment effects are a monotonic function of pretest scores, as in our linear model. These characterizations also hold for potentially large portions of the pretest-score range when treatment effects are a non-monotonic function of pretest scores, as in our quadratic model, the results of which are discussed later.

21 Sensitivity tests that we conducted indicate that our quadratic GENBIAS predictions for the one scenario with evidence of a quadratic relationship between treatment effects and pretest scores vary somewhat for moderate changes in the value of P1.

22 WWC standards for acceptable bias from sample attrition should apply equally well to bias from RDD generalization, because both biases represent the distance between the expected value of an estimator and the actual value of its estimand, where an estimand represents a specific parameter for a specific population.
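Stated in notation (ours, added for clarity), if $\hat{\tau}$ is an estimator of an estimand $\tau$ defined for a specific target population, both forms of bias are the same quantity,

$$\mathrm{Bias}(\hat{\tau}) = E(\hat{\tau}) - \tau,$$

which is why a single acceptability threshold can reasonably be applied to either source of bias.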

23 It is interesting to note that two of the three previous studies of RDD generalizability that we discussed earlier (Angrist & Rokkanen, Citation2015; Wing & Cook, Citation2013) found empirical support for RDD generalizability, whereas the third study (Dong & Lewbel, Citation2015) found evidence of RDD generalization bias.

24 Because the reliability estimates we found were based on different reliability measures for different pretests (e.g., internal consistency, split-half, or test-retest reliability), these findings are not fully comparable.

25 We thank Mike Weiss for this suggestion.

Additional information

Funding

This article was supported by grant RD305D140012 from the Institute of Education Sciences and grant 201500035 from the Spencer Foundation. However, all views and information presented are the sole responsibility of the authors.
