Methodological Studies

How Much Do the Effects of Education and Training Programs Vary Across Sites? Evidence From Past Multisite Randomized Trials

Pages 843-876 | Received 13 May 2016, Accepted 24 Mar 2017, Published online: 05 Jun 2017
 

ABSTRACT

Multisite trials, in which individuals are randomly assigned to alternative treatment arms within sites, offer an excellent opportunity to estimate the cross-site average effect of treatment assignment (intent to treat or ITT) and the amount by which this impact varies across sites. Although both of these statistics are substantively and methodologically important, only the first has been well studied. To help fill this information gap, we estimate the cross-site standard deviation of ITT effects for a broad range of education and workforce development interventions using data from 16 large multisite randomized controlled trials. We use these findings to explore hypotheses about factors that predict the magnitude of cross-site impact variation, and we consider the implications of this variation for the statistical precision of multisite trials.

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D140012 to MDRC, and by the Spencer Foundation through Grant #201500035. The opinions expressed are those of the authors and do not represent the view of the funders.

Notes

1 We focus solely on cross-site variation in ITT effects for two reasons: (a) this is the logical starting point for quantifying impact variation, given that well-implemented multisite RCTs produce unbiased ITT estimates for each site and (b) the analytics of cross-site impact variation for complier average causal effects (CACE), in comparison with those for ITT, are more complex, less well developed, less well understood, and require more assumptions. In the future, we hope to expand the present analysis to cross-site variation in CACE.

2 Consequently, we excluded (a) cluster-randomized trials (CRTs) where, for example, classrooms were randomly assigned within sites (e.g., schools) because of their typically limited site-level precision; (b) CRTs for which whole schools were randomly assigned within sites (e.g., school districts) because of their inability to estimate average effects for individual schools; (c) regression discontinuity designs (RDDs) because of their typically limited site-level precision; and (d) cluster-level RDDs because of their doubly limited site-level precision. Exclusion of these types of studies may limit the substantive scope of our findings.

3 A super population of sites can be either finite or infinite.

4 Researchers do not all agree on who should be included in the inference population for a given study. Some researchers prefer making inferences to a super population, as is the case for this paper. For a super population inference, cross-site impact variation influences statistical precision. Other researchers prefer a “fixed” inference population, where the aim is to draw statistically based inferences only for the sites in a study sample. For a fixed population inference, statistical precision is not influenced by cross-site impact variation. Nonetheless, the magnitude of existing cross-site impact variation should be of broad substantive interest to researchers, practitioners, and policymakers. For detailed discussions of different inference populations, see, for example, Crits-Christoph, Tu, and Gallop (2003); Hedges and Rhoads (2009); Schochet (2015); Senn (2007); Serlin, Wampold, and Levin (2003); and Siemer and Joorman (2003).

5 See Raudenbush and Bloom (2015) and Raudenbush (2015) for discussions of related estimands.

6 For some studies, random assignment blocks were defined as the sites of interest (e.g., schools), so each site comprised a single random assignment block. For other studies, random assignment blocks were nested within sites (e.g., when random assignment was conducted separately for multiple grades and/or student cohorts from a given school), so each site comprised multiple random assignment blocks.

7 A common alternative is to allow the intercept to vary randomly across sites instead of including a fixed intercept for each random assignment block. However, as Bloom et al. (2017) note, “obtaining good estimates of impact variation using this approach requires strong assumptions about the randomly varying intercept, which are relaxed by…” the fixed intercept, random treatment coefficient model that we use here.

8 A distinct notation is used for the residual variance (after accounting for covariates and random assignment block indicators) referred to here, to distinguish it from the total outcome variance discussed later. Bloom et al. (2017) provide further information about this model, and Raudenbush and Bloom (2015) and Raudenbush (2015) explore its properties. Additional online supplementary materials provide SAS code for implementing this estimation model.
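For readers who want a concrete picture of this specification, the sketch below fits an analogous fixed-block-intercept, random-treatment-coefficient model in Python with statsmodels. It is only an illustrative stand-in for the SAS code in the supplementary materials; the simulated data and the column names (outcome, treat, x1, block, site) are assumptions made for this example.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data purely for illustration: 20 sites, 2 random assignment
    # blocks per site, 40 sample members per block, and a true cross-site
    # standard deviation of ITT effects of 0.10 standard deviation units.
    rng = np.random.default_rng(0)
    rows = []
    for site in range(20):
        site_effect = rng.normal(0.15, 0.10)   # site-specific ITT effect
        for block in range(2):
            block_id = f"s{site}_b{block}"
            for _ in range(40):
                treat = rng.integers(0, 2)
                x1 = rng.normal()
                y = 0.3 * x1 + site_effect * treat + rng.normal()
                rows.append(dict(outcome=y, treat=treat, x1=x1,
                                 block=block_id, site=site))
    df = pd.DataFrame(rows)

    # Fixed intercept for every random assignment block, a fixed covariate,
    # and a treatment coefficient that varies randomly across sites
    # (no randomly varying intercept).
    model = smf.mixedlm("outcome ~ C(block) + x1 + treat", data=df,
                        groups="site", re_formula="0 + treat")
    fit = model.fit(reml=True)

    # cov_re holds the estimated cross-site variance of the ITT effect;
    # its square root approximates the cross-site standard deviation.
    print(fit.cov_re)

The key feature, as in Note 7, is that only the treatment coefficient is allowed to vary randomly across sites, while every random assignment block receives its own fixed intercept.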

9 Although we defined sites as treatment locations, other possible definitions could reflect various combinations of treatment locations, student cohorts, school grade level, or additional factors.

10 There are two exceptions: For the Welfare-to-Work study, baseline covariates used in the estimation model were imputed by the original study team and nonimputed data were not available. For the national Head Start Impact Study, two baseline covariates (mother's age and an indicator for teen mom) used in the estimation model were imputed by the original study team and nonimputed data were not available.

11 This approach can, however, introduce bias for observational studies (Jones, 1996).

12 In theory, differential treatment and control group attrition that varies across sites has the potential to bias our estimator of cross-site impact variation. Although a full analysis of this issue is beyond the scope of this paper, a selective analysis for two studies for which it was possible to readily obtain the requisite data suggests that it was not a problem. Please contact the authors for further details.

13 In both cases, for multigrade studies we calculate effect sizes in reference to within-grade variability only. That is, z-scoring is done separately within grade. In addition, when the study's control group is the reference group, z scores are calculated separately within metric. For example, if different assessments (e.g., different state tests) were administered to different sample members, z scores were calculated separately for sample members taking the different assessments.
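A minimal sketch of this within-grade, within-metric, control-group-based z-scoring is shown below in Python; the data frame and column names (score, grade, assessment, treat) are assumptions for illustration rather than the actual variable names in the study data sets.

    import pandas as pd

    def control_based_z(df):
        """Return a copy of df with a 'z' column: each score standardized
        against the control group's mean and SD, computed separately
        within each grade and assessment (metric)."""
        ctrl = (df.loc[df["treat"] == 0]
                  .groupby(["grade", "assessment"])["score"]
                  .agg(mean="mean", sd="std")
                  .reset_index())
        out = df.merge(ctrl, on=["grade", "assessment"], how="left")
        out["z"] = (out["score"] - out["mean"]) / out["sd"]
        return out.drop(columns=["mean", "sd"])

    # Illustrative call with made-up values.
    example = pd.DataFrame({
        "score": [480, 520, 510, 495, 530, 505],
        "grade": [6, 6, 6, 7, 7, 7],
        "assessment": ["state_A"] * 6,
        "treat": [0, 0, 1, 0, 0, 1],
    })
    print(control_based_z(example))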

14 Ideally, we would always standardize outcomes relative to their distribution for a national population. For state assessments, this could be accomplished using methods developed by Reardon, Kalogrides, and Ho (2016). However, in this paper this was not possible for the studies using state assessments because, to protect the confidentiality of their study participants, most data files did not include state identifiers.

15 For example, if the sample members in a trial are homogeneous, with little variation in their outcomes, then variation in treatment effects in control-group-based z-score units will look relatively larger than its counterpart stated in reference-population-based z-score units. This same problem exists, though we expect to a lesser extent, with estimates from studies that standardize with respect to reference populations from different states, or with comparisons of effects from a state-normed estimate to effects from a nationally normed estimate.

16 Two exceptions are the Head Start Impact Study (HSIS) and the Tennessee Student/Teacher Achievement Ratio (STAR) study. For HSIS, we were not able to obtain the mean and standard deviation from a national norming sample for the PPVT, the Externalizing Behavior scale, or the Self-Regulation scale, so these measures are not included in the reference-population-based results. For Tennessee STAR, we were not able to obtain information for a relevant reference population, so we do not present its estimates in those results either. Results for all outcomes that we analyzed (whether or not they are reported in reference-population-based units) are presented in control-group-based standardized units.

17 The estimated mean treatment effects from our analyses are generally consistent with the estimated overall average treatment effects reported by the original studies.

18 These site-specific effect-size estimates are constrained to ensure a cross-site variance that equals the estimated cross-site variance of effect sizes from our analysis. This constraint adjusts for the fact that conventional empirical Bayes estimates tend to understate true impact variability across sites (Bloom et al., 2017; Raudenbush & Bryk, 2002).
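One simple way to impose the kind of variance-matching constraint described here is sketched below; it assumes that shrunken empirical Bayes site-effect estimates and a model-based estimate of the cross-site variance are already in hand, and it is meant only as an illustration of the idea, not as the authors' exact procedure.

    import numpy as np

    def rescale_eb_estimates(eb_effects, tau2_hat):
        """Rescale empirical Bayes site-effect estimates so that their
        cross-site variance equals the model-based estimate tau2_hat,
        offsetting the understatement of variability caused by shrinkage."""
        eb = np.asarray(eb_effects, dtype=float)
        center = eb.mean()
        deviations = eb - center
        scale = np.sqrt(tau2_hat / deviations.var())  # > 1 when estimates are shrunken
        return center + scale * deviations

    # Example: shrunken estimates with too little spread, stretched to match
    # a model-based cross-site variance of 0.01 (a cross-site SD of 0.10).
    print(rescale_eb_estimates([0.10, 0.12, 0.15, 0.18, 0.20], tau2_hat=0.01))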

19 To provide findings that were as comparable as possible across studies in our analysis, we reported all estimated impacts on earnings in 2015 dollars.

20 We report our findings for SSCs in terms of the effect of intent to treat in order to be consistent with the findings we report for all other studies. Hence, the magnitudes of present SSC findings are smaller than those reported by the original authors (e.g., Bloom & Unterman, 2014), which are in terms of local average treatment effects.

21 Notably, in HSIS, After School Reading and Math, and ERO, four of the studies where the reference-population-based estimates are very similar to the control-group-based estimates, all students took the same study-administered assessment. In contrast, in Charters, TFA-Math, and TFA-pooled, students took tests on different scales if they were in different grades or states. The resulting z scores may contribute to the observed differences between the two reference groups' estimates for Charters and TFA-Math.

22 Given the complex processes required to estimate these parameters, and especially to estimate their uncertainty (asymmetric confidence intervals), it is not clear how to correct for this bias.

23 We do not focus on the other two Cs proposed by Weiss et al. (2014), client characteristics or context. We do not consider the ability of cross-site variation in client characteristics to predict the magnitude of cross-site impact variation because: (a) there is little existing general theory upon which to base such predictions; (b) prior empirical research on program impact variation across subgroups defined by client characteristics has been especially difficult to replicate, which calls into question the predictive value of those characteristics (e.g., Rothwell, 2005; Tipton, Yeager, Schneider, & Iachan, forthcoming; Yusuf, Wittes, Probstfield, & Tyroler, 1991); and (c) most variation in client characteristics (e.g., prior student achievement) is typically within sites instead of between sites (e.g., Hedges & Hedberg, 2014). We do not consider the ability of cross-site variation in program context to predict the magnitude of cross-site impact variation because, here again, (a) there is little existing general theory to rely on; and (b) when researchers, policymakers, or practitioners focus on program context, they often have in mind the quality and/or quantity of alternative services to control group members, which is part of a program's service contrast as we define it.

24 Here, we are defining the multidimensional service contrast with respect to the entire education/job training production function, even those parts of the production function that the program is not intended to influence.

25 Let $P_j$ represent the average services/activities experienced under the program condition at site $j$, and let $C_j$ represent the average services/activities experienced under the counterfactual condition at site $j$. By definition, the average service contrast at site $j$ is $S_j = P_j - C_j$. Therefore, $P_j = C_j + S_j$.

26 It also seems reasonable that, all else being equal, programs that are not high intensity will have smaller average effects.

27 With respect to outcome timing, to the extent that impacts tend to fade over time, impact variation will fade as well. With respect to outcome type, test scores are often normally distributed, earnings are often zero-inflated and heavily skewed, and binary outcomes have an altogether different distribution, which in turn can change the meaning of a standardized metric.

28 Although we recognize that such cluster adjustments may not be optimal for the small number of clusters (studies) in our analysis, we do not believe that the highly exploratory nature of the analysis reported here warrants more complex inference approaches.
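As a concrete illustration of this kind of cluster adjustment, the sketch below runs a study-level regression with standard errors clustered by study in Python; the simulated data and the column names (impact_sd, predictor, study) are assumptions for this example, not the authors' variables.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated stand-in for the roughly 40 outcome-level observations,
    # clustered within 16 studies.
    rng = np.random.default_rng(1)
    ols_df = pd.DataFrame({
        "study": np.repeat(np.arange(16), 3)[:40],
        "predictor": rng.normal(size=40),
    })
    ols_df["impact_sd"] = 0.10 + 0.02 * ols_df["predictor"] + rng.normal(0, 0.03, size=40)

    # OLS with standard errors clustered by study.
    fit = smf.ols("impact_sd ~ predictor", data=ols_df).fit(
        cov_type="cluster", cov_kwds={"groups": ols_df["study"]})
    print(fit.summary())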

29 Several outcomes were excluded because they were the only outcome of a particular type. These include: socioemotional outcomes (for the Head Start Impact Study), the number of months employed (for the Career Academies and Job Corps studies), whether sample members were ever arrested (for the Job Corps Study), and the two outcomes measured for Communities in Schools.

30 The predicted values are for a case with the average value of the covariates for the 40 observations included in the regression model.

31 Only three studies meet our strict definition of high specificity, which limits any conclusions about this hypothesis.

32 See, for example, Bloom and Spybrook (2017/this issue), Dong and Maynard (2013), or Schochet (2008) for a discussion of this expression for individually randomized multisite trials. It is a good approximation for many situations and assumes, for simplicity, an equal number of sample members at every site, an equal proportion of sample members at every site randomized to treatment, and an individual-level residual outcome variance that is the same for all sites and for treatment and control group members. Bloom and Spybrook (2017/this issue) note that for studies with sites that vary in terms of their sample sizes and proportions of sample members randomized to treatment, the harmonic mean value of these parameters is usually an appropriate value to use for approximating a minimum detectable effect or effect size.
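A common form of this precision approximation for individually randomized multisite trials, written here with assumed notation ($J$ sites, $n$ sample members per site, proportion $T$ assigned to treatment, cross-site impact variance $\tau^2$, individual-level residual variance $\sigma^2$, and minimum detectable effect size multiplier $M_{J-1}$), is

    MDES \approx M_{J-1}\sqrt{\frac{\tau^2}{J} + \frac{\sigma^2}{J\,n\,T(1-T)}}

The first term under the radical reflects cross-site impact variation and the second reflects within-site sampling error, which is why cross-site impact variation affects the precision of super-population inferences (see Note 4).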

33 This represents the total control-group variation in the outcome measure within and between random assignment blocks.

34 Pages 158 and 159 of Bloom (2005) explain why the multiplier for a minimum detectable effect ($M$) equals $t_{\alpha/2} + t_{1-\beta}$, where $t_{\alpha/2}$ is the critical t value for a two-tailed hypothesis test and $t_{1-\beta}$ is the t value for power equal to $1-\beta$.
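For example, with many degrees of freedom, a two-tailed test at $\alpha = .05$, and power of .80, the multiplier is approximately $1.96 + 0.84 = 2.80$.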

35 Note that the expression, , is equivalent to the term .
