Abstract
There is growing pressure to make efficacy experiments more useful. This requires attending to the twin goals of generalizing experimental results to the schools that will use those results and testing the intervention's theory of action. We show how electronic records, created naturally during the daily operation of technology-based interventions, contain the information needed to pursue these twin goals. These records allow researchers to define the population of schools considering adoption of an intervention and to plan an experiment that generalizes to these schools. They also allow researchers to identify schools likely to fully implement the intervention, so that the theory of action can be properly tested. Designing experiments to address these goals involves many tradeoffs and requires prioritizing among the different purposes of the planned experiment. We discuss these challenges, linking experimental purposes with design decisions.
Notes
1 Our target population is the unknown potential users, not the current users as in Tipton et al. (2014), a subtle difference that motivates some of the analytic choices we discuss. When the target population is current users, we argue for quasi-experimental approaches because of (a) their greater external validity, (b) their lower cost, and (c) changes to program implementation that may occur over time.
2 The very results of the experiment may thus disrupt the diffusion of a program, but given the weak role that experimental evidence plays in adoption decisions (Nelson et al., 2009), this seems unlikely in most cases.
3 SchoolDigger scrapes publicly available state data. We spot-checked test scores and found that SchoolDigger's variables correlated with the corresponding CCD variables above 0.995.
4 Because nonparticipation is assumed random, the simulation finds the Tipton (2013) approach and the random sampling approach to be effectively equal, so we show only random sampling.
5 We could use a risk set matching approach to match each BURST school with schools in the population frame in its risk set within a specified caliper (see Rosenbaum, 2009). This would arguably create a more accurate propensity score for the PUB by incorporating the time-varying nature of BURST adoption into the analysis. Sampling could then follow the recommendations of Tipton (2013). However, this would significantly complicate a design that is already more complex than current practice, without a clear payoff.
6 This effectively sets the prior so that trees with depth 0 occur 80% of the time and trees with depth 1 about 20% of the time, limiting the growth of any tree. This restricts the complexity of prediction by limiting the extent to which interactions and nonlinear effects are modeled, unless the data strongly suggest these should occur (Chipman et al., 2010).
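As a concrete sketch of the arithmetic behind this prior, the Chipman et al. (2010) formulation makes a node at depth d nonterminal with probability α(1 + d)^(−β). The parameter values below (α = 0.2, β = 4) are illustrative assumptions chosen to reproduce the roughly 80%/20% split described above; they are not necessarily the values used in the analysis.

```python
# Sketch of the Chipman et al. (2010) BART tree-depth prior:
# a node at depth d splits with probability alpha * (1 + d) ** (-beta).
# alpha = 0.2 and beta = 4 are illustrative assumptions, chosen so a
# depth-0 tree (a single terminal node) has prior probability 0.8.

def p_split(d: int, alpha: float = 0.2, beta: float = 4.0) -> float:
    """Prior probability that a node at depth d is nonterminal (splits)."""
    return alpha * (1 + d) ** (-beta)

# Depth-0 tree: the root is terminal.
p_depth0 = 1 - p_split(0)

# Depth-1 tree: the root splits and both of its children are terminal.
p_depth1 = p_split(0) * (1 - p_split(1)) ** 2

print(round(p_depth0, 3), round(p_depth1, 3))  # 0.8 0.195
```

Under these assumed values, depth-0 and depth-1 trees together carry nearly all of the prior mass, which is what limits the modeled interactions and nonlinearities unless the data strongly favor deeper trees.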
7 It is common when oversampling from a stratum to down-weight estimates so that the stratum is not overrepresented in the sample average. We do not recommend that here, as the strata are intended to ensure that the PUB is broadly sampled rather than to provide a precise estimate of the PUB. Instead, we recommend using post hoc adjustments (Stuart et al., 2015).