
Replicating Experimental Impact Estimates with Nonexperimental Methods in the Context of Control-Group Noncompliance

Pages 1-11 | Received 01 Jan 2014, Accepted 01 Aug 2015, Published online: 09 Feb 2016

ABSTRACT

A growing literature on within-study comparisons (WSC) examines whether and in what context nonexperimental methods can successfully replicate the results of randomized experiments. WSCs require that the experimental and nonexperimental methods assess the same causal estimand. But experiments that include noncompliance in treatment assignment produce a divergence in the causal estimands measured by standard approaches: the experiment-based estimate of the impact of treatment (the complier average causal effect, CACE) applies only to compliers, while the nonexperimental estimate applies to all subjects receiving treatment, including always-takers. We develop a new replication approach that solves this problem by using nonexperimental methods to produce an estimate that can be compared to the experimental intent-to-treat (ITT) impact estimate rather than the CACE. We demonstrate the applicability of the method in a WSC of the effects of charter schools on student achievement. In our example, some members of the randomized control group crossed over to treatment by enrolling in the charter schools. We show that several nonexperimental methods that incorporate pretreatment measures of the outcome of interest can successfully replicate experimental ITT impact estimates when control-group noncompliance (crossover) occurs—even when treatment effects differ for compliers and always takers.

Introduction

Nonexperimental methods are essential tools in policy research and evaluation since it is not always feasible to conduct rigorous randomized experimental studies—the “gold-standard” of causal inference—to estimate the impact of various interventions, practices, or policies of interest. However, they need to be tested against the most rigorous randomized experimental methods to provide confidence in their validity and rigor. A growing literature on within-study comparisons (WSCs) examines whether and in what context nonexperimental methods can successfully replicate the results of randomized experiments (Glazerman, Levy, and Myers Citation2003; Cook, Shadish, and Wong Citation2008; Fortson et al. Citation2012).

A valid replication exercise requires that the experimental method and the nonexperimental method are measuring the same causal estimand, that is, that they do not differ in terms of the causal mechanism or the subject population (Cook, Shadish, and Wong Citation2008). Ideally, WSCs should focus on the intent-to-treat (ITT) experimental impact estimate, because it is the most causally rigorous measure. In some cases, replicating an ITT experimental impact estimate can be done by identifying a nonexperimental comparison group that closely resembles the experimental treatment group on observable characteristics but does not receive the treatment.

In field experiments, however, study subjects who are randomly assigned to the control group do not always comply with their assignments—some subjects assigned to the control condition receive treatment. Noncompliance occurs in randomized studies of job training programs (Schochet, Burghardt, and Glazerman Citation1999), military service (Angrist, Imbens, and Rubin Citation1996), maternal smoking (Permutt and Hebel Citation1989), school choice (Abdulkadiroglu et al. Citation2011; Angrist, Pathak, and Walters Citation2011; Dobbie and Fryer Citation2011a; Tuttle et al. Citation2012), and class size (Krueger Citation1999) as well as many medical interventions (see Hollis and Campbell Citation1999). In short, crossover by control-group members (noncompliance with random assignment) is endemic to field experiments in many different policy areas.

Control-group noncompliance undermines the conventional approach to replication because a nonexperimental method necessarily excludes from its comparison group subjects who are receiving treatment. In other words, there is no way for the nonexperimental approach to identify a comparison group that fully resembles the experimental control group, because there is no way for the nonexperimental comparison group to include subjects receiving treatment. A conventional nonexperimental approach cannot be appropriately used in a WSC in the presence of control-group noncompliance when the causal construct of interest is an ITT impact estimate. Under those circumstances the experimental and nonexperimental estimands would not be measuring the same thing (see Cook, Shadish, and Wong Citation2008).

To the extent that WSCs have been conducted in the context of control-group noncompliance, researchers have bypassed this problem by seeking to replicate the complier average causal effect (CACE) that is normally identified in a two-stage least-squares (2SLS) approach that uses the lottery as an instrument for treatment (Angrist, Imbens, and Rubin Citation1996). But this is a less-than-ideal solution. As we discuss in the subsequent section, the CACE estimand represents the impact of treatment on a narrower population than the population on which standard nonexperimental methods identify treatment effects. Because they focus on a different causal estimand, CACE estimates cannot provide fair experimental benchmarks with which to compare nonexperimental estimates in a WSC. There is a clear need for a WSC approach that allows nonexperimental estimates to be compared against experimental estimates capturing the same causal estimand when control-group noncompliance occurs.

This article develops a new WSC approach that allows nonexperimental methods to be tested against rigorous experimental ITT impact estimates in the presence of substantial control-group noncompliance (crossover into treatment). The key innovation of this approach is to extract nonexperimental ITT estimators from standard nonexperimental estimators so that they measure the same causal estimand as the experimental ITT estimators. We then apply this new method to examine whether three nonexperimental approaches—(i) ordinary least squares regression, (ii) propensity score matching, and (iii) exact matching—can replicate randomized experimental ITT impact estimates of charter schools on student achievement when there are high rates of control-group noncompliance. Using the new replication approach, we find that all three methods produce results that are nearly identical to ITT experimental impact estimates. Although the data for the specific exercise come from charter-school lotteries, the approach is relevant to any WSC that involves control-group noncompliance. The approach we develop here will allow researchers to produce nonexperimental impact estimates that can be usefully compared to experimental ITT estimates in WSCs.

The next section explains why the ITT (or “reduced form”) impact estimate from a randomized experiment (rather than a CACE based on a 2SLS impact estimate) is the appropriate standard to be used in a replication exercise, particularly when control-group noncompliance exists. In the subsequent section, we derive the nonexperimental equivalent of an experimental ITT estimate when some control group members receive treatment. The following sections describe the data and the estimation methods and present results. Finally, we discuss the implications of the findings, both for future WSCs and for the use of nonexperimental methods when experimental data are unavailable.

Why Replicate ITT?

For purposes of causal inference, randomized experimental designs are the “gold standard.” Properly designed and implemented experiments create treatment and control groups that are equivalent in expectation on observed and unobserved characteristics prior to receiving an intervention (Shadish, Cook, and Campbell Citation2002; Murnane and Willett Citation2010). Thus, any statistically significant differences between the groups' outcomes can be attributed to the impact of the intervention. However, experiments are often difficult and expensive to conduct in the field, may be impractical or infeasible in some settings, and are consequently infrequent.

In field experiments, noncompliance with random assignment (i.e., crossover) is common. Some subjects randomly placed in the treatment group may decline treatment, while other subjects randomly placed in the control group may find alternate ways to receive treatment. Noncompliance routinely occurs, for example, in studies that rely on randomized admission lotteries of oversubscribed schools to implement an experimental research design in measuring the impacts of the schools (e.g., Abdulkadiroglu et al. Citation2011; Angrist, Pathak, and Walters Citation2011; Dobbie and Fryer Citation2011a; Tuttle et al. Citation2012). Often, some lottery losers find another way to be admitted to the school (and many lottery winners decline admission). In a very different context, noncompliance also arose when the draft lottery was used as an instrument to measure the effect of military service during the Vietnam War: many men chose to serve even though their randomized lottery numbers would have allowed them not to serve (Angrist, Imbens, and Rubin Citation1996). Experimental studies of behavioral health interventions likewise recognize that some control-group participants voluntarily engage in the desired treatment.

When noncompliance occurs, rigorous ITT impact estimates will understate the impact of receiving treatment, which is often of greater interest to policymakers and other stakeholders. To estimate the impact of receiving treatment, researchers typically use the random assignment as an instrument in a 2SLS analysis (Angrist, Imbens, and Rubin Citation1996). This provides an estimate of the CACE, the effect of program participation on subjects who comply with their random assignment. (In charter-school studies, researchers sometimes can minimize control-group noncompliance by monitoring admissions offers from the waitlist, so that any student offered admission prior to the start of school—including students who were not offered admission at the time the lottery was conducted—is defined as a treatment student (Gleason et al. Citation2010; Abdulkadiroglu et al. Citation2011). This definitional minimization of control-group noncompliance often cannot be used for two reasons. First, schools with oversubscribed lotteries often ultimately offer admission to all students not admitted at the time of the lottery, meaning there are no randomized students who never received an offer (i.e., no control students). Second, some schools do not have good records on whether the randomization order was followed when admitting students from the waitlist, after the lottery. Consequently, some studies (Abdulkadiroglu et al. Citation2011; Furgeson et al. Citation2012) define treatment as receiving an offer at the time of the lottery; students admitted off the waitlist to replace those admitted students who do not enroll become control-group noncompliers. These studies have rates of control-group noncompliance approaching 50% (Furgeson et al. Citation2012).)

In the presence of noncompliance, researchers seeking to validate nonexperimental methods sometimes attempt to replicate the CACE impact estimate (e.g., McKenzie, Gibson, and Stillman Citation2007; Abdulkadiroglu et al. Citation2011, as cited in Cook, Shadish, and Wong Citation2008). Replicating 2SLS CACE impact estimates is an imperfect solution to the crossover replication challenge because the impact estimates produced by 2SLS CACE and nonexperimental approaches pertain to different populations. The 2SLS approach estimates the impact of treatment receipt only for a specific subset of subjects, known as compliers: those who receive treatment if and only if they are randomly assigned to treatment. However, standard nonexperimental estimators for the impact of treatment receipt attempt to infer how an intervention affected all treated individuals—the average treatment effect on the treated (ATT). In particular, in a traditional WSC, the nonexperimental estimator takes all individuals who received treatment within the randomized treatment group as the treated sample, inferring their counterfactual outcomes from a nonexperimental comparison sample. This is problematic because, among subjects assigned to treatment, not all those who received treatment are compliers; some are always-takers, as they would have found a way to receive treatment had they been assigned to the control group. Although they cannot be distinguished from compliers, always-takers must be present in the randomized treatment group if the randomized control group contains always-takers—that is, if control-group crossover exists. (Defiers—those who would take up treatment only if assigned to the control group—are assumed not to exist and are not relevant in this context.) Therefore, when control-group noncompliance (crossover) occurs, the nonexperimental ATT estimand—the causal effect on both treated compliers and always-takers—differs from the CACE estimand, violating one of the basic principles of replication: that differences in the study design (experimental vs. quasi-experimental) should not be confounded with differences in the populations to which the studies pertain. When this principle is violated, it is unclear whether a discrepancy between the nonexperimental and experimental estimate implies that the nonexperimental estimate is biased or that the two populations have different true treatment effects.

Using the experimental ITT estimate as the benchmark for replication offers a path to eliminating this confound. However, the nonexperimental ATT estimator must be transformed so that it estimates the same ITT estimand. In the following section, we describe how to extract a nonexperimental ITT estimator from a standard nonexperimental ATT estimator.

Replicating Experimental ITT Using Nonexperimental Methods

We propose to use nonexperimental panel (NXP) methods to estimate an impact that is comparable to the gold-standard experimental ITT impact for the same subjects. The intuition behind our approach is to resolve a key conceptual discrepancy in which the experimental control group has noncomplying crossovers who receive treatment whereas the nonexperimental comparison group does not. Although we cannot identify the specific individuals in the nonexperimental comparison group who are poor substitutes for the experimental crossovers, we can nevertheless estimate their mean outcome. We then essentially replace this mean with the actual mean outcome of noncomplying control-group crossovers. Therefore, any remaining difference in outcomes between the experimental control group and nonexperimental comparison group should be due only to nonexperimental bias in estimating ITT impacts on compliers—not due to differences in the populations represented by the control and comparison groups.

The Experimental ITT Estimand

Using potential outcomes notation, let $D_i(Z_i) \in \{0, 1\}$ be a binary indicator for whether individual i would take up a specified treatment if assigned to treatment status $Z_i \in \{0, 1\}$. Moreover, let $Y_i(Z_i, D_i(Z_i))$ be the outcome that this individual would exhibit if he or she were assigned to treatment status $Z_i$ and had take-up status $D_i$. Suppressing subscripts for ease of notation, the experimental ITT estimate is an unbiased estimate of the true estimand, $\mathrm{ITT}_R$, which can be expressed as

(1)  $\mathrm{ITT}_R = E[Y(1, D(1)) - Y(0, D(0)) \mid R = 1]$,

where R is an indicator for being in the experiment. Equation (1) can be expanded into a weighted average of the impacts of treatment assignment on always-takers (denoted by A), compliers (denoted by C), and never-takers (denoted by N), who constitute proportions $p_A$, $p_C$, and $p_N$ of the experimental sample:

(2)  $\mathrm{ITT}_R = p_A E[Y(1,1) - Y(0,1) \mid R=1, A=1] + p_C E[Y(1,1) - Y(0,0) \mid R=1, C=1] + p_N E[Y(1,0) - Y(0,0) \mid R=1, N=1]$.

We can always estimate the proportions of always-takers, compliers, and never-takers in the population because (1) they do not differ by assigned treatment status (due to randomization), and (2) pA and pN are observed (pA is the proportion of the study participants assigned to the control group that takes up treatment, while pN is the proportion of the study participants assigned to the treatment group that does not take up treatment), enabling the calculation of pC = 1 − pApN.
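
As an illustration of this calculation, the compliance proportions can be computed directly from assignment and take-up indicators. The sketch below is a minimal example, assuming a pandas DataFrame with hypothetical binary columns Z (random assignment) and D (treatment take-up); the column names are illustrative, not from the study's data.

import pandas as pd

def compliance_shares(df: pd.DataFrame) -> dict:
    """Estimate p_A, p_N, and p_C from a randomized sample.

    Assumes df has binary columns:
      Z -- 1 if randomly assigned to treatment, 0 if assigned to control
      D -- 1 if the subject actually took up treatment, 0 otherwise
    """
    # p_A: share of control-group members who take up treatment (always-takers)
    p_A = df.loc[df["Z"] == 0, "D"].mean()
    # p_N: share of treatment-group members who do not take up treatment (never-takers)
    p_N = 1 - df.loc[df["Z"] == 1, "D"].mean()
    # p_C: the remaining share are compliers
    p_C = 1 - p_A - p_N
    return {"p_A": p_A, "p_N": p_N, "p_C": p_C}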

The ITT estimand shown in Equation (2) can be simplified by assuming that the exclusion restriction—the assumption that assignment affects outcomes only through take-up—holds for the subset of the analysis group who are never-takers. Even if the exclusion restriction fails with respect to always-takers (because among the always-takers, lottery winners may experience different treatment than lottery losers), it is reasonable to assume that the restriction holds with respect to never-takers: never-takers do not experience treatment of any kind, so the assumption that they are unaffected regardless of treatment assignment is plausible.

Assuming never-takers are unaffected by treatment assignment, $Y(1, 0) - Y(0, 0) = 0$ for N = 1, and Equation (2) can be simplified to

(2′)  $\mathrm{ITT}_R = p_A E[Y(1,1) - Y(0,1) \mid R=1, A=1] + p_C E[Y(1,1) - Y(0,0) \mid R=1, C=1]$.

The Nonexperimental Matching Estimator

Our central aim is to determine how a nonexperimental matching estimator can be transformed to estimate the same causal quantity as (2′). The matching estimator, $\widehat{\mathrm{ATT}}_{\mathrm{nxp}}$, is the difference between the mean outcome of the treated sample ($\hat{M}_{\mathrm{treated}}$) and the mean outcome of a matched comparison sample that, by definition, does not receive treatment ($\hat{M}_{\mathrm{comp}}$):

(3)  $\widehat{\mathrm{ATT}}_{\mathrm{nxp}} = \hat{M}_{\mathrm{treated}} - \hat{M}_{\mathrm{comp}}$.

In the context of a WSC, the treated sample consists of the members of the experimental treatment group who actually receive treatment. Every treated member from the experimental treatment group is matched with a member of a nonexperimental comparison group who, by definition, is not receiving treatment. Therefore, the nonexperimental impact estimator, without further adjustment, estimates an average treatment effect on the treated (ATT), and we consider next the steps needed to adjust the nonexperimental ATT estimator into a nonexperimental ITT estimator.

To illuminate the steps needed for the transformation, it is instructive to characterize $\hat{M}_{\mathrm{treated}}$ and $\hat{M}_{\mathrm{comp}}$ in potential outcomes notation. Due to the structure of the WSC, in which the nonexperimental ATT estimator shares the same treated sample as the experimental treatment group, $\hat{M}_{\mathrm{treated}}$ is simply a weighted average of the observed outcomes of always-takers and compliers in the experimental treatment group, with weights reflecting the relative size of the two subgroups. Those observed outcomes, in turn, are the same as the individuals' potential outcomes when assigned to treatment. Therefore, we can express $\hat{M}_{\mathrm{treated}}$ as

(4)  $\hat{M}_{\mathrm{treated}} = \frac{p_A}{p_A + p_C} E[Y(1,1) \mid R=1, Z=1, A=1] + \frac{p_C}{p_A + p_C} E[Y(1,1) \mid R=1, Z=1, C=1]$.

In expectation, both always-takers and compliers have the same distribution of outcomes in the treatment group as they do in the full experimental sample due to random assignment, so Equation (4) implies that

(5)  $E(\hat{M}_{\mathrm{treated}}) = \frac{p_A}{p_A + p_C} E[Y(1,1) \mid R=1, A=1] + \frac{p_C}{p_A + p_C} E[Y(1,1) \mid R=1, C=1]$.

The matched comparison sample consists of: (1) untreated subjects, denoted by the indicator $M_A$, who are matched to treatment-group always-takers and (2) untreated subjects, denoted by the indicator $M_C$, who are matched to treatment-group compliers. Because we cannot distinguish always-takers and compliers in the treatment group, we also cannot distinguish the two subgroups of the matched comparison sample. Nevertheless, by the matching design, we know that the shares of A and C individuals in the treated sample are identical to the shares of $M_A$ and $M_C$ individuals in the matched comparison sample. Moreover, we know that the observed outcomes of the comparison individuals reveal their potential outcomes in the absence of being assigned to treatment and in the absence of taking up treatment. Combining this information, we can express $\hat{M}_{\mathrm{comp}}$ as

(6)  $\hat{M}_{\mathrm{comp}} = \frac{p_A}{p_A + p_C} E[Y(0,0) \mid M_A=1] + \frac{p_C}{p_A + p_C} E[Y(0,0) \mid M_C=1]$.

The validity of any nonexperimental estimator depends on the assumption known as strong ignorability (Rosenbaum and Rubin Citation1983)—which, in this case, is the assumption that the distribution of potential outcomes in the matched comparison sample is the same as its distribution in the treated sample. The basic aim of the WSC is to test whether strong ignorability is satisfied. However, for the purpose of this derivation, we are interested in how the ATT estimand—the causal quantity that is captured by the nonexperimental ATT estimator when strong ignorability is satisfied—differs from the experimental ITT estimand, so that we can determine how the ATT estimator can be transformed into an ITT estimator. Therefore, we assume that strong ignorability is satisfied whenever possible.

In particular, we consider whether strong ignorability can be satisfied for each of the two subgroups in the nonexperimental analysis: (1) treated compliers and their matches; and (2) treated always-takers and their matches. For the first subgroup, there are no theoretical obstacles to satisfying strong ignorability, so we assume that the potential outcomes of $M_C$ individuals have the same distribution as those of C individuals in the experiment:

(7)  $E[Y(0,0) \mid M_C=1] = E[Y(0,0) \mid R=1, C=1]$.

However, strong ignorability is unlikely to be satisfied for the second subgroup, because always-takers from the experiment will always take up the treatment (even if they are assigned not to receive the treatment), whereas the matched comparison individuals do not take up treatment. Therefore, we cannot assume that A and $M_A$ individuals have the same potential outcomes when assigned to the control condition.

Substituting (5), (6), and (7) into (3) and rearranging terms, we obtain

(8)  $E[\widehat{\mathrm{ATT}}_{\mathrm{nxp}}] = \frac{p_A}{p_A + p_C}\{E[Y(1,1) \mid R=1, A=1] - E[Y(0,0) \mid M_A=1]\} + \frac{p_C}{p_A + p_C} E[Y(1,1) - Y(0,0) \mid R=1, C=1]$.

Because our objective is to transform the nonexperimental estimand to be as analogous as possible to the experimental ITT estimand in Equation (2′), we multiply (8) by $(p_A + p_C)$ to eliminate the denominators on the right-hand side:

(9)  $(p_A + p_C)\, E[\widehat{\mathrm{ATT}}_{\mathrm{nxp}}] = p_A\{E[Y(1,1) \mid R=1, A=1] - E[Y(0,0) \mid M_A=1]\} + p_C E[Y(1,1) - Y(0,0) \mid R=1, C=1]$.

The causal estimand in Equation (9) is close, but not identical, to the experimental ITT estimand in Equation (2′). The components of the two estimands that pertain to compliers are identical. However, as noted earlier, the counterfactual outcomes of always-takers in (2′) are not the same as the untreated outcomes of individuals who are matched to always-takers in (9). Therefore, the two estimands still differ.

Transforming the Nonexperimental Matching Estimator into an ITT Estimator

To resolve this discrepancy, we conduct another matching exercise in which treated subjects in the experimental control group (i.e., the control-group always-takers) are matched with untreated subjects from the nonexperimental comparison sample who are similar on observed characteristics. Let $\widehat{\mathrm{ATT}}_{\mathrm{ec}}$ denote the difference in outcomes between treated always-takers in the experimental control group and their matched counterparts who are not in treatment. In potential outcomes notation,

(10)  $E[\widehat{\mathrm{ATT}}_{\mathrm{ec}}] = E[Y(0,1) \mid R=1, A=1] - E[Y(0,0) \mid M_{AC}=1]$,

where $M_{AC}$ is an indicator for individuals who are matched to always-takers in the experimental control group.

We make a key assumption about $M_{AC}$ individuals: their potential outcomes are identical to the potential outcomes of comparison subjects who are matched to always-takers in the experimental treatment group ($M_A$ individuals). This assumption is reasonable because both groups of untreated comparison subjects ($M_A$ and $M_{AC}$) are being matched to always-takers from the experiment, and randomization means the always-takers in the experimental treatment and control groups should be equivalent at baseline. Under this assumption, we can rewrite (10) as

(11)  $E[\widehat{\mathrm{ATT}}_{\mathrm{ec}}] = E[Y(0,1) \mid R=1, A=1] - E[Y(0,0) \mid M_A=1]$,

or, equivalently,

(12)  $p_A E[\widehat{\mathrm{ATT}}_{\mathrm{ec}}] = p_A\{E[Y(0,1) \mid R=1, A=1] - E[Y(0,0) \mid M_A=1]\}$.

Finally, subtracting Equation (12) from Equation (9) gives

(13)  $(p_A + p_C)\, E[\widehat{\mathrm{ATT}}_{\mathrm{nxp}}] - p_A E[\widehat{\mathrm{ATT}}_{\mathrm{ec}}] = p_A E[Y(1,1) - Y(0,1) \mid R=1, A=1] + p_C E[Y(1,1) - Y(0,0) \mid R=1, C=1] = \mathrm{ITT}_R$.

In summary, to convert the nonexperimental ATT estimator into an ITT estimator, we subtract the estimated nonexperimental impacts on the noncomplying control-group crossover subjects (scaled by the proportion of all randomized control subjects who are crossovers) from the estimated impacts on the treatment subjects who receive treatment (scaled by the proportion of all of the randomized treatment subjects who receive treatment).
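
A minimal sketch of this final adjustment appears below, assuming that the two nonexperimental ATT estimates and the compliance shares have already been computed (all names are hypothetical):

def nonexperimental_itt(p_A: float, p_C: float,
                        att_nxp: float, att_ec: float) -> float:
    """Convert nonexperimental ATT estimates into an ITT estimate, following Equation (13).

    p_A     -- share of the randomized sample who are always-takers
    p_C     -- share of the randomized sample who are compliers
    att_nxp -- nonexperimental ATT estimate for treatment-group enrollees
    att_ec  -- nonexperimental ATT estimate for control-group crossovers
    """
    return (p_A + p_C) * att_nxp - p_A * att_ec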

Data

Description

We use student-level administrative data provided by state departments of education, school districts, and charter management organizations (CMOs). The data were collected for a study of charter schools operated by CMOs (authors). CMOs create and operate multiple charter schools under a common structure and philosophy, aiming to improve charter performance by leveraging well-regarded charter school models. The sample includes data from four jurisdictions. The treatment schools were 12 oversubscribed middle and high schools across seven CMOs, with lotteries held in the 2006–2007, 2007–2008, or 2009–2010 academic years. Three of the schools had oversubscribed lotteries in multiple years.

We focus on two outcomes: reading (labeled English language arts in some states) and math scores on state achievement tests 1 year after students enrolled in school following participation in a lottery (year 1). Where statewide statistics are available, we standardize test scores using state-level means and standard deviations for each grade and cohort. Otherwise, we use district-level means and standard deviations for test score standardization.
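
As an illustration of this standardization step, the sketch below standardizes scores within each grade-by-cohort cell. It is a simplified example with hypothetical column names (score, grade, cohort); in the study, the reference means and standard deviations come from state- or district-level statistics rather than from the analysis file itself.

import pandas as pd

def standardize_scores(df: pd.DataFrame, score_col: str = "score") -> pd.Series:
    """Standardize test scores within each grade-by-cohort cell.

    Here the reference means and standard deviations are computed from the
    supplied data; substituting state- or district-level statistics for each
    grade and cohort follows the same pattern.
    """
    grouped = df.groupby(["grade", "cohort"])[score_col]
    return (df[score_col] - grouped.transform("mean")) / grouped.transform("std")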

Student characteristics available for the experimental and nonexperimental analyses are: baseline reading and math test scores (including missing test score indicators), sex, race/ethnicity (African American, Hispanic, white/other), baseline free- or reduced-price lunch (FRPL) eligibility status, English language learner (ELL) status, special education status (IEP), and an indicator of whether a student attended a charter school in the baseline year. (One district did not have reliable information on students' free- or reduced-price lunch status. One district did not include information on students' ELL status.)

Diversity of Sample

The six CMOs included in the sample are quite diverse. The sample CMOs have schools in three of the four U.S. Census regions: Northeast, South, and West. Table 1 provides baseline (preenrollment) student characteristics for the CMOs. Prior to enrolling in the sample CMO middle schools, standardized student test scores range from −0.08 to 0.63 in reading and from −0.11 to 0.53 in math (where zero is the average test score in the locality). Special education rates for the CMOs (measured in the year prior to enrollment) range from 5% to 14% of their enrolled students. The percentage of students who were English language learners varies from 4% to 33%. Finally, the CMOs are diverse in terms of race/ethnicity: the percentage of African American students ranges from 11% to 81%, and the percentage of Hispanic students ranges from 17% to 75%.

Table 1. Baseline statistics for treatment charter schools.

Experimental Method

Sample and Baseline Equivalence

The experimental sample frame consists of students who applied to an oversubscribed charter (CMO) school that used a random lottery to admit students. The treatment group is composed of applicants offered admission to a participating CMO school at the time of the lottery. (The enrollment rates of students admitted at the time of the lottery are substantially higher than those of students rejected at the time of the lottery, meaning that this measure provides enough random assignment that we might plausibly expect to observe an impact. Angrist et al. Citation(2013) used a similar approach. An approach in which assignment is based on whether students were ever admitted to a CMO school was not possible for two reasons. First, many schools with oversubscribed lotteries ultimately admitted all students who were not admitted at the time of the lottery, meaning there were no randomized students who never received an offer (i.e., no control students). Second, many schools did not follow the randomization order when admitting students after the lottery.) Applicants not offered admission at the time of the lottery form the control group. Students are included in the analysis if they provided consent (obtained prior to the lottery), were in the correct application grade at the time of the lottery, were randomized in the lottery, and had baseline test scores. A student who applied to more than one of the sample schools could receive an offer to one of the schools even if the student was among the lottery losers at the other school(s). In these cases, we treat all schools sharing applicants as a single site.

To ensure the validity and power of the experimental impact estimates, included sites had to meet each of the following criteria:

1.

The overall and differential attrition rates are lower than the maximum thresholds defined by the U.S. Department of Education's What Works Clearinghouse (liberal attrition standard, Handbook version 2.1, Tuttle et al. 2011);

2.

Either we observed the lottery and could verify its validity, or, if we did not observe the lottery and consequently were unsure of the randomization validity, any difference between treatment and control average baseline test scores is less than 0.25 in effect size, and demographic differences are less than 25 percentage points (the effect size measure was Hedges' g. Relatively large baseline differences were allowed because some of the sites were small and could have moderate baseline differences even if the randomization was valid);

3.

The difference between treatment and control groups in enrollment rates in the treatment schools is at least 20 percentage points (ensuring that the lottery assignment predicts treatment well enough that it would be plausible to observe an impact).

Application of these criteria left 579 treatment and 809 control students with baseline data who were eligible for the reading analysis, and 331 treatment and 574 control students with baseline data who were eligible for the math analysis. In the reading impact analysis, we excluded 52 treatment and 74 control students because we were unable to obtain outcome data or they were in the wrong grade in the outcome year (these students without outcome test scores most likely attended a private school or an independent charter school that did not provide data to their district. The students in the wrong grade in the outcome year either repeated or skipped a grade in the outcome year), leaving a final analysis sample size of 527 treatment and 735 control students. In the math analysis, we excluded 13 treatment and 26 control students for the same reasons, leaving a final analysis sample size of 318 treatment and 548 control students. Overall attrition in the reading sample was 9%, with no differential attrition between the treatment and control conditions. Overall attrition in the math sample was 4%, with a 1 percentage point difference between attrition in the treatment and control conditions. The low attrition levels in this study are unlikely to significantly bias impact estimates, according to the What Works Clearinghouse attrition standards (Tuttle et al. Citation2011).

Consistent with minimal bias, baseline statistics of observable characteristics indicate that the final treatment and control groups are very similar for both the reading and math impact analyses, with no statistically significant differences (all p-values > 0.10). Table 2 presents baseline statistics for students included in the math and reading analysis samples.

Table 2. Baseline statistics for experimental treatment and control groups

Experimental ITT Estimation and Weights

To estimate an experimental ITT impact, we compare outcomes of applicants offered admission at the time of the lottery to those of applicants rejected at the time of the lottery, controlling for students' previous test scores and demographic characteristics. (As student admission to CMO schools was randomly determined, we could simply compare the mean outcomes of the treatment and control groups. However, to obtain more precise impact estimates, we adjust for baseline student characteristics in a regression model.) The impact estimation model is

(14)  $y_{ij} = \alpha + X_i\beta + \delta T_i + S_j\theta + (T_i \times S_j)\varphi + \epsilon_{ij}$,

where $y_{ij}$ is the reading or math test score outcome for student i in site j; $\alpha$ is the intercept; $X_i$ is a vector of student achievement and demographic characteristics (see Table 2); $T_i$ is a binary variable for treatment status, indicating whether student i was admitted at the admission lottery; $S_j$ is a vector of indicators identifying the site j that the student applied to (i.e., site—all schools sharing applicants—fixed effects); $\epsilon_{ij}$ is a random error term that reflects the influence of unobserved factors on the outcome; and $\beta$, $\delta$, $\theta$, and $\varphi$ are parameters or vectors of parameters to be estimated. Each site is defined by a common city, lottery year, and grade level, and thus the fixed effects control for these factors and also allow student achievement scores within a site to be correlated. The estimated coefficient on treatment status, $\delta$, and the interactions between site and treatment status, $\varphi$, represent the impact of admission to a CMO school at the time of the lottery. (The overall impact is estimated by $\sum_j \mathrm{weight}_j \times (\delta + \varphi S_j)$, where $\mathrm{weight}_j$ indicates the weight for site j based on the number of students in the site and their admission probabilities (see blinded for details). It is possible to estimate an overall impact without the interaction terms (i.e., $y_{ij} = \alpha + X_i\beta + \delta T_i + S_j\theta + \epsilon_{ij}$). We prefer to use the model with the interaction terms to estimate overall impacts to be consistent with the NXP approach, which must use site-specific impact estimates. (Because the NXP estimates subtract the control enrollee impacts from the treatment enrollee impacts, impacts must be estimated individually for each site and then aggregated.) Although the no-interactions model has point estimates identical to the ones we estimate using Equation (14), it has more precision. As a sensitivity check, we estimated overall impacts using both models; the conclusions from hypothesis tests were unchanged.) Students are weighted to account for admission probabilities (see [blinded] for details).
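
A sketch of how a model of this form might be estimated by weighted least squares is shown below. The column names (y, T, site, base_read, base_math, weight) are hypothetical, the covariate list is abbreviated, and the statsmodels formula interface is assumed; this is an illustration of the structure of Equation (14), not the study's estimation code.

import pandas as pd
import statsmodels.formula.api as smf

def fit_itt_model(df: pd.DataFrame):
    """Fit an ITT impact model of the form in Equation (14) by weighted least squares.

    Assumes df has hypothetical columns: y (outcome score), T (offer at the
    time of the lottery), site (site identifier), base_read and base_math
    (baseline scores), and weight (admission-probability analysis weight).
    Other covariates are omitted here for brevity.
    """
    # T * C(site) expands to site fixed effects, the treatment main effect,
    # and treatment-by-site interactions, mirroring Equation (14).
    model = smf.wls("y ~ base_read + base_math + T * C(site)",
                    data=df, weights=df["weight"])
    return model.fit()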

When students are missing baseline or prebaseline test scores, we include a missing data indicator in the model and set each missing test score to the state or district-level mean, which is zero by design. For students missing demographic variables (race/ethnicity, gender, FRPL, LEP, IEP, baseline charter status), we recode the missing values for these covariates to the mode across all students in the sample (not an English language learner, no IEP, not attending a charter school at baseline, receiving free/reduced-price lunch, female, and Hispanic). In some cases, missing data indicators could not be included because they were perfectly collinear. We do not impute outcome test scores, and students who are missing either a math or reading test score in the follow-up year are excluded from the analysis when that test score is the outcome variable.
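
The sketch below illustrates this missing-data handling under the assumptions stated above (standardized baseline scores with a reference mean of zero; hypothetical column names for the demographic variables).

import pandas as pd

def handle_missing_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Add missing-score indicators, zero-fill standardized baseline scores,
    and recode missing demographics to the sample mode."""
    out = df.copy()
    for col in ["base_read", "base_math"]:
        out[f"{col}_missing"] = out[col].isna().astype(int)  # missing-data indicator
        out[col] = out[col].fillna(0.0)  # state/district mean is zero by construction
    for col in ["race", "gender", "frpl", "lep", "iep", "base_charter"]:
        out[col] = out[col].fillna(out[col].mode().iloc[0])  # recode to modal category
    return out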

Nonexperimental Methods

Following the derivation discussed earlier, to replicate the experimental ITT impact estimate, we require two sets of nonexperimental impact estimates: the impact of charter school enrollment on treatment group enrollees and the impact of charter school enrollment on control group enrollees (noncomplying control-group crossovers). The treatment group enrollees are lottery winners who attended an experimental charter school. The control-group enrollees are lottery losers who attended an experimental charter school. For both groups of enrollees, comparison groups are identified among students who did not attend an experimental charter school.

There is reason for optimism about the validity of nonexperimental methods in studies of educational interventions for which test scores are the outcomes of interest. Such studies can often use preintervention test measures and include comparison students from the same community (Cook, Shadish, and Wong Citation2008). Because pretreatment test scores are highly correlated with post-treatment test scores, including these measures as covariates and/or matching variables might enable nonexperimental approaches to sufficiently account for selection into charter schools. Moreover, selecting comparison students from the same community makes the groups more similar on unobserved characteristics associated with geography. Indeed, three recent studies (Bifulco Citation2012; Fortson et al. Citation2012; Tuttle et al. Citation2012) that compared experimental ITT estimates to nonexperimental estimates using preintervention measures found results that suggest cautious optimism about the performance of nonexperimental approaches—but none of those studies included substantial control-group noncompliance.

Here, we examine three nonexperimental approaches: ordinary least squares (OLS), propensity score matching (PSM), and exact matching (EM). Of these approaches, OLS is the simplest and most commonly used nonexperimental impact estimation approach. However, an OLS approach may be problematic when there is limited common support. Both PSM and EM address the common-support problem by limiting the students in the comparison group to those who are most similar to the students in the treatment group at baseline. EM is advantageous in that only students with the same characteristics (along specified dimensions) as treatment students are included in the comparison group, though as the number of matching dimensions increases, it becomes increasingly difficult to match treatment students. PSM solves this dimensionality problem by matching along only one dimension: the propensity score, or the probability that a given student is in the treatment group. Unlike the experimental approach, none of the three nonexperimental approaches can account for differences in unobserved baseline characteristics between treatment and comparison students.

Ordinary Least Squares

We estimate impacts using an OLS regression model; covariates are included to improve statistical precision and to control for any remaining differences in baseline characteristics. (For the two matching approaches described below, this follows the creation of matched samples.) The regression model is identical to the model used in the ITT experimental analysis (Equation (14)). For the nonexperimental approaches, however, the treatment indicator, T, corresponds to each of the two enrollee groups. In estimating impacts, enrolled students are weighted to account for the probability of winning a lottery admission offer (replicating the experimental impact estimation). The matched comparison students are assigned the analysis weight for the enrolled students to whom they are matched. The experimental admission probability weights are rescaled so that a given site has the same weight in both the experimental and the nonexperimental approaches. This weighting ensures that any potential differences between experimental and nonexperimental estimated impacts can be attributed to the approaches themselves rather than to differences in weights. To calculate an overall impact estimate, the site-specific estimates are aggregated as shown in Equation (15):

(15)  $\sum_j \mathrm{weight}_j \left( p_j^T \times \mathrm{ATT}_j^T - p_j^C \times \mathrm{ATT}_j^C \right)$,

where $\mathrm{weight}_j$ indicates the weight for site j based on the number of students in the site and their admission probabilities (the same for both experimental and nonexperimental approaches), $p_j^T$ and $p_j^C$ indicate the percentage of treatment-group and control-group students, respectively, who enrolled at site j, and $\mathrm{ATT}_j^T$ and $\mathrm{ATT}_j^C$ indicate the estimated impacts for treatment and control enrollees at site j.
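
A minimal sketch of the aggregation in Equation (15) follows, assuming the site-level quantities are stored in a DataFrame with hypothetical columns weight, p_T, p_C, att_T, and att_C; normalizing by the sum of the weights is an assumption made explicit here in case the weights are not already scaled to sum to one.

import pandas as pd

def aggregate_nxp_itt(sites: pd.DataFrame) -> float:
    """Aggregate site-specific enrollee impacts into an overall nonexperimental ITT estimate.

    weight -- site weight (same as in the experimental analysis)
    p_T    -- share of the randomized treatment group at the site that enrolled
    p_C    -- share of the randomized control group at the site that enrolled
    att_T  -- estimated impact on treatment-group enrollees at the site
    att_C  -- estimated impact on control-group enrollees (crossovers) at the site
    """
    contrib = sites["weight"] * (sites["p_T"] * sites["att_T"]
                                 - sites["p_C"] * sites["att_C"])
    return contrib.sum() / sites["weight"].sum()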

Propensity Score Matching

The first step in the propensity score matching (PSM) approach is to estimate a propensity score for each student in the sample. To determine the appropriate propensity score model for each of the two enrollee groups, we use a forward model selection procedure for the logistic regression. Because baseline math and reading test scores are some of the strongest predictors of later outcomes, we specify that the model-building procedure begins with the model containing the two baseline test scores and corresponding missing test score indicators. At each subsequent step, the forward procedure adds a term from a specified set of potential covariates to optimize model fit to the data. The procedure can select from a list of 52 potential covariates: the 11 observed baseline covariates, 39 two-way interactions of these covariates, and 2 interactions of test scores with themselves (i.e., quadratic terms). These models fit the data well, as indicated by the Hosmer and Lemeshow goodness-of-fit test p-values (0.45 for treatment enrollees and 0.78 for control enrollees).
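
The sketch below illustrates a forward-selection loop of this general kind for a logistic propensity-score model. It is a simplified stand-in: the column name 'enrolled', the term lists, and the use of AIC as the fit criterion are assumptions for illustration and may differ from the procedure used in the study.

import statsmodels.formula.api as smf

def forward_select_propensity(df, base_terms, candidate_terms):
    """Forward selection for a logistic propensity-score model.

    Starts from the baseline-score terms and, at each step, adds the candidate
    term that most improves model fit (lowest AIC in this sketch).
    df must contain a binary column 'enrolled' indicating treatment enrollment.
    """
    selected = list(base_terms)
    remaining = list(candidate_terms)
    current_aic = smf.logit("enrolled ~ " + " + ".join(selected), data=df).fit(disp=0).aic
    improved = True
    while improved and remaining:
        improved = False
        # Refit the model with each remaining candidate term added in turn.
        aics = {t: smf.logit("enrolled ~ " + " + ".join(selected + [t]),
                             data=df).fit(disp=0).aic for t in remaining}
        best = min(aics, key=aics.get)
        if aics[best] < current_aic:
            selected.append(best)
            remaining.remove(best)
            current_aic = aics[best]
            improved = True
    return selected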

After estimating the propensity scores, we identify comparison students whose estimated propensity scores are similar to those of each treatment student (i.e., comparison students who had similar probabilities of enrolling in CMO schools). The selection uses caliper matching, whereby a given treatment student is matched to all comparison students with estimated propensity scores within a specified range (or caliper), rather than merely selecting a specified number of nearest neighbors. The sampling occurs with replacement. The matching procedure is implemented separately for each jurisdiction. To improve statistical precision, we select multiple comparison students for each treatment student.
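
A minimal sketch of caliper matching with replacement is shown below; the 'pscore' column name and the caliper width are illustrative assumptions, not values from the study, and in practice the procedure would be run separately within each jurisdiction.

import numpy as np
import pandas as pd

def caliper_match(treated: pd.DataFrame, comparison: pd.DataFrame,
                  caliper: float = 0.05) -> pd.DataFrame:
    """Match each treated student to all comparison students whose estimated
    propensity score falls within the caliper (matching with replacement).

    Both inputs are assumed to carry a 'pscore' column of estimated propensity scores.
    """
    pairs = []
    for t_idx, t_row in treated.iterrows():
        # All comparison students within the caliper are retained as matches.
        close = comparison[np.abs(comparison["pscore"] - t_row["pscore"]) <= caliper]
        for c_idx in close.index:
            pairs.append({"treated_id": t_idx, "comparison_id": c_idx})
    return pd.DataFrame(pairs)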

For the math outcome samples, the matched comparison students on average have baseline (pretreatment) math and reading test scores similar to those of the treatment students (Table 3). They also have similar distributions on all demographic covariates, with the exceptions of race/ethnicity and baseline charter school attendance. (There are only a few students who were not African American or Hispanic in the treatment group. While the race/ethnicity variable was selected by the model selection procedure, the associated coefficients had large standard errors. As a result, race/ethnicity was excluded from the propensity-score matching model, resulting in an imbalance between treatment and matched comparison students. However, race/ethnicity is a covariate in the impact estimation model. In addition, the PSM estimates are almost identical to those from the exact matching approach, which includes race/ethnicity as a matching characteristic. This suggests that our approach is robust to the exclusion of this variable from the propensity model. Similar problems occurred with baseline charter school attendance.) The results were similar for the reading outcome samples (not shown). For both samples, we were able to find at least one match for all treatment students. Table 4 shows the characteristics of students matched to the experimental control group. Again, the baseline scores are similar. The only characteristics that differ significantly between the experimental control group and its matched group relate to race/ethnicity.

Table 3. Baseline statistics for treatment-group enrollees and propensity-score matched comparison group (Math outcome).

Table 4. Baseline statistics for control-group enrollees and propensity-score matched comparison group (Math outcome)

Exact Matching

Exact matching (EM) uses comparison-group students who exactly match treatment students on a set of demographic characteristics and have very similar baseline test scores (e.g., see Woodworth and Raymond Citation2013). To be selected, the comparison students must exactly match the treatment students on the following categorical characteristics: baseline charter school attendance, sex, race/ethnicity, FRPL eligibility status, LEP status, IEP status, grade in outcome year, cohort, and jurisdiction. Exact matching on continuous characteristics—such as baseline math and reading test scores—would rarely identify matches, so we define a comparison student to be an exact match if his or her test score falls within 0.10 standard deviation of the treatment student's baseline test score in the same subject. We found matches for 95% and 97% of the treatment students in the math and reading analysis samples, respectively. Following the creation of the matched comparison group, impacts are estimated using the same regression model used in the experimental and PSM analyses. (The exact matching analysis included only the treatment students who were matched to at least one comparison-group student. Because the treatment groups used for the experimental and exact-match analyses differ, our results are conservative. However, this does not appear to significantly affect the results, as we achieved a high match rate (95% and 97% for math and reading outcomes, respectively) and the exact match method yielded estimates that are similar to the experimental estimates.)
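
The sketch below illustrates this kind of exact-plus-tolerance matching. The categorical and score column names are hypothetical placeholders, and baseline scores are assumed to be already standardized so that the 0.10 tolerance is in standard deviation units.

import pandas as pd

EXACT_COLS = ["base_charter", "sex", "race", "frpl", "lep", "iep",
              "grade_out", "cohort", "jurisdiction"]  # hypothetical column names

def exact_match(treated: pd.DataFrame, comparison: pd.DataFrame,
                score_tol: float = 0.10) -> pd.DataFrame:
    """Exact matching on categorical characteristics, plus baseline scores
    required to fall within score_tol standard deviations in both subjects."""
    # Join every treated student to every comparison student sharing the same
    # categorical profile, then keep only pairs with close baseline scores.
    merged = treated.reset_index().merge(
        comparison.reset_index(), on=EXACT_COLS, suffixes=("_t", "_c"))
    close = (
        (merged["base_read_t"] - merged["base_read_c"]).abs().le(score_tol)
        & (merged["base_math_t"] - merged["base_math_c"]).abs().le(score_tol)
    )
    return merged.loc[close, ["index_t", "index_c"]]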

Table 5. Experimental and nonexperimental average intent-to-treat (ITT) impact estimates in 12 CMO schools.

OLS Regression with Baseline Achievement Covariates

The OLS-only approach does not attempt to create a matched comparison group of students. Instead, the approach uses the entire population of non-CMO students in the local jurisdiction as comparisons, relying entirely on covariates to adjust for baseline differences between treatment students and other students. We use the same OLS regression model used to estimate impacts in all of the other approaches.

Results

The nonexperimental approaches successfully replicate the average experimental ITT impact estimates for the 12 charter schools in the replication sample (Table 5). All three nonexperimental approaches produce ITT impact estimates that are small and are not statistically significantly different from the corresponding experimental ITT estimates. Propensity-score matching produces ITT impact estimates within 0.01 of the experimental ITT estimate in math and within 0.03 of the experimental ITT estimate in reading. Neither of the differences is statistically significant, and they differ in opposite directions (i.e., they do not consistently under- or overestimate impacts). (When calculating the NXP standard errors, we make two assumptions, each of which has an opposite effect on the size of the standard errors. First, we assume zero covariance between the sampling errors of the treatment and control enrollees' impact estimates. Because the comparison students who are being matched to the two groups overlap—especially in the OLS analysis that incorporates all students in the same grade in the district—the covariance is actually positive. This results in the overestimation of the SEs and p-values. Second, we assume no covariance between the sampling errors of site-specific impact estimates. Several sites occur within the same district, and thus comparison students overlap between sites, especially in the OLS analysis, creating a positive covariance between the sampling errors of the site impact estimates. This results in the underestimation of the SEs and p-values.) The last two rows of Table 5 show that exact matching and OLS, like propensity-score matching, produce impact estimates that are very close to the experimental impact estimates.

Moreover, nonexperimental impact estimates at the site level are very similar to experimental site estimates, as the high correlations in Table 6 indicate. At the site level, PSM ITT impact estimates correlate with experimental ITT impact estimates at 0.97 in math and 0.90 in reading. Site-level results are also similar to experimental impacts for exact matching (0.96 for math and 0.90 for reading) and OLS (0.99 for math and 0.88 for reading). For each nonexperimental method, the impact estimates do not consistently under- or overestimate experimental ITT impacts (i.e., there is no evidence that they are positively or negatively biased).

Table 6. Experimental and nonexperimental impact estimates at the site level

Conclusion and Implications

In this article, we develop a new approach to examine whether nonexperimental methods can replicate the ITT experimental estimates when there is substantial control group noncompliance (crossover). Our findings suggest that nonexperimental panel approaches (propensity-score matching, exact matching, and OLS regression) that follow subjects over time and incorporate pretreatment measures of the outcome can produce impact estimates that replicate ITT experimental estimates with a high degree of accuracy. Moreover, the nonexperimental estimates are neither higher nor lower than the experimental estimates, implying no systematic bias. Experimental ITT impacts remain the gold standard due to their transparent, minimal assumptions, but the repeated validations of nonexperimental methods indicate that well-conducted nonexperimental studies with baseline measures of the outcome of interest can often achieve sufficient internal validity.

The finding that nonexperimental methods using pretreatment measures of the outcome can successfully replicate experimental results is not novel. But this is the first study to develop a method for replicating ITT experimental impact estimates using nonexperimental approaches even in the context of substantial control group noncompliance, a common feature of many field experiments.

The extent to which nonexperimental studies might produce unbiased estimates of impacts on outcomes for which baseline measures are unavailable remains an open question (e.g., Glazerman, Levy, and Myers Citation2003). Additional research on this issue is merited, because some outcomes of interest are not measured repeatedly and therefore cannot be included as baseline control variables. In the context of schooling, these include attainment outcomes such as high-school graduation and enrollment in college. Whether baseline test scores can adequately control for selection related to student attainment is as yet unknown—a question to be addressed in future studies that have admissions lottery data and a long post-lottery time series.

With the growth of detailed administrative data on program participants, validation of nonexperimental methods will continue to be important for building knowledge about the impacts of complex policies and interventions in the field. In education and many other policy arenas, randomized experiments are unlikely to be able to address many critical policy questions. In some instances, exclusive reliance on randomized experimental results could lead to mistaken conclusions. For example, charter schools that are sufficiently oversubscribed to use admissions lotteries could be an unusually effective group of charter schools (as suggested by Abdulkadiroglu et al. Citation2013). Small-scale experiments may also fail to capture systemic effects that become evident only at scale, as, for example, when the state of California used the results of a class-size reduction experiment to motivate a statewide class-size reduction policy, failing to anticipate unintended teacher labor-supply effects that occurred only as a result of large-scale implementation (see Stecher and Bohrnstedt Citation2001). This article should help future studies validate nonexperimental methods that will be important in policy research in various contexts.

In particular, the method developed here should help keep researchers from going astray in designing WSCs to test nonexperimental approaches. When some members of the control group enter treatment, the standard 2SLS approach for estimating impacts involves a different group of treated subjects than those included in a nonexperimental approach. Researchers who conduct WSCs that involve replicating 2SLS CACE impact estimates can therefore reach the wrong conclusions about the success of the WSC. (As Cook et al. [Citation2008] note, WSCs ideally involve separate research teams conducting experimental and nonexperimental impact estimates, with each team blind to the results of the other team. Ensuring that the two teams are measuring the same causal estimand, however, is essential.) The method developed in this article allows nonexperimental methods to be tested against the most rigorous ITT experimental impact estimates when control-group noncompliance (crossover) exists, without confounding different groups of treatment subjects. This method should be generally applied in WSCs that involve control-group noncompliance.

References

  • Abdulkadiroğlu, A., Angrist, J. D., Dynarski, S. M., Kane, T. J., and Pathak, P. A. (2011), “Accountability and Flexibility in Public Schools: Evidence from Boston's Charters and Pilots,” The Quarterly Journal of Economics, 126, 699–748.
  • Angrist, J. D., Cohodes, S. A., Dynarski, S. M., Pathak, P. A., and Walters, C. R. (2013), Charter Schools and the Road to College Readiness: The Effects on College Preparation, Attendance, and Choice, Boston, MA: The Boston Foundation and the NewSchools Venture Fund.
  • Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identification of Causal Effects Using Instrumental Variables,” Journal of the American Statistical Association, 91, 444–455.
  • Angrist, J. D., Pathak, P. A., and Walters, C. R. (2011), “Explaining Charter School Effectiveness,” National Bureau of Economic Research, NBER Working Paper 17332.
  • Bifulco, R. (2012), “Can Nonexperimental Estimates Replicate Estimates Based on Random Assignment in Evaluations of School Choice? A Within-Study Comparison,” Journal of Policy Analysis and Management, 31, 729–751.
  • Cook, T. D., Shadish, W. R., and Wong, V. C. (2008), “Three Conditions Under Which Experiments and Observational Studies Produce Comparable Causal Estimates: New Findings from Within-Study Comparisons,” Journal of Policy Analysis and Management, 27, 724–750.
  • Dobbie, W., and Fryer, R. G. (2011a), “Are High-Quality Schools Enough to Increase Achievement Among the Poor? Evidence from the Harlem Children's Zone,” American Economic Journal: Applied Economics, 3, 158–187.
  • ——— (2011b), “Getting Beneath the Veil of Effective Schools: Evidence from New York City,” National Bureau of Economic Research, NBER Working Paper 17632.
  • Fortson, K., Verbitsky-Savitz, N., Kopa, E., and Gleason, P. (2012), Using an Experimental Evaluation of Charter Schools to Test Whether Nonexperimental Comparison Group Methods Can Replicate Experimental Impact Estimates (NCEE Technical Methods Report 2012–4019), Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
  • Furgeson, J., Gill, B., Haimson, J., Killewald, A., McCullough, M., Nichols-Barrer, I., Teh, B., Verbitsky-Savitz, N., Bowen, M., Demeritt, A., Hill, P., and Lake, R. (2012), Charter-School Management Organizations: Diverse Strategies and Diverse Student Impacts, Cambridge, MA: Mathematica Policy Research.
  • Glazerman, S., Levy, D. M., and Myers, D. (2003), “Nonexperimental Versus Experimental Estimates of Earnings Impacts,” The Annals of the American Academy of Political and Social Science, 589, 63–93.
  • Gleason, P., Clark, M., Tuttle, C. C., and Dwoyer, E. (2010), The Evaluation of Charter School Impacts: Final Report (NCEE 2010–4029), Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
  • Hollis, S., and Campbell, F. (1999), “What is Meant by Intention to Treat Analysis? Survey of Published Randomised Controlled Trials,” BMJ, 319, 670–674.
  • Krueger, A. (1999), “Experimental Estimates of Education Production Functions,” Quarterly Journal of Economics, 114, 497–532.
  • McKenzie, D., Gibson, J., and Stillman, S. (2007), How Important Is Selection? Experimental Versus Nonexperimental Measures of Income Gains from Migration, Washington, DC: World Bank.
  • Murnane, R. J., and Willett, J. B. (2010), Methods Matter: Improving Causal Inference in Educational and Social Science Research, New York: Oxford University Press.
  • Permutt, T., and Hebel, J. R. (1989), “Simultaneous Equation Estimation in a Clinical Trial of the Effect of Smoking on Birth Weight,” Biometrics, 45, 619–622.
  • Rosenbaum, P. R., and Rubin, D. B. (1983), “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, 70, 41–55.
  • Schochet, P., Burghardt, J., and Glazerman, S. (1999), The National Job Corps Study: Short Term Impacts of Job Corps on Participants' Employment and Related Outcomes, Princeton, NJ: Mathematica Policy Research.
  • Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston, MA: Houghton Mifflin.
  • Stecher, B., and Bohrnstedt, G. (2001), “Class-Size Reduction in California: A Story of Hope, Promise, and Unintended Consequences,” Phi Delta Kappan, 82, 670–674.
  • Tuttle, C. C., Gleason, P., and Clark, M. (2012), “Using Lotteries to Evaluate Schools of Choice: Evidence from a National Study of Charter Schools,” Economics of Education Review, 31, 237–253.
  • Tuttle, C. C., Teh, B., Nichols-Barrer, I., Gill, B. P., and Gleason, P. (2011), Student Characteristics and Achievement in 22 KIPP Middle Schools, Washington, DC: Mathematica Policy Research.
  • Woodworth, J. L., and Raymond, M. E. (2013), Charter School Growth and Replication, Stanford, CA: CREDO.