Original Articles

Challenges With Propensity Score Strategies in a High-Dimensional Setting and a Potential Alternative

Pages 477-513 | Published online: 09 Jun 2011
 

Abstract

This article explores some of the challenges that arise when trying to implement propensity score strategies to answer a causal question using data with a large number of covariates. We discuss choices in propensity score estimation strategies, matching and weighting implementation strategies, balance diagnostics, and final analysis models. We demonstrate the wide range of estimates that can result from different combinations of these choices. Finally, an alternative estimation strategy is presented that may have benefits in terms of simplicity and reliability. These issues are explored in the context of an empirical example that uses data from the Early Childhood Longitudinal Study, Kindergarten Cohort to investigate the potential effect of grade retention after the 1st-grade year on subsequent cognitive outcomes.

Notes

1. For another study that looks at the effect of first-grade retention on cognitive outcomes using different data, but also conditioning on a relatively large number of covariates (72), see Wu, West, and Hughes (2008a, 2008b).

2. In theory, some sort of natural experiment might be used to answer this question. This has been done by a few other authors (notably Jacob & Lefgren, 2004) with access to different data. Such analyses, however, require data that are often quite difficult to obtain. Moreover, they tend to answer much more limited kinds of questions. For instance, the Jacob and Lefgren paper can reliably make inferences only about children close to the threshold for promotion.

3. If the focus of this article were not methodological, we would likely perform multiple imputation rather than listwise deletion to address these substantial missing data issues. We were loath to muddy the waters by introducing further methods that could themselves inspire controversy, however, so we decided instead to keep things simple.

4. Doubly robust estimators gain strength by modeling both the treatment assignment mechanism (E[Z|X]) and the response surface (E[Y|Z, X]). This class of estimators has the property that if either of these mechanisms is estimated without bias, then the overall estimator will be unbiased. Propensity score matching followed by covariance adjustment (typically in the form of a linear model) is an informal way of accomplishing the same goal.
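
To make the idea concrete, here is a minimal sketch of one well-known doubly robust construction (the AIPW estimator) in R. It is not the article's own code: the data frame dat, treatment z, outcome y, and covariates x1 and x2 are all hypothetical names, and a linear outcome model is assumed purely for illustration.

    # Model the treatment assignment mechanism, E[Z|X]
    ps_mod <- glm(z ~ x1 + x2, data = dat, family = binomial(link = "logit"))
    ps <- fitted(ps_mod)

    # Model the response surface, E[Y|Z, X]
    out_mod <- lm(y ~ z + x1 + x2, data = dat)
    y1_hat <- predict(out_mod, newdata = transform(dat, z = 1))
    y0_hat <- predict(out_mod, newdata = transform(dat, z = 0))

    # AIPW combination: unbiased if either model above is correctly specified
    z <- dat$z; y <- dat$y
    ate_dr <- mean(z * (y - y1_hat) / ps + y1_hat) -
              mean((1 - z) * (y - y0_hat) / (1 - ps) + y0_hat)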

5. Pearl (2010) provides an argument against the common advice to simply control for as many pretreatment covariates as possible. He demonstrates that if one of those covariates is in fact a true instrument, such conditioning can lead to more rather than less bias. In fact, part of the early work on this article was an attempt to find an instrument lurking among our rich set of variables; we are reasonably certain that no such instrument exists. Even if one does exist (in the sense of satisfying the untestable assumptions of being randomized and satisfying exclusion), we have at least determined that no strong instrument exists. If a researcher does have access to such an instrument, then by all means it should be used in an instrumental variables analysis (perhaps in addition to a standard analysis with the instrument excluded from the set of confounding covariates). Finally, we also argue more generally that when controlling for such a huge number of covariates, whatever conditional independence relationship might have existed would more than likely be destroyed by conditioning on some subset of the remaining pretreatment variables.

6. This statement glosses over the fact that, computationally, the algorithms used to fit these models can vary a bit between software packages. For instance, Stata's logistic regression function (as well as its glm function with a logit link) was unable to fit the specified model, which R fit without incident.
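
For reference, the model class in question can be fit in R roughly as follows; the data frame name and the use of every remaining column as a predictor are illustrative assumptions, not the article's exact specification.

    # Logistic regression propensity score model via R's glm;
    # "." expands to all columns of dat other than z
    ps_fit <- glm(z ~ ., data = dat, family = binomial(link = "logit"))
    ps_hat <- fitted(ps_fit)  # estimated propensity scores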

7. BART has also been proposed for causal inference as a strategy for directly estimating the response surface (Hill, 2011), and indeed it is used that way later in this article. In this section, however, BART is merely being used to estimate the propensity score as part of a broader propensity score strategy.
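
As a hedged illustration (not necessarily the authors' code), the BayesTree implementation of BART can be used to estimate the propensity score along these lines, assuming a covariate matrix xmat and a 0/1 treatment vector z (both hypothetical names):

    library(BayesTree)
    set.seed(123)
    # With a binary response, bart() models Pr(Z = 1 | X) on the probit scale
    bart_fit <- bart(x.train = xmat, y.train = z, verbose = FALSE)
    # Average the posterior draws of the probabilities to get the propensity score
    pscore <- colMeans(pnorm(bart_fit$yhat.train))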

8. Because our goal in this analysis is to estimate the effect of the treatment on the treated, this distinction might argue for excluding full matching from the strategies tested. On the other hand, the documentation for MatchIt, the popular propensity score matching package used here to implement this algorithm, is sufficiently vague on this point that the typical user could easily misuse it in exactly this way. Also, if treatment effects are additive, the two estimands are equal.
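
For concreteness, a minimal full matching call in MatchIt looks like the following (the formula and data frame are hypothetical); note that nothing in the call itself signals which estimand the resulting match supports:

    library(MatchIt)  # method = "full" also requires the optmatch package
    m_full <- matchit(z ~ x1 + x2, data = dat, method = "full")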

9. Martens (2007) proposes alternative summaries of this information, which could potentially be superior. We focus on the empirical QQ metrics in this article because they are available in existing software and thus are more likely to be used in current practice.
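
As a small illustration, these metrics appear in MatchIt's standard balance output; reusing the hypothetical m_full object from the sketch above:

    # Balance table; older MatchIt versions report eQQ Med / eQQ Mean / eQQ Max columns
    summary(m_full)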

10. Diamond and Sekhon (2008) have created a "genetic matching" algorithm, implemented in the Matching package in R (Sekhon, 2011), that performs this optimization. Given that the algorithm is not technically a propensity score matching approach (it matches using all the covariate data, though performance is typically improved by also including the propensity score) and that it can be quite computationally intensive (e.g., it was too demanding to run on a standard PC with these data), we did not include it among our set of typical propensity score methods attempted. It is a competitor to these methods that should be considered in smaller scale problems, and it shares many of the advantages, in terms of simplicity, of the BART algorithm espoused later in the article.
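
A sketch of genetic matching with the Matching package, reusing the hypothetical xmat, z, and pscore from the earlier sketches plus a hypothetical outcome vector y:

    library(Matching)
    X_gen <- cbind(xmat, pscore)  # covariates plus the estimated propensity score
    genout <- GenMatch(Tr = z, X = X_gen, estimand = "ATT", pop.size = 100)
    mout <- Match(Y = y, Tr = z, X = X_gen, estimand = "ATT",
                  Weight.matrix = genout)
    summary(mout)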

11. Unfortunately, this huge number of terms includes a substantial number of redundancies, because the way this option is implemented does not discriminate between categorical and continuous variables. So, for instance, a squared term is included for each variable even though, for binary variables, the square is equivalent to the original term.
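
The redundancy is easy to verify: for a 0/1 variable, squaring is the identity, so the quadratic term is perfectly collinear with the original.

    b <- c(0, 1, 1, 0)   # a binary indicator
    identical(b^2, b)    # TRUE: the squared term duplicates the original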

12. Full matching is actually geared toward estimating the average treatment effect across the entire sample, not just the average effect for the treated. Therefore, differences between estimates from this method and the others are a bit more complicated to interpret.

13. The researcher may be able to specify a semiparametric or nonparametric model at this stage; however, even this would require decisions regarding tuning parameters. The nonparametric choice presented in the next section requires a minimum of such choices (or rather, they are prespecified for the user). If the researcher is willing to invest in such a model at this stage, however, why not simply estimate the response surface directly rather than matching or weighting first?

14. When BART's hyperparameters were chosen via cross-validation, it performed the best on average. Using the default settings for these hyperparameters (the practice we espouse here, also espoused by Hill, 2011), BART performed at least as well with regard to in- and out-of-sample prediction as the strongest current competitors in the data mining literature (neural nets, gradient boosting, random forests), and noticeably better than the lasso, even though the other methods were allowed to choose their free parameters using cross-validation.

15. Probably no statistical procedure should be used without some sort of diagnostic. Appendix B describes some simple checks that can help ensure that the BART fit is appropriate.

16. We would caution, however, against interpreting any individual-level causal effects, as that is probably asking too much of the model.

17. As described in Hill (2011), this estimate is more directly applicable to the conditional average causal effect for the treated (CATT) as defined by Abadie and Imbens (2002). However, if our sample is representative of the population, it should also be unbiased for the population average treatment effect on the treated (PATT), and uncertainty estimates can be augmented to reflect our additional uncertainty in this setting.
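
For readers unfamiliar with the distinction, the two estimands can be written roughly as follows; this gloss uses standard potential-outcomes notation in the style of the article's E[Y|Z, X] and is our summary rather than a display from the article (n1 denotes the number of treated units):

    CATT = (1/n1) * sum over {i : Zi = 1} of E[Yi(1) - Yi(0) | Xi]
    PATT = E[Y(1) - Y(0) | Z = 1]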

18. Moreover, although we have focused on some of the most important choices the researcher faces when implementing propensity score strategies, we have largely skirted the issue of common support and have completely ignored the somewhat contentious issue of variance estimation.
