Original Articles

Challenges With Propensity Score Strategies in a High-Dimensional Setting and a Potential Alternative

Pages 477-513 | Published online: 09 Jun 2011
 

Abstract

This article explores some of the challenges that arise when trying to implement propensity score strategies to answer a causal question using data with a large number of covariates. We discuss choices in propensity score estimation strategies, matching and weighting implementation strategies, balance diagnostics, and final analysis models. We demonstrate the wide range of estimates that can result from different combinations of these choices. Finally, an alternative estimation strategy is presented that may have benefits in terms of simplicity and reliability. These issues are explored in the context of an empirical example that uses data from the Early Childhood Longitudinal Study, Kindergarten Cohort to investigate the potential effect of grade retention after the 1st-grade year on subsequent cognitive outcomes.

Notes

1. For another study that looks at the effect of first-grade retention on cognitive outcomes using different data, but also conditioning on a relatively large number of covariates (72), see Wu, West, and Hughes (2008a, 2008b).

2. In theory, some sort of natural experiment might be used to answer this question. This has been done by a few other authors (notably Jacob & Lefgren, 2004) with access to different data. Such analyses, however, require data that are often quite difficult to obtain. Moreover, they tend to answer much more limited kinds of questions. For instance, the Jacob and Lefgren paper can reliably make inferences only about children close to the threshold for promotion.

3. If the focus of this article were not methodological, we would likely perform multiple imputation rather than listwise deletion to address these substantial missing data issues. We were loath to muddy the waters by introducing further methods that could themselves inspire controversy, however, so we decided instead to keep things simple.

4. Doubly robust estimators gain strength by modeling both the treatment assignment mechanism (E[Z|X]) and the response surface (E[Y|Z, X]). This class of estimators has the property that if either of these mechanisms is estimated without bias, then the overall estimator will be unbiased. Propensity score matching followed by covariance adjustment (typically in the form of a linear model) is an informal way of accomplishing the same goal.
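
To make the idea concrete, here is a minimal sketch of one well-known doubly robust construction (the AIPW estimator) in R. It is not the article's own code: the data frame dat, treatment z, outcome y, and covariates x1 and x2 are all hypothetical names, and a linear outcome model is assumed purely for illustration.

    # Model the treatment assignment mechanism, E[Z|X]
    ps_mod <- glm(z ~ x1 + x2, data = dat, family = binomial(link = "logit"))
    ps <- fitted(ps_mod)

    # Model the response surface, E[Y|Z, X]
    out_mod <- lm(y ~ z + x1 + x2, data = dat)
    y1_hat <- predict(out_mod, newdata = transform(dat, z = 1))
    y0_hat <- predict(out_mod, newdata = transform(dat, z = 0))

    # AIPW combination: unbiased if either model above is correctly specified
    z <- dat$z; y <- dat$y
    ate_dr <- mean(z * (y - y1_hat) / ps + y1_hat) -
              mean((1 - z) * (y - y0_hat) / (1 - ps) + y0_hat)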

5. Pearl (2010) provides an argument against the common advice to simply control for as many pretreatment covariates as possible. He demonstrates that if one of those covariates is in fact a true instrument, such conditioning can lead to more rather than less bias. In fact, part of the early work on this article was an attempt to find an instrument lurking among our rich set of variables; we are reasonably certain that no such instrument exists. Even if one does exist (in the sense of satisfying the untestable assumptions of being randomized and satisfying exclusion), we have at least determined that no strong instrument exists. If a researcher does have access to such an instrument, then by all means it should be used in an instrumental variables analysis (perhaps in addition to a standard analysis with the instrument excluded from the set of confounding covariates). Finally, we also argue more generally that when controlling for such a huge number of covariates, whatever conditional independence relationship might have existed would more than likely be destroyed by conditioning on some subset of the remaining pretreatment variables.

6. This statement glosses over the fact that, computationally, the algorithms used to fit these models can vary a bit between software packages. For instance, Stata's logistic regression function (as well as its glm function with a logit link) was unable to fit the specified model, which R fit without incident.
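
For reference, the model class in question can be fit in R roughly as follows; the data frame name and the use of every remaining column as a predictor are illustrative assumptions, not the article's exact specification.

    # Logistic regression propensity score model via R's glm;
    # "." expands to all columns of dat other than z
    ps_fit <- glm(z ~ ., data = dat, family = binomial(link = "logit"))
    ps_hat <- fitted(ps_fit)  # estimated propensity scores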

7. BART has also been proposed for causal inference as a strategy for directly estimating the response surface (Hill, 2011), and indeed it is used that way later in this article. In this section, however, BART is merely being used to estimate the propensity score as part of a broader propensity score strategy.
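
As a hedged illustration (not necessarily the authors' code), the BayesTree implementation of BART can be used to estimate the propensity score along these lines, assuming a covariate matrix xmat and a 0/1 treatment vector z (both hypothetical names):

    library(BayesTree)
    set.seed(123)
    # With a binary response, bart() models Pr(Z = 1 | X) on the probit scale
    bart_fit <- bart(x.train = xmat, y.train = z, verbose = FALSE)
    # Average the posterior draws of the probabilities to get the propensity score
    pscore <- colMeans(pnorm(bart_fit$yhat.train))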

8. Because our goal in this analysis is to estimate the effect of the treatment on the treated, this distinction might argue for excluding full matching from the strategies tested. On the other hand, the documentation for MatchIt, the popular propensity score matching package used here to implement this algorithm, is sufficiently vague on this point that the typical user could easily misuse it in exactly this way. Also, if treatment effects are additive, the two estimands are equal.
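
For concreteness, a minimal full matching call in MatchIt looks like the following (the formula and data frame are hypothetical); note that nothing in the call itself signals which estimand the resulting match supports:

    library(MatchIt)  # method = "full" also requires the optmatch package
    m_full <- matchit(z ~ x1 + x2, data = dat, method = "full")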

9. Martens (2007) proposes alternative summaries of this information, which could potentially be superior. We focus on the empirical QQ metrics in this article because they are available in existing software and thus are more likely to be used in current practice.
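
As a small illustration, these metrics appear in MatchIt's standard balance output; reusing the hypothetical m_full object from the sketch above:

    # Balance table; older MatchIt versions report eQQ Med / eQQ Mean / eQQ Max columns
    summary(m_full)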

10. Diamond and Sekhon (2008) have created a "genetic matching" algorithm, implemented in the Matching package in R (Sekhon, 2011), that performs this optimization. Given that the algorithm is not technically a propensity score matching approach (it matches using all the covariate data, though performance is typically improved by also including the propensity score) and that it can be quite computationally intensive (e.g., it was too demanding to run on a standard PC with these data), we did not include it among our set of typical propensity score methods attempted. It is a competitor to these methods that should be considered in smaller scale problems, and it shares many of the advantages, in terms of simplicity, of the BART algorithm espoused later in the article.
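
A sketch of genetic matching with the Matching package, reusing the hypothetical xmat, z, and pscore from the earlier sketches plus a hypothetical outcome vector y:

    library(Matching)
    X_gen <- cbind(xmat, pscore)  # covariates plus the estimated propensity score
    genout <- GenMatch(Tr = z, X = X_gen, estimand = "ATT", pop.size = 100)
    mout <- Match(Y = y, Tr = z, X = X_gen, estimand = "ATT",
                  Weight.matrix = genout)
    summary(mout)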

11. Unfortunately, this huge number of terms includes a substantial number of redundancies, because the way this option is implemented does not discriminate between categorical and continuous variables. So, for instance, a squared term is included for each variable even though, for binary variables, the square is equivalent to the original term.
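
The redundancy is easy to verify: for a 0/1 variable, squaring is the identity, so the quadratic term is perfectly collinear with the original.

    b <- c(0, 1, 1, 0)   # a binary indicator
    identical(b^2, b)    # TRUE: the squared term duplicates the original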

12. Full matching is actually geared toward estimating the average treatment effect across the entire sample, not just the average effect for the treated. Therefore, differences between estimates from this method and the others are a bit more complicated to interpret.

13. The researcher may be able to specify a semiparametric or nonparametric model at this stage; however, even this would require decisions regarding tuning parameters. The nonparametric choice presented in the next section requires a minimum of such choices (or rather, they are prespecified for the user). If the researcher is willing to invest in such a model at this stage, however, why not simply estimate the response surface directly rather than matching or weighting first?

14. When BART's hyperparameters were chosen via cross-validation, it performed the best on average. Using the default settings for these hyperparameters (the practice we espouse here, also espoused by Hill, 2011), BART performed at least as well with regard to in- and out-of-sample prediction as the strongest current competitors in the data mining literature (neural nets, gradient boosting, random forests), and noticeably better than the lasso, even though the other methods were allowed to choose their free parameters using cross-validation.

15. Probably no statistical procedure should be used without some sort of diagnostic. Appendix B describes some simple checks that can help ensure that the BART fit is appropriate.

16. We would caution, however, against interpreting any individual-level causal effects, as that is probably asking too much of the model.

17. As described in Hill (2011), this estimate is more directly applicable to the conditional average causal effect for the treated (CATT) as defined by Abadie and Imbens (2002). However, if our sample is representative of the population, it should also be unbiased for the population average treatment effect on the treated (PATT), and uncertainty estimates can be augmented to reflect our additional uncertainty in this setting.
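
For readers unfamiliar with the distinction, the two estimands can be written roughly as follows; this gloss uses standard potential-outcomes notation in the style of the article's E[Y|Z, X] and is our summary rather than a display from the article (n1 denotes the number of treated units):

    CATT = (1/n1) * sum over {i : Zi = 1} of E[Yi(1) - Yi(0) | Xi]
    PATT = E[Y(1) - Y(0) | Z = 1]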

18. Moreover, although we have focused on some of the most important choices the researcher faces when implementing propensity score strategies, we have largely skirted the issue of common support and have completely ignored the somewhat contentious issue of variance estimation.
