
Impact of sufficient dimension reduction in nonparametric estimation of causal effect

Ying Zhang, Jun Shao, Menggang Yu & Lei Wang
Pages 89-95 | Received 10 Nov 2017, Accepted 14 Apr 2018, Published online: 18 May 2018

ABSTRACT

We consider the estimation of causal treatment effect using nonparametric regression or inverse propensity weighting together with sufficient dimension reduction for searching low-dimensional covariate subsets. A special case of this problem is the estimation of a response effect with data having ignorable missing response values. An issue that is not well addressed in the literature is whether the estimation of the low-dimensional covariate subsets by sufficient dimension reduction has an impact on the asymptotic variance of the resulting causal effect estimator. With some incorrect or inaccurate statements, many researchers believe that the estimation of the low-dimensional covariate subsets by sufficient dimension reduction does not affect the asymptotic variance. We rigorously establish a result showing that this is not true unless the low-dimensional covariate subsets include some covariates superfluous for estimation, and including such covariates loses efficiency. Our theory is supplemented by some simulation results.

1. Introduction

Consider the estimation of an unknown parameter θ based on a sample of size n from a given population. Many estimators are of the form $\hat\theta(\hat\lambda)$, a function of $\hat\lambda$ that is an estimator of another parameter λ, where both $\hat\theta(\cdot)$ and $\hat\lambda$ are functions of the sample (e.g., Gong & Samaniego, 1981; Randles, 1982). Under some conditions both $\sqrt{n}\{\hat\theta(\hat\lambda)-\theta\}$ and $\sqrt{n}\{\hat\theta(\lambda)-\theta\}$ are asymptotically normal with mean zero as n increases to infinity. A question of both theoretical and practical interest is whether the estimation efficiency is affected by the fact that λ is estimated, i.e., whether $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have the same asymptotic variance. Examples with equal asymptotic variance were given in Raghavachari (1965), Adichie (1974), De Wet and Van Wyk (1979) and Randles (1982). Examples in which $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have different asymptotic variances can be found in Gong and Samaniego (1981) and Randles (1982).
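Both possibilities can be seen in a toy Monte Carlo experiment. The sketch below is not taken from the cited references; it assumes a standard normal population in which the mean plays the role of λ and the sample mean plays the role of its estimator. The function name `scaled_variances` and all settings are hypothetical choices made for illustration.

```python
import numpy as np

def scaled_variances(n=400, reps=4000, seed=0):
    """Monte Carlo estimates of n * Var(estimator) under a N(0, 1) population."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((reps, n))
    mu_true = 0.0                            # lambda: true nuisance parameter
    mu_hat = x.mean(axis=1, keepdims=True)   # lambda-hat: its estimator

    # Case 1: theta = Var(X), estimated by the mean squared deviation.
    var_known = ((x - mu_true) ** 2).mean(axis=1)
    var_plugin = ((x - mu_hat) ** 2).mean(axis=1)

    # Case 2: theta = P(X <= mu) = 0.5, estimated by the empirical CDF.
    cdf_known = (x <= mu_true).mean(axis=1)
    cdf_plugin = (x <= mu_hat).mean(axis=1)

    estimators = {
        "var_known": var_known,
        "var_plugin": var_plugin,
        "cdf_known": cdf_known,
        "cdf_plugin": cdf_plugin,
    }
    return {name: n * est.var() for name, est in estimators.items()}

res = scaled_variances()
for name, value in res.items():
    print(f"{name}: {value:.3f}")
```

In case 1 the derivative of the target with respect to λ vanishes at the true value, so both versions have asymptotic variance 2. In case 2 the derivative does not vanish, and a first-order expansion gives asymptotic variance 1/4 with the true mean versus 1/4 − 1/(2π) ≈ 0.091 with the sample mean, so here plugging in the estimate changes (and happens to reduce) the variance.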

In the problem of causal evaluation of treatment (Hahn, 1998, 2004; Hirano, Imbens, & Ridder, 2003; Imbens, Newey, & Ridder, 2006; Rosenbaum & Rubin, 1983; Wang & Chen, 2009; Wang, 2007), the previously described issue is not well addressed in the literature, and some incorrect or inaccurate statements are given with incorrect proofs. The problem can be described as follows. Let T be a binary treatment indicator, X be a p-dimensional vector of pretreatment covariates and $Y_k$ be the potential outcome under treatment T=k. We focus on the causal effect $\theta = E(Y_1 - Y_0)$; other causal effects such as quantile treatment effects can be similarly considered. Since only one treatment is applied, what we can observe is $Y = TY_1 + (1-T)Y_0$, not both $Y_1$ and $Y_0$. Based on a random sample from the distribution of $(Y, T, X)$, we can estimate θ under the assumption that $T \perp Y_k \mid X$ (Rosenbaum & Rubin, 1983), i.e., T and $Y_k$ are independent conditional on X, k=0,1. In the special case where $Y_0 \equiv 0$, this problem reduces to the well-known missing data problem where T=0 indicates a missing $Y_1$, $\theta = E(Y_1)$, and $T \perp Y_1 \mid X$ is simply the missing at random assumption.

Estimators based on nonparametric regression or nonparametric inverse propensity weighting as described in Section 2 require almost no model assumption, but they do not perform well when the covariate dimension p is not very small. Since frequently only a few linear combinations of X are actually relevant, it is attractive to first find a lower dimensional $S_k = B_k^\top X$ satisfying $T \perp Y_k \mid S_k$, where $B_k$ is a constant $p \times d_k$ matrix with a small $d_k$, k=0,1, and then apply nonparametric regression or inverse propensity weighting with X replaced by $S_k$. If $B_0$ and $B_1$ are known, then the resulting estimator of θ is denoted as $\hat\theta(\lambda)$ with $\lambda = (B_0, B_1)$. However, λ is usually unknown and a sufficient dimension reduction method (e.g., Cook & Weisberg, 1991; Li, 1991; Xia, Tong, Li, & Zhu, 2002) is typically applied to estimate it by $\hat\lambda$. Under some conditions, both $\sqrt{n}\{\hat\theta(\hat\lambda)-\theta\}$ and $\sqrt{n}\{\hat\theta(\lambda)-\theta\}$ are asymptotically normal with mean zero and hence the relevant question is whether the estimation of λ by $\hat\lambda$ affects the asymptotic efficiency of estimating θ. There is no precise conclusion in the literature regarding this issue, but some researchers implicitly assume that using $\sqrt{n}$-consistent estimators of λ is asymptotically the same as using the true λ. For instance, Hu, Follmann, and Wang (2014) and Deng and Wang (2017) claimed that $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have the same asymptotic variance if $\hat\lambda$ is $\sqrt{n}$-consistent, which is an incorrect conclusion in general. In a very recent publication, Luo, Zhu, and Ghosh (2017) reached the same incorrect conclusion.

We rigorously establish a result showing that, under the additional condition $Y_k \perp X \mid S_k$, k=0,1, $\sqrt{n}\{\hat\theta(\hat\lambda)-\hat\theta(\lambda)\}$ converges to 0 in probability. Although this condition is sufficient but not necessary for the asymptotic equivalence between $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$, we provide an example showing that without $Y_k \perp X \mid S_k$, k=0,1, $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have different asymptotic variances. Our theory is supplemented by simulation results showing that $\hat\theta(\hat\lambda)$ can be substantially less efficient than $\hat\theta(\lambda)$. However, our simulation results also show that finding an $S_k$ satisfying the additional condition may not be a good idea: although the resulting estimator is not affected by the estimation of λ, $S_k$ may include some covariates superfluous for estimation and thus have an unnecessarily high dimension, which loses efficiency.

2. Theory

To study the asymptotic behaviour of and , we first describe three popular nonparametric estimators . We adopt the notation in Section 1. The regression method (Hu et al., 2014; Imbens et al., 2006) estimates through estimating the function by the usual kernel estimator , k=0,1, where , , , is a -dimensional kernel function and is the bandwidth. The regression estimator of θ is

The inverse propensity weighting method (Imai & Ratkovic, 2014; Imbens, 2004; Kang & Schafer, 2007) estimates the probability by the kernel estimator , k=0,1, and obtains the following estimator of θ by inverse propensity weighting,

However, this estimator often does not have good empirical performance and can be improved by the estimator combining the regression and inverse propensity weighting, the so-called augmented inverse propensity weighting estimator,

In what follows, we use to denote one of , and . Under the conditions , , k=0,1, and some regularity conditions, it has been shown that is asymptotically normal with mean 0 and variance (1) (e.g., Hu et al., 2014; Luo et al., 2017; Wang & Chen, 2009; Wang, 2007).
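The three types of estimator can be sketched in code. The following is a minimal illustration using textbook forms of the regression, inverse propensity weighting and augmented inverse propensity weighting estimators with a Nadaraya-Watson smoother; these forms, the second-order Gaussian kernel, the fixed bandwidth `h`, the simulated data and the function names `nw_smooth` and `causal_estimators` are all assumptions made for illustration and may differ from the exact definitions used in this article.

```python
import numpy as np

def nw_smooth(s_eval, s_obs, v_obs, h):
    """Nadaraya-Watson estimate of E(v | S = s) at each row of s_eval."""
    diff = (s_eval[:, None, :] - s_obs[None, :, :]) / h
    w = np.exp(-0.5 * (diff ** 2).sum(axis=2))   # Gaussian product kernel
    return (w @ v_obs) / w.sum(axis=1)

def causal_estimators(y, t, s1, s0, h):
    """Return (regression, IPW, AIPW) estimates of theta = E(Y1 - Y0).

    s1 and s0 are the (n, d) reduced covariates used for T=1 and T=0.
    """
    treated, control = t == 1, t == 0
    m1 = nw_smooth(s1, s1[treated], y[treated], h)     # E(Y1 | S1)
    m0 = nw_smooth(s0, s0[control], y[control], h)     # E(Y0 | S0)
    p1 = nw_smooth(s1, s1, treated.astype(float), h)   # P(T=1 | S1)
    p0 = nw_smooth(s0, s0, control.astype(float), h)   # P(T=0 | S0)
    reg = np.mean(m1 - m0)
    ipw = np.mean(treated * y / p1 - control * y / p0)
    aipw = np.mean(m1 - m0
                   + treated * (y - m1) / p1
                   - control * (y - m0) / p0)
    return reg, ipw, aipw

# Hypothetical data: only the first covariate matters, true effect is 2.
rng = np.random.default_rng(1)
n = 1500
x = rng.standard_normal((n, 3))
s = x[:, :1]                                   # known one-dimensional index
t = rng.binomial(1, 1 / (1 + np.exp(-0.5 * s[:, 0])))
y1 = 2.0 + s[:, 0] + rng.standard_normal(n)
y0 = s[:, 0] + rng.standard_normal(n)
y = t * y1 + (1 - t) * y0                      # only one outcome is observed
reg, ipw, aipw = causal_estimators(y, t, s, s, h=0.3)
print(reg, ipw, aipw)
```

Here the index is treated as known, which corresponds to using the true λ; replacing `s` by an estimated index gives the plug-in versions whose asymptotic behaviour is studied below.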

Our main result is about the asymptotic behaviour of the plug-in estimator $\hat\theta(\hat\lambda)$ with a $\sqrt{n}$-consistent estimator $\hat\lambda$ of $\lambda$, which leads to a sufficient condition under which $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ are asymptotically equivalent. In the following, $\mathrm{vec}(B)$ denotes a column vector whose components are elements of a matrix B and $o_p(1)$ denotes a term converging to 0 in probability. A proof of the following theorem is given in the appendix.

Theorem 2.1

Assume and the regularity conditions in the appendix.

  1. If is a -consistent estimator of then, (2) where (3)

  2. If (4) for some functions with then, is asymptotically normal with mean 0 and variance (5) where is given by Equation (1) and

Condition (4) is satisfied for some sufficient dimension reduction methods (Hsing & Carroll, 1992; Zhu & Ng, 1995).

Theorem 2.1 shows that the asymptotic difference between and is related to the magnitude of , k=0,1, through Equation (2). From the sufficient dimension reduction literature, is at most of the order . Hence, a sufficient condition under which and are asymptotically equivalent is that both and in Equation (3) are equal to 0. By formula (3), the only realistic situation where is when , which is implied by .

Hence, if we choose satisfying both and , then and are asymptotically equivalent, provided that Equation (4) holds. However, we may pay a price for doing so, because satisfying the additional requirement may include some covariates superfluous for estimation and, thus, have an unnecessarily high dimension and lose efficiency. Let with satisfying , and let with satisfying both and . Although and are asymptotically equivalent, their asymptotic variance is given by Equation (1) with replaced by , which is larger than when is larger than dim due to the extra requirement of . Furthermore, even when is less efficient than due to the estimation of λ, it may still be more efficient than . The following is an example for illustration.

Example 2.2

Let , , , where and are independent and uniform on the interval , and is independent of X. Let . Then satisfying , but not . Let . Then both and hold. However, dim dim, and contains that is not useful for estimating . In this case, , smaller than The vector defined by Equation (3) is a two-dimensional vector whose first component is 0 and second component , so the asymptotic variance of given by Equation (5) differs from . Calculating the asymptotic variance in Equation (5) requires further information about .

In the next section, we provide some numerical results for the variance in Equation (5).

3. Simulation

To support our theory, we investigate the finite-sample performances of and with two choices of discussed in Section 2, i.e., satisfies with smallest possible dim, and satisfies and with smallest possible dim. We consider estimators using the true and as well as estimated and by applying the sliced inverse regression method (Li, 1991). According to Theorem 2.1, estimators using the true and estimated have different asymptotic variances, whereas estimators using the true and estimated are asymptotically equivalent. We try two sample sizes, n=200 and n=1000. As in Hu et al. (2014), the nonparametric kernel estimators and are computed using the rth order Gaussian product kernel with standardised covariates. The bandwidth we used here is (Chen, Wan, & Zhou, 2015; Hu et al., 2014).
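For readers unfamiliar with sliced inverse regression, the following is a generic sketch of the basic procedure of Li (1991): standardise the covariates, slice the sample by the ordered response, and take the leading eigenvectors of the weighted covariance of the slice means. It is not the simulation code of this article; the function name `sir_directions`, the number of slices and the test data are hypothetical.

```python
import numpy as np

def sir_directions(x, y, n_slices=10, d=1):
    """Estimate d dimension-reduction directions by sliced inverse regression."""
    n, p = x.shape
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    # Inverse square root of the covariance via its eigen-decomposition.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - mu) @ inv_sqrt                      # standardised covariates

    # Weighted covariance of the slice means of z, slicing on ordered y.
    order = np.argsort(y)
    m = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        zbar = z[idx].mean(axis=0)
        m += (len(idx) / n) * np.outer(zbar, zbar)

    # Leading eigenvectors, mapped back to the original covariate scale.
    w, v = np.linalg.eigh(m)
    top = v[:, np.argsort(w)[::-1][:d]]
    b = inv_sqrt @ top
    return b / np.linalg.norm(b, axis=0)

# Hypothetical check: a single-index linear model with a known direction.
rng = np.random.default_rng(2)
x = rng.standard_normal((1000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = x @ beta + 0.5 * rng.standard_normal(1000)
b_hat = sir_directions(x, y)[:, 0]
cosine = abs(b_hat @ beta)                       # sign of b_hat is arbitrary
print(cosine)
```

The estimated direction is identified only up to sign and scale, which is why the check compares the absolute cosine with the true direction rather than the coefficients themselves.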

We consider the following three simulation models.

  1. with independent components, , , where 's are independent and are independent of X, and . The outcome models are linear in X and the log-conditional treatment odds is linear in X. Under this model, , and dim.

  2. with independent components, , , where 's are independent and are independent of X, and . The outcome models are linear in X and the log-conditional treatment odds is linear in X. Under this model, but .

  3. with independent components, , , where 's are independent and are independent of X, and . The outcome models are nonlinear in X and the log-conditional treatment odds is also nonlinear in X. Under this model, but .

Table 1 shows the simulated relative bias and standard deviation in each scenario based on 1000 simulation runs. It can be seen that the simulation results are in agreement with the asymptotic result (Theorem 2.1), especially when n=1000, i.e., the SD of and are very close while the SD of and may be quite different. Although may be worse than , it may be better than ; hence, it is not a good idea to search for a that does not affect the asymptotic variance. Regarding the two different estimation methods, and have very comparable performances.
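The two summary measures can be computed from the replicated estimates as in the short sketch below. It assumes that relative bias is defined as (mean estimate − θ)/θ and that SD is the sample standard deviation over simulation runs; the article does not spell out these definitions in the text above, and the helper name `relative_bias_and_sd` is hypothetical.

```python
import numpy as np

def relative_bias_and_sd(estimates, theta):
    """Relative bias and standard deviation of replicated estimates of theta."""
    estimates = np.asarray(estimates, dtype=float)
    rb = (estimates.mean() - theta) / theta   # assumed definition of RB
    sd = estimates.std(ddof=1)                # sample SD across runs
    return rb, sd

rb, sd = relative_bias_and_sd([1.0, 3.0], theta=2.0)
print(rb, sd)
```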

Table 1. Relative bias (RB) and standard deviation (SD) of based on 1000 simulations.

Acknowledgments

We are grateful to the Editor, the Associate Editor and two anonymous referees for their insightful comments and suggestions on this article, which have led to significant improvements. The views in this publication are solely the responsibility of the authors and do not necessarily represent the views of the PCORI, its Board of Governors or Methodology Committee.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes on contributors

Ying Zhang is a PhD candidate, Department of Statistics, University of Wisconsin-Madison.

Dr Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

Dr Menggang Yu holds a PhD in biostatistics from the University of Michigan. He is now a professor of biostatistics at the University of Wisconsin-Madison. Besides developing statistical methodology related to cancer research and clinical trials, Dr Yu is also very interested in health services research.

Dr Lei Wang holds a PhD in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.

Additional information

Funding

This research was partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1409-21219). This research was also supported by the National Natural Science Foundation of China (11501208), Fundamental Research Funds for the Central Universities, National Social Science Foundation (13BTJ009), the Chinese 111 Project grant (B14019) and the U.S. National Science Foundation (DMS-1305474 and DMS-1612873).

References

  • Adichie, J. N. (1974). Rank score comparison of several regression parameters. The Annals of Statistics, 2, 396–402. doi: 10.1214/aos/1176342676
  • Chen, X., Wan, A. T. K., & Zhou, Y. (2015). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110, 723–741. doi: 10.1080/01621459.2014.928219
  • Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 328–332.
  • Deng, J., & Wang, Q. (2017). Dimension reduction estimation for probability density with data missing at random when covariables are present. Journal of Statistical Planning and Inference, 181, 11–29. doi: 10.1016/j.jspi.2016.08.007
  • De Wet, T., & Van Wyk, J. W. J. (1979). Efficiency and robustness of Hogg's adaptive trimmed means. Communications in Statistics: Theory and Methods, 8, 117–128. doi: 10.1080/03610927908827743
  • Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9, 861–869. doi: 10.1214/aos/1176345526
  • Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331. doi: 10.2307/2998560
  • Hahn, J. (2004). Functional restriction and efficiency in causal inference. Review of Economics and Statistics, 86, 73–76. doi: 10.1162/003465304323023688
  • Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71, 1161–1189. doi: 10.1111/1468-0262.00442
  • Hsing, T., & Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20, 1040–1061. doi: 10.1214/aos/1176348669
  • Hu, Z., Follmann, D. A., & Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika, 101, 613–624. doi: 10.1093/biomet/asu022
  • Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 243–263. doi: 10.1111/rssb.12027
  • Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86, 4–29. doi: 10.1162/003465304323023651
  • Imbens, G. W., Newey, W., & Ridder, G. (2006). Mean-squared-error calculations for average treatment effects (Working Paper).
  • Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523–539. doi: 10.1214/07-STS227
  • Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327. doi: 10.1080/01621459.1991.10475035
  • Luo, W., Zhu, Y., & Ghosh, D. (2017). On estimating regression-based causal effects using sufficient dimension reduction. Biometrika, 104, 51–65.
  • Raghavachari, M. (1965). On the efficiency of the normal scores test relative to the F-test. The Annals of Mathematical Statistics, 36, 1306–1307. doi: 10.1214/aoms/1177700005
  • Randles, R. H. (1982). On the asymptotic normality of statistics with estimated parameters. The Annals of Statistics, 10, 462–474. doi: 10.1214/aos/1176345787
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
  • Wang, D., & Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics, 37, 490–517. doi: 10.1214/07-AOS585
  • Wang, Q. (2007). M-estimators based on inverse probability weighted estimating equations with response missing at random. Communications in Statistics: Theory and Methods, 36, 1091–1103. doi: 10.1080/03610920601076917
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410. doi: 10.1111/1467-9868.03411
  • Zhu, L. X., & Ng, K. W. (1995). Asymptotics of sliced inverse regression. Statistica Sinica, 5, 727–736.

Appendix

The following regularity conditions are assumed for Theorem 2.1, where conditions (1)–(4) are the same as conditions C1–C5 in Wang and Chen (2009) with X replaced by , k=0,1:

  1. is bounded away from 0 and 1.

  2. The propensity function , the -density function and all have bounded partial derivatives with respect to up to order with , where is the order of the kernel .

  3. .

  4. The smoothing bandwidth satisfies and as .

  5. The kernel is bounded up to the second-order derivative.

  6. The smoothing bandwidth satisfies as .

Proof of Theorem 2.1

For simplicity, we focus only on the proof for the regression-type estimator with and show the difference in the first term of the regression estimator between using the true and estimated . Denote , , , , as h, , , , B, respectively, and define in the following proof. Let ; it can be verified that where Using a Taylor expansion around for and plugging in , we have Denote and . Simple calculation entails that and Therefore, where It can be seen that the first term in will be equal to 0 if , while the second term in will be equal to 0 if . Thus, it leads to when both and hold.

Let and . We have Thus, , which leads to For , we also use a Taylor expansion for : We then decompose by conditioning on indexes i, j, that is, we define

Since using a similar decomposition method as , we can also show and . Theorem 2.1 is proved.
