Statistical Theory and Related Fields Volume 2, 2018 - Issue 1

Impact of sufficient dimension reduction in nonparametric estimation of causal effect

Ying Zhang, Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

Jun Shao, Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA; School of Statistics, East China Normal University, Shanghai, People's Republic of China

Menggang Yu, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA

Lei Wang, Institute of Statistics and LPMC, Nankai University, Tianjin, People's Republic of China. Correspondence: [email protected]

Pages 89-95 | Received 10 Nov 2017, Accepted 14 Apr 2018, Published online: 18 May 2018

https://doi.org/10.1080/24754269.2018.1466100

ABSTRACT

We consider the estimation of causal treatment effect using nonparametric regression or inverse propensity weighting together with sufficient dimension reduction for searching low-dimensional covariate subsets. A special case of this problem is the estimation of a response effect with data having ignorable missing response values. An issue that is not well addressed in the literature is whether the estimation of the low-dimensional covariate subsets by sufficient dimension reduction has an impact on the asymptotic variance of the resulting causal effect estimator. With some incorrect or inaccurate statements, many researchers believe that the estimation of the low-dimensional covariate subsets by sufficient dimension reduction does not affect the asymptotic variance. We rigorously establish a result showing that this is not true unless the low-dimensional covariate subsets include some covariates superfluous for estimation, and including such covariates loses efficiency. Our theory is supplemented by some simulation results.

KEYWORDS:

Asymptotic variance
causal treatment effect
nonparametric regression or propensity weighting
$\sqrt{n}$-consistency

1. Introduction

Consider the estimation of an unknown parameter θ based on a sample of size n from a given population. Many estimators are of the form $\hat\theta(\hat\lambda)$, a function of $\hat\lambda$ that is an estimator of another parameter λ, where both $\hat\theta(\cdot)$ and $\hat\lambda$ are functions of the sample (e.g., Gong & Samaniego, 1981; Randles, 1982). Under some conditions both $\sqrt{n}\{\hat\theta(\hat\lambda)-\theta\}$ and $\sqrt{n}\{\hat\theta(\lambda)-\theta\}$ are asymptotically normal with mean zero as n increases to infinity. A question of both theoretical and practical interest is whether the estimation efficiency is affected by the fact that λ is estimated, i.e., whether $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have the same asymptotic variance. Examples with equal asymptotic variance were given in Raghavachari (1965), Adichie (1974), De Wet and Van Wyk (1979) and Randles (1982). Examples in which $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have different asymptotic variances can be found in Gong and Samaniego (1981) and Randles (1982).
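As a toy illustration of this question (not taken from the article), the following sketch compares a plug-in estimator evaluated at a known nuisance parameter with the same estimator evaluated at an estimated one. The setting is an assumption chosen for simplicity: the data are standard normal, the nuisance parameter is the population mean, and the target is the probability of falling at or below that mean. A first-order expansion suggests that n times the variance is about 1/4 with the true mean and about 1/4 − 1/(2π) with the sample mean, so the two versions are not asymptotically equivalent.

```python
import numpy as np

def plug_in_variances(n=400, reps=4000, seed=0):
    """Monte Carlo n*Var of a plug-in estimator with true vs estimated nuisance.

    Toy setting (an assumption, not from the article): X ~ N(0, 1),
    lambda = E(X) = 0, theta = P(X <= lambda) = 1/2, and
    theta_hat(l) = fraction of observations that are <= l.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((reps, n))
    # Estimator evaluated at the true nuisance value lambda = 0.
    theta_true_lambda = (x <= 0.0).mean(axis=1)
    # Same estimator evaluated at the sample mean of each replicate.
    lam_hat = x.mean(axis=1, keepdims=True)
    theta_est_lambda = (x <= lam_hat).mean(axis=1)
    return n * theta_true_lambda.var(), n * theta_est_lambda.var()

v_true, v_est = plug_in_variances()
print(f"n*Var with true lambda:      {v_true:.3f}")
print(f"n*Var with estimated lambda: {v_est:.3f}")
```

In this toy case estimating the nuisance parameter happens to reduce the variance; the point is only that the two asymptotic variances can differ, in either direction.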

In the problem of causal evaluation of treatment (Hahn, 1998, 2004; Hirano, Imbens, & Ridder, 2003; Imbens, Newey, & Ridder, 2006; Rosenbaum & Rubin, 1983; Wang & Chen, 2009; Wang, 2007), the previously described issue is not well addressed in the literature, and some incorrect or inaccurate statements are given with incorrect proofs. The problem can be described as follows. Let T be a binary treatment indicator, X be a p-dimensional vector of pretreatment covariates and $Y_k$ be the potential outcome under treatment T=k. We focus on the causal effect $\theta = E(Y_1 - Y_0)$; other causal effects such as quantile treatment effects can be similarly considered. Since only one treatment is applied, what we can observe is $Y = TY_1 + (1-T)Y_0$, not both $Y_1$ and $Y_0$. Based on a random sample from the distribution of $(Y, T, X)$, we can estimate θ under the assumption that $T \perp Y_k \mid X$ (Rosenbaum & Rubin, 1983), i.e., T and $Y_k$ are independent conditional on X, k=0,1. In the special case where $Y_0 \equiv 0$, this problem reduces to the well-known missing data problem where T=0 indicates a missing $Y_1$, $\theta = E(Y_1)$, and $T \perp Y_1 \mid X$ is simply the missing at random assumption.
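The setup can be made concrete with a small simulation. This is a sketch under assumed, hypothetical choices (a single covariate, linear outcome models, a logistic treatment model, and a true average effect of 1); the names `y0`, `y1`, `p` are illustrative. It shows that only one potential outcome is observed per unit, that the naive difference of group means is biased when treatment depends on the covariate, and that weighting by the known treatment probability removes the bias when treatment and potential outcomes are independent given the covariate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)                 # single pretreatment covariate
y0 = x + rng.standard_normal(n)            # potential outcome under T = 0
y1 = 1.0 + x + rng.standard_normal(n)      # potential outcome under T = 1
p = 1.0 / (1.0 + np.exp(-x))               # P(T = 1 | X), depends on X only
t = rng.binomial(1, p)                     # treatment indicator
y = t * y1 + (1 - t) * y0                  # observed outcome: one of y0, y1

# Naive contrast ignores that treated units tend to have larger x.
naive = y[t == 1].mean() - y[t == 0].mean()
# Inverse weighting by the (here known) treatment probability.
ipw = np.mean(t * y / p) - np.mean((1 - t) * y / (1 - p))
print(f"true effect 1.0, naive {naive:.3f}, weighted {ipw:.3f}")
```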

Estimators based on nonparametric regression or nonparametric inverse propensity weighting as described in Section 2 require almost no model assumption on but they do not perform well when the covariate dimension p is not very small. Since frequently only a few linear combinations of X are actually related with , it is attractive to first find a lower dimensional satisfying , where is a constant matrix with a small , k=0,1, and then apply nonparametric regression or inverse propensity weighting with X replaced by . If and are known, then the resulting estimator of θ is denoted as $\hat\theta(\lambda)$ with . However, λ is usually unknown and a sufficient dimension reduction method (e.g., Cook & Weisberg, 1991; Li, 1991; Xia, Tong, Li, & Zhu, 2002) is typically applied to estimate it by $\hat\lambda$. Under some conditions, both $\sqrt{n}\{\hat\theta(\hat\lambda)-\theta\}$ and $\sqrt{n}\{\hat\theta(\lambda)-\theta\}$ are asymptotically normal with mean zero and hence the relevant question is whether the estimation of λ by $\hat\lambda$ affects the asymptotic efficiency of estimating θ. There is no precise conclusion in the literature regarding this issue, but some researchers implicitly assume that using $\sqrt{n}$-consistent estimators of λ is asymptotically the same as using the true λ. For instance, Hu, Follmann, and Wang (2014) and Deng and Wang (2017) claimed that $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have the same asymptotic variance if $\hat\lambda$ is $\sqrt{n}$-consistent, which is an incorrect conclusion in general. In a very recent publication, Luo, Zhu, and Ghosh (2017) reached the same incorrect conclusion.

We rigorously establish a result showing that under the additional condition , k=0,1, . Although this condition is sufficient but not necessary for the asymptotic equivalence between $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$, we provide an example showing that without , k=0,1, $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have different asymptotic variances. Our theory is supplemented by simulation results showing that $\hat\theta(\hat\lambda)$ can be substantially less efficient than $\hat\theta(\lambda)$. However, our simulation results also show that finding a satisfying the additional condition may not be a good idea, because, although the resulting estimator is not affected by the estimation of , may include some covariates superfluous for estimation and thus have an unnecessarily high dimension and lose efficiency.

2. Theory

To study the asymptotic behaviour of $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$, we first describe three popular nonparametric estimators . We adopt the notation in Section 1. The regression method (Hu et al., 2014; Imbens et al., 2006) estimates through estimating the function by the usual kernel estimator , k=0,1, where , , , is a -dimensional kernel function and is the bandwidth. The regression estimator of θ is

The inverse propensity weighting method (Imai & Ratkovic, 2014; Imbens, 2004; Kang & Schafer, 2007) estimates the probability by the kernel estimator , k=0,1, and obtains the following estimator of θ by inverse propensity weighting,

However, this estimator often does not have good empirical performance and can be improved by the estimator combining the regression and inverse propensity weighting, the so-called augmented inverse propensity weighting estimator,

In what follows, we use to denote one of , and . Under the conditions , , k=0,1, and some regularity conditions, it has been shown that is asymptotically normal with mean 0 and variance (1) (e.g., Hu et al., 2014; Luo et al., 2017; Wang & Chen, 2009; Wang, 2007).
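The three estimators described in this section can be sketched in code as follows. This is a minimal illustration using common textbook forms (kernel regression within each treatment arm, a kernel estimate of the treatment probability, and their augmented combination); the exact normalisation in the article's displayed formulas may differ. The function names, the second-order Gaussian product kernel, the bandwidth value and the toy data are all assumptions made for the example.

```python
import numpy as np

def kernel_weights(s, h):
    """n x n matrix with entries K((S_i - S_j) / h), Gaussian product kernel."""
    diff = (s[:, None, :] - s[None, :, :]) / h
    return np.exp(-0.5 * np.sum(diff ** 2, axis=2))

def causal_effect_estimators(y, t, s1, s0, h):
    """Regression, IPW and augmented IPW estimates of E(Y_1 - Y_0).

    y: observed outcomes; t: 0/1 treatment indicators;
    s1, s0: (n, d) reduced covariates used for the T=1 and T=0 arms.
    """
    w1, w0 = kernel_weights(s1, h), kernel_weights(s0, h)
    # Kernel regression of Y on the reduced covariate within each arm.
    m1 = (w1 @ (t * y)) / (w1 @ t)
    m0 = (w0 @ ((1 - t) * y)) / (w0 @ (1 - t))
    # Kernel estimates of P(T = k | reduced covariate).
    pi1 = (w1 @ t) / w1.sum(axis=1)
    pi0 = (w0 @ (1 - t)) / w0.sum(axis=1)
    reg = np.mean(m1 - m0)
    ipw = np.mean(t * y / pi1 - (1 - t) * y / pi0)
    aipw = np.mean(m1 + t * (y - m1) / pi1 - m0 - (1 - t) * (y - m0) / pi0)
    return reg, ipw, aipw

# Toy data (hypothetical): one reduced covariate, true effect equal to 1.
rng = np.random.default_rng(0)
n = 2000
s = rng.standard_normal((n, 1))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-s[:, 0])))
y = t * 1.0 + s[:, 0] + rng.standard_normal(n)
reg, ipw, aipw = causal_effect_estimators(y, t, s, s, h=0.2)
print(f"regression {reg:.3f}, IPW {ipw:.3f}, AIPW {aipw:.3f}")
```

Here the reduced covariate is treated as known, which corresponds to using the true λ; replacing `s1` and `s0` by projections onto estimated directions gives the version with an estimated λ.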

Our main result is about the asymptotic behaviour of $\hat\theta(\hat\lambda)$ with a $\sqrt{n}$-consistent estimator of , which leads to a sufficient condition under which $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ are asymptotically equivalent. In the following, $\mathrm{vec}(B)$ denotes a column vector whose components are elements of a matrix B and $o_p(1)$ denotes a term converging to 0 in probability. A proof of the following theorem is given in the appendix.

Theorem 2.1

Assume and the regularity conditions in the appendix.

If is a -consistent estimator of then, (2) where (3)
If (4) for some functions with then, is asymptotically normal with mean 0 and variance (5) where is given by Equation (1) and

Condition (4) is satisfied for some sufficient dimension reduction methods (Hsing & Carroll, 1992; Zhu & Ng, 1995).

Theorem 2.1 shows that the asymptotic difference between $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ is related to the magnitude of , k=0,1, through Equation (2). From the sufficient dimension reduction literature, is at most of the order . Hence, a sufficient condition under which $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ are asymptotically equivalent is that both and in Equation (3) are equal to 0. By formula (3), the only realistic situation where is when , which is implied by .

Hence, if we choose satisfying both and , then and are asymptotically equivalent, provided that Equation (4) holds. However, we may pay a price for doing so, because satisfying the additional requirement may include some covariates superfluous for estimation and, thus, have an unnecessarily high dimension and lose efficiency. Let with satisfying , and let with satisfying both and . Although and are asymptotically equivalent, their asymptotic variance is given by Equation (1) with replaced by , which is larger than when is larger than dim due to the extra requirement of . Furthermore, even when is less efficient than due to the estimation of λ, it may still be more efficient than . The following is an example for illustration.

Example 2.2

Let , , , where and are independent and uniform on the interval , and is independent of X. Let . Then satisfying , but not . Let . Then both and hold. However, dim dim, and contains that is not useful for estimating . In this case, , smaller than The vector defined by Equation (3) is a two-dimensional vector whose first component is 0 and second component , so the asymptotic variance of given by Equation (5) differs from . Calculating the asymptotic variance in Equation (5) requires further information about .

In the next section, we provide some numerical results for the variance in Equation (5).

3. Simulation

To support our theory we investigate the finite-sample performances of and with two choices of discussed in Section 2, i.e., satisfies with smallest possible dim, and satisfies and with smallest possible dim. We consider estimators using the true and as well as estimated and by applying the sliced inverse regression method (Li, 1991). According to Theorem 2.1, estimators using the true and estimated have different asymptotic variances, whereas estimators using the true and estimated are asymptotically equivalent. We try two sample sizes, n=200 and n=1000. As in Hu et al. (2014), the nonparametric kernel estimators and are computed using the rth order Gaussian product kernel with standardised covariates. The bandwidth we used here is (Chen, Wan, & Zhou, 2015; Hu et al., 2014).
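For readers unfamiliar with sliced inverse regression, the following is a minimal sketch of the basic algorithm of Li (1991) in its usual textbook form: standardise the covariates, slice the sample by the ordered response, and take leading eigenvectors of the weighted covariance of the slice means. It is not the authors' simulation code; the function name, the number of slices and the toy data are assumptions, and in the causal setting one would presumably apply it within each treatment arm.

```python
import numpy as np

def sliced_inverse_regression(x, y, n_slices=10, n_directions=1):
    """Estimate dimension reduction directions by sliced inverse regression.

    x: (n, p) covariates with p >= 2; y: (n,) responses.
    Returns a (p, n_directions) matrix with unit-norm columns.
    """
    n, p = x.shape
    # Standardise: z = Sigma^{-1/2} (x - mean).
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - x.mean(axis=0)) @ inv_sqrt
    # Slice observations by the order of y and average z within slices.
    m = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        zbar = z[idx].mean(axis=0)
        m += (len(idx) / n) * np.outer(zbar, zbar)
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, vecs = np.linalg.eigh(m)
    top = vecs[:, ::-1][:, :n_directions]
    # Map back to the original covariate scale and normalise.
    b = inv_sqrt @ top
    return b / np.linalg.norm(b, axis=0)

# Toy check (hypothetical data): y depends on x only through x @ beta.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = x @ beta + 0.5 * rng.standard_normal(2000)
b_hat = sliced_inverse_regression(x, y)
print("alignment with true direction:", abs(b_hat[:, 0] @ beta))
```

The sign of an estimated direction is arbitrary, which is why the check uses the absolute inner product.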

We consider the following three simulation models.

with independent components, , , where 's are independent and are independent of X, and . The outcome models are linear in X and the log-conditional treatment odds is linear in X. Under this model, , and dim.
with independent components, , , where 's are independent and are independent of X, and . The outcome models are linear in X and the log-conditional treatment odds is linear in X. Under this model, but .
with independent components, , , where 's are independent and are independent of X, and . The outcome models are nonlinear in X and the log-conditional treatment odds is also nonlinear in X. Under this model, but .

Table 1 shows the simulated relative bias and standard deviation in each scenario based on 1000 simulation runs. It can be seen that the simulation results are in agreement with the asymptotic result (Theorem 2.1), especially when n=1000, i.e., the SD of and are very close while the SD of and may be quite different. Although may be worse than , it may be better than ; hence, it is not a good idea to search for a that does not affect the asymptotic variance. Regarding the two different estimation methods, and have very comparable performances.
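The two summary measures reported for the simulation can be computed from a vector of replicated estimates as in the sketch below. The definition of relative bias used here, (mean estimate − truth) / truth, is the common convention and is an assumption, since the article's exact formula is not shown in this text; the function name is illustrative.

```python
import numpy as np

def relative_bias_and_sd(estimates, truth):
    """Relative bias and standard deviation over simulation replicates.

    Assumes relative bias = (mean estimate - truth) / truth with truth != 0,
    and uses the sample standard deviation (divisor: replicates - 1).
    """
    est = np.asarray(estimates, dtype=float)
    rb = (est.mean() - truth) / truth
    sd = est.std(ddof=1)
    return rb, sd

rb, sd = relative_bias_and_sd([0.9, 1.1, 1.0, 1.2], truth=1.0)
print(f"RB = {rb:.3f}, SD = {sd:.3f}")
```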

Table 1. Relative bias (RB) and standard deviation (SD) of based on 1000 simulations.

Acknowledgments

We are grateful to the Editor, the Associate Editor and two anonymous referees for their insightful comments and suggestions on this article, which have led to significant improvements. The views in this publication are solely the responsibility of the authors and do not necessarily represent the views of the PCORI, its Board of Governors or Methodology Committee.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes on contributors

Ying Zhang is a PhD candidate, Department of Statistics, University of Wisconsin-Madison.

Dr Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

Dr Menggang Yu holds a PhD in biostatistics from the University of Michigan. He is now a professor of biostatistics at the University of Wisconsin-Madison. Besides developing statistical methodology related to cancer research and clinical trials, Dr Yu is also very interested in health services research.

Dr Lei Wang holds a PhD in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.

Additional information

Funding

This research was partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1409-21219). This research was also supported by the National Natural Science Foundation of China (11501208), Fundamental Research Funds for the Central Universities, National Social Science Foundation (13BTJ009), the Chinese 111 Project grant (B14019) and the U.S. National Science Foundation (DMS-1305474 and DMS-1612873).

References

Adichie, J. N. (1974). Rank score comparison of several regression parameters. The Annals of Statistics, 2, 396–402. doi: 10.1214/aos/1176342676
Chen, X., Wan, A. T. K., & Zhou, Y. (2015). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110, 723–741. doi: 10.1080/01621459.2014.928219
Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 328–332.
Deng, J., & Wang, Q. (2017). Dimension reduction estimation for probability density with data missing at random when covariables are present. Journal of Statistical Planning and Inference, 181, 11–29. doi: 10.1016/j.jspi.2016.08.007
De Wet, T., & Van Wyk, J. W. J. (1979). Efficiency and robustness of Hogg's adaptive trimmed means. Communications in Statistics: Theory and Methods, 8, 117–128. doi: 10.1080/03610927908827743
Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9, 861–869. doi: 10.1214/aos/1176345526
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331. doi: 10.2307/2998560
Hahn, J. (2004). Functional restriction and efficiency in causal inference. Review of Economics and Statistics, 86, 73–76. doi: 10.1162/003465304323023688
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71, 1161–1189. doi: 10.1111/1468-0262.00442
Hsing, T., & Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20, 1040–1061. doi: 10.1214/aos/1176348669
Hu, Z., Follmann, D. A., & Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika, 101, 613–624. doi: 10.1093/biomet/asu022
Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 243–263. doi: 10.1111/rssb.12027
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86, 4–29. doi: 10.1162/003465304323023651
Imbens, G. W., Newey, W., & Ridder, G. (2006). Mean-squared-error calculations for average treatment effects (Working Paper).
Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523–539. doi: 10.1214/07-STS227
Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327. doi: 10.1080/01621459.1991.10475035
Luo, W., Zhu, Y., & Ghosh, D. (2017). On estimating regression-based causal effects using sufficient dimension reduction. Biometrika, 104, 51–65.
Raghavachari, M. (1965). On the efficiency of the normal scores test relative to the F-test. The Annals of Mathematical Statistics, 36, 1306–1307. doi: 10.1214/aoms/1177700005
Randles, R. H. (1982). On the asymptotic normality of statistics with estimated parameters. The Annals of Statistics, 10, 462–474. doi: 10.1214/aos/1176345787
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
Wang, D., & Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics, 37, 490–517. doi: 10.1214/07-AOS585
Wang, Q. (2007). M-estimators based on inverse probability weighted estimating equations with response missing at random. Communications in Statistics: Theory and Methods, 36, 1091–1103. doi: 10.1080/03610920601076917
Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410. doi: 10.1111/1467-9868.03411
Zhu, L. X., & Ng, K. W. (1995). Asymptotics of sliced inverse regression. Statistica Sinica, 5, 727–736.

Appendix

The following regularity conditions are assumed for Theorem 2.1, where conditions (1)–(4) are the same as conditions C1–C5 in Wang and Chen (2009) with X replaced by , k=0,1:

is bounded away from 0 and 1.
The propensity function , the -density function and all have bounded partial derivatives with respect to up to order with , where is the order of the kernel .
.
The smoothing bandwidth satisfies and as .
The kernel is bounded up to the second-order derivative.
The smoothing bandwidth satisfies as .

Proof of Theorem 2.1

For purposes of simplicity, we focus only on the proof for regression type estimator with and show the difference of first term in regression estimator between using true and estimated . Denote , , , , as h, , , , B, respectively, and define in the following proof. Let ; it can be verified that where Using a Taylor expansion around for and plugging in , we have Denote and . Simple calculation entails that and Therefore, where It can be seen that the first term in will be equal to 0 if , while the second term in will be equal to 0 if . Thus, it leads to when both and hold.

Let and . We have Thus, , which leads to For , we also use a Taylor expansion for : We then decompose by conditioning on indexes i, j, that is, we define

Since using a similar decomposition method as , we can also show and . Theorem 2.1 is proved.
