ABSTRACT
We consider the estimation of a causal treatment effect using nonparametric regression or inverse propensity weighting together with sufficient dimension reduction for searching low-dimensional covariate subsets. A special case of this problem is the estimation of a response effect with data having ignorable missing response values. An issue that is not well addressed in the literature is whether the estimation of the low-dimensional covariate subsets by sufficient dimension reduction has an impact on the asymptotic variance of the resulting causal effect estimator. Relying on some incorrect or inaccurate statements, many researchers believe that it does not. We rigorously establish a result showing that this is not true unless the low-dimensional covariate subsets include some covariates superfluous for estimation, and that including such covariates loses efficiency. Our theory is supplemented by some simulation results.
1. Introduction
Consider the estimation of an unknown parameter θ based on a sample of size n from a given population. Many estimators are of the form $\hat\theta(\hat\lambda)$, a function of $\hat\lambda$ that is an estimator of another parameter λ, where both $\hat\theta(\cdot)$ and $\hat\lambda$ are functions of the sample (e.g., Gong & Samaniego, 1981; Randles, 1982). Under some conditions both $\sqrt{n}\{\hat\theta(\hat\lambda)-\theta\}$ and $\sqrt{n}\{\hat\theta(\lambda)-\theta\}$ are asymptotically normal with mean zero as n increases to infinity. A question of both theoretical and practical interest is whether the estimation efficiency is affected by the fact that λ is estimated, i.e., whether $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have the same asymptotic variance. Examples with equal asymptotic variance were given in Raghavachari (1965), Adichie (1974), De Wet and Van Wyk (1979) and Randles (1982). Examples in which $\hat\theta(\hat\lambda)$ and $\hat\theta(\lambda)$ have different asymptotic variances can be found in Gong and Samaniego (1981) and Randles (1982).
In the problem of causal evaluation of treatment (Hahn, 1998, 2004; Hirano, Imbens, & Ridder, 2003; Imbens, Newey, & Ridder, 2006; Rosenbaum & Rubin, 1983; Wang & Chen, 2009; Wang, 2007), the previously described issue is not well addressed in the literature, and some incorrect or inaccurate statements are given with incorrect proofs. The problem can be described as follows. Let T be a binary treatment indicator, X be a p-dimensional vector of pretreatment covariates and $Y_k$ be the potential outcome under treatment T=k. We focus on the causal effect $\theta = E(Y_1) - E(Y_0)$; other causal effects such as quantile treatment effects can be similarly considered. Since only one treatment is applied, what we can observe is $Y = TY_1 + (1-T)Y_0$, not both $Y_1$ and $Y_0$. Based on a random sample from the distribution of $(Y, T, X)$, we can estimate θ under the assumption that $T \perp Y_k \mid X$ (Rosenbaum & Rubin, 1983), i.e., T and $Y_k$ are independent conditional on X, k=0,1. In the special case where $Y_0 \equiv 0$, this problem reduces to the well-known missing data problem where T=0 indicates a missing $Y_1$, $\theta = E(Y_1)$, and $T \perp Y_1 \mid X$ is simply the missing at random assumption.
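The setup above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the design (two uniform covariates, a logistic propensity, linear outcome models with a true effect of 2) and the function name `simulate_observed` are hypothetical and not taken from the article, and the observed outcome is formed by the usual convention that the outcome of the treatment actually received is the one recorded. It shows that the difference in group means is biased when treatment depends on X, which is why adjustment for covariates is needed.

```python
import numpy as np


def simulate_observed(n, seed=0):
    """Draw potential outcomes and the observed triple (Y, T, X).

    Hypothetical design: treatment assignment depends on X only, so T is
    independent of (Y1, Y0) given X.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    propensity = 1.0 / (1.0 + np.exp(-X[:, 0]))  # P(T = 1 | X)
    T = rng.binomial(1, propensity)
    Y1 = 2.0 + X[:, 0] + rng.normal(size=n)  # outcome under treatment
    Y0 = X[:, 0] + rng.normal(size=n)        # outcome under control
    Y = T * Y1 + (1 - T) * Y0                # only one outcome is observed
    return Y, T, X, Y1, Y0


Y, T, X, Y1, Y0 = simulate_observed(20000)
oracle = np.mean(Y1 - Y0)                    # needs both potential outcomes
naive = Y[T == 1].mean() - Y[T == 0].mean()  # ignores confounding by X
print(f"oracle: {oracle:.3f}, naive difference in means: {naive:.3f}")
```

In this design the treated units tend to have larger values of the first covariate, so the naive contrast overstates the effect, while the oracle average (unavailable in practice) stays close to 2.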
Estimators based on nonparametric regression or nonparametric inverse propensity weighting as described in Section 2 require almost no model assumption on but they do not perform well when the covariate dimension p is not very small. Since frequently only a few linear combinations of X are actually related with
, it is attractive to first find a lower dimensional
satisfying
, where
is a
constant matrix with a small
, k=0,1, and then apply nonparametric regression or inverse propensity weighting with X replaced by
. If
and
are known, then the resulting estimator of θ is denoted as
with
. However, λ is usually unknown and a sufficient dimension reduction method (e.g., Cook & Weisberg, 1991; Li, 1991; Xia, Tong, Li, & Zhu, 2002) is typically applied to estimate it by
. Under some conditions, both
and
are asymptotically normal with mean zero and hence the relevant question is whether the estimation of λ by
affects the asymptotic efficiency of estimating θ. There is no precise conclusion in the literature regarding this issue, but some researchers implicitly assume that using
-consistent estimators of λ is asymptotically the same as using the true λ. For instance, Hu, Follmann, and Wang (2014) and Deng and Wang (2017) claimed that
and
have the same asymptotic variance if
is
-consistent, which is an incorrect conclusion in general. In a very recent publication, Luo, Zhu, and Ghosh (2017) reached the same incorrect conclusion.
We rigorously establish a result showing that under the additional condition , k=0,1,
. Although this condition is sufficient but not necessary for the asymptotic equivalence between
and
, we provide an example showing that without
, k=0,1,
and
have different asymptotic variances. Our theory is supplemented by simulation results showing that
can be substantially less efficient than
. However, our simulation results also show that finding a
satisfying the additional condition
may not be a good idea, because, although the resulting estimator is not affected by the estimation of
,
may include some covariates superfluous for estimation and thus have an unnecessarily high dimension, which loses efficiency.
2. Theory
To study the asymptotic behaviour of and
, we first describe three popular nonparametric estimators
. We adopt the notation in Section 1. The regression method (Hu et al., 2014; Imbens et al., 2006) estimates
through estimating the function
by the usual kernel estimator
, k=0,1, where
,
,
,
is a
-dimensional kernel function and
is the bandwidth. The regression estimator of θ is
The inverse propensity weighting method (Imai & Ratkovic, 2014; Imbens, 2004; Kang & Schafer, 2007) estimates the probability
by the kernel estimator
, k=0,1, and obtains the following estimator of θ by inverse propensity weighting,
However, this estimator often does not perform well empirically and can be improved by combining regression and inverse propensity weighting, which gives the so-called augmented inverse propensity weighting estimator,
In what follows, we use
to denote one of
,
and
. Under the conditions
,
, k=0,1, and some regularity conditions, it has been shown that
is asymptotically normal with mean 0 and variance
(1) (e.g., Hu et al., 2014; Luo et al., 2017; Wang & Chen, 2009; Wang, 2007).
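The three estimators described in this section can be sketched in a few lines. This is a minimal sketch rather than the authors' implementation: it uses the standard textbook forms of the regression, inverse propensity weighting and augmented inverse propensity weighting estimators, a second-order product Gaussian kernel, one common bandwidth, and the same reduced covariate S for both treatment arms. The function names, the simulated design and the bandwidth `n ** (-1 / 5)` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_weights(S, h):
    """n x n matrix with entries K((S_i - S_j) / h), K a product Gaussian kernel."""
    diff = (S[:, None, :] - S[None, :, :]) / h
    return np.exp(-0.5 * np.sum(diff ** 2, axis=2))


def effect_estimators(Y, T, S, h):
    """Regression, IPW and AIPW estimates of E(Y1) - E(Y0) from (Y, T, S)."""
    K = gaussian_weights(S, h)
    pi1 = (K @ T) / K.sum(axis=1)             # kernel estimate of P(T = 1 | S_i)
    pi0 = 1.0 - pi1
    m1 = (K @ (T * Y)) / (K @ T)              # kernel estimate of E(Y | T = 1, S_i)
    m0 = (K @ ((1 - T) * Y)) / (K @ (1 - T))  # kernel estimate of E(Y | T = 0, S_i)
    reg = np.mean(m1 - m0)
    ipw = np.mean(T * Y / pi1 - (1 - T) * Y / pi0)
    aipw = np.mean(m1 + T * (Y - m1) / pi1 - m0 - (1 - T) * (Y - m0) / pi0)
    return {"reg": reg, "ipw": ipw, "aipw": aipw}


# Hypothetical one-dimensional design with a true effect of 2.
rng = np.random.default_rng(0)
n = 2000
S = rng.uniform(-1.0, 1.0, size=(n, 1))
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-S[:, 0])))
Y = np.where(T == 1, 2.0 + S[:, 0], S[:, 0]) + rng.normal(size=n)
print(effect_estimators(Y, T, S, h=n ** (-1 / 5)))
```

The pairwise weight matrix needs memory of order n squared, which is acceptable for the sample sizes considered in the simulation section but would need a different implementation for much larger n.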
Our main result is about the asymptotic behaviour of with a
-consistent estimator
of
, which leads to a sufficient condition under which
and
are asymptotically equivalent. In the following,
denotes a column vector whose components are elements of a matrix B and
denotes a term converging to 0 in probability. A proof of the following theorem is given in the appendix.
Theorem 2.1
Assume
and the regularity conditions in the appendix.
If
is a
-consistent estimator of
then,
(2) where
(3)
If
(4) for some functions
with
then,
is asymptotically normal with mean 0 and variance
(5) where
is given by Equation (1) and
Condition (4) is satisfied for some sufficient dimension reduction methods (Hsing & Carroll, 1992; Zhu & Ng, 1995).
Theorem 2.1 shows that the asymptotic difference between and
is related to the magnitude of
, k=0,1, through Equation (2). From the sufficient dimension reduction literature,
is at most of the order
. Hence, a sufficient condition under which
and
are asymptotically equivalent is that both
and
in Equation (3) are equal to 0. By Equation (3), the only realistic situation where
is when
, which is implied by
.
Hence, if we choose satisfying both
and
, then
and
are asymptotically equivalent, provided that Equation (4) holds. However, we may pay a price for doing so, because
satisfying the additional requirement
may include some covariates superfluous for estimation and, thus, have an unnecessarily high dimension and lose efficiency. Let
with
satisfying
, and let
with
satisfying both
and
. Although
and
are asymptotically equivalent, their asymptotic variance is
given by Equation (1) with
replaced by
, which is larger than
when
is larger than dim
due to the extra requirement of
. Furthermore, even when
is less efficient than
due to the estimation of λ, it may still be more efficient than
. The following is an example for illustration.
Example 2.2
Let ,
,
, where
and
are independent and uniform on the interval
,
and is independent of X. Let
. Then
satisfies
, but not
. Let
. Then both
and
hold. However, dim
dim
, and
contains
that is not useful for estimating
. In this case,
, smaller than
The
vector defined by Equation (3) is a two-dimensional vector whose first component is 0 and second component
, so the asymptotic variance of
given by Equation (5) differs from
. Calculating the asymptotic variance in Equation (5) requires further information about
.
In the next section, we provide some numerical results for the variance in Equation (5).
3. Simulation
To support our theory we investigate the finite-sample performances of and
with two choices of
discussed in Section 2, i.e.,
satisfies
with smallest possible dim
, and
satisfies
and
with smallest possible dim
. We consider estimators using the true
and
as well as estimated
and
by applying the sliced inverse regression method (Li, 1991). According to Theorem 2.1, estimators using the true
and estimated
have different asymptotic variances, whereas estimators using the true
and estimated
are asymptotically equivalent. We consider two sample sizes, n=200 and n=1000. As in Hu et al. (2014), the nonparametric kernel estimators
and
are computed using the rth order Gaussian product kernel with standardised covariates. The bandwidth we use is
(Chen, Wan, & Zhou, 2015; Hu et al., 2014).
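For readers unfamiliar with sliced inverse regression, the following is a minimal sketch of the basic algorithm of Li (1991): standardise the covariates, slice the sample by the ordered response, and take the leading eigenvectors of the weighted covariance of the slice means. The function name `sir_directions`, the choice of ten slices and the simulated check are assumptions for illustration. The article does not spell out here how the method is applied to each treatment arm, so the sketch treats a single sample.

```python
import numpy as np


def sir_directions(X, y, d, n_slices=10):
    """Estimate d dimension-reduction directions by sliced inverse regression."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T  # cov^{-1/2}
    Z = Xc @ inv_sqrt                                    # standardised covariates
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):          # slices of ordered y
        zbar = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(zbar, zbar)       # weighted slice means
    w, v = np.linalg.eigh(M)
    eta = v[:, np.argsort(w)[::-1][:d]]                  # leading eigenvectors
    B = inv_sqrt @ eta                                   # back to the X scale
    return B / np.linalg.norm(B, axis=0)


# Hypothetical check: a single-index model with direction (1, 1, 0, 0, 0).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y_demo = X_demo @ beta + 0.5 * rng.normal(size=1000)
print(sir_directions(X_demo, y_demo, d=1).ravel())
```

The estimated direction is identified only up to sign, so comparisons with a true direction should use the absolute value of the inner product.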
We consider the following three simulation models.
with independent
components,
,
, where
's are independent
and are independent of X, and
. The outcome models are linear in X and the log-conditional treatment odds is linear in X. Under this model,
,
and dim
.
with independent
components,
,
, where
's are independent
and are independent of X, and
. The outcome models are linear in X and the log-conditional treatment odds is linear in X. Under this model,
but
.
with independent
components,
,
, where
's are independent
and are independent of X, and
. The outcome models are nonlinear in X and the log-conditional treatment odds is also nonlinear in X. Under this model,
but
.
Table 1 shows the simulated relative bias and standard deviation in each scenario based on 1000 simulation runs. The simulation results agree with the asymptotic result (Theorem 2.1), especially when n=1000, i.e., the SD of and
are very close while the SD of
and
may be quite different. Although
may be worse than
, it may be better than
; hence, it is not a good idea to search for a
that does not affect the asymptotic variance. Regarding the two different estimation methods,
and
have very comparable performances.
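The two summaries reported in the simulation can be computed from the Monte Carlo replicates as sketched below. The article does not state its exact definition of relative bias in this section, so the sketch assumes the common convention (mean of the estimates minus the true value, divided by the true value) and the sample standard deviation with divisor one less than the number of runs. The function name is hypothetical.

```python
import numpy as np


def relative_bias_and_sd(estimates, theta):
    """Monte Carlo relative bias and standard deviation of an estimator.

    Assumed definitions: RB = (mean(estimates) - theta) / theta and
    SD = sample standard deviation across simulation runs.
    """
    estimates = np.asarray(estimates, dtype=float)
    rb = (estimates.mean() - theta) / theta
    sd = estimates.std(ddof=1)
    return rb, sd


# Hypothetical replicates of an estimator whose target is theta = 2.
rb, sd = relative_bias_and_sd([1.9, 2.0, 2.1, 2.0], theta=2.0)
print(f"RB = {rb:.4f}, SD = {sd:.4f}")
```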
Table 1. Relative bias (RB) and standard deviation (SD) of
based on 1000 simulations.
Acknowledgments
We are grateful to the Editor, the Associate Editor and two anonymous referees for their insightful comments and suggestions on this article, which have led to significant improvements. The views in this publication are solely the responsibility of the authors and do not necessarily represent the views of the PCORI, its Board of Governors or Methodology Committee.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Ying Zhang is a PhD candidate, Department of Statistics, University of Wisconsin-Madison.
Dr Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.
Dr Menggang Yu holds a PhD in biostatistics from the University of Michigan. He is now a professor of biostatistics at the University of Wisconsin-Madison. Besides developing statistical methodology related to cancer research and clinical trials, Dr Yu is also very interested in health services research.
Dr Lei Wang holds a PhD in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.
References
- Adichie, J. N. (1974). Rank score comparison of several regression parameters. The Annals of Statistics, 2, 396–402. doi: 10.1214/aos/1176342676
- Chen, X., Wan, A. T. K., & Zhou, Y. (2015). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110, 723–741. doi: 10.1080/01621459.2014.928219
- Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 328–332.
- Deng, J., & Wang, Q. (2017). Dimension reduction estimation for probability density with data missing at random when covariables are present. Journal of Statistical Planning and Inference, 181, 11–29. doi: 10.1016/j.jspi.2016.08.007
- De Wet, T., & Van Wyk, J. W. J. (1979). Efficiency and robustness of Hogg's adaptive trimmed means. Communications in Statistics: Theory and Methods, 8, 117–128. doi: 10.1080/03610927908827743
- Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9, 861–869. doi: 10.1214/aos/1176345526
- Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331. doi: 10.2307/2998560
- Hahn, J. (2004). Functional restriction and efficiency in causal inference. Review of Economics and Statistics, 86, 73–76. doi: 10.1162/003465304323023688
- Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71, 1161–1189. doi: 10.1111/1468-0262.00442
- Hsing, T., & Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20, 1040–1061. doi: 10.1214/aos/1176348669
- Hu, Z., Follmann, D. A., & Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika, 101, 613–624. doi: 10.1093/biomet/asu022
- Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 243–263. doi: 10.1111/rssb.12027
- Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86, 4–29. doi: 10.1162/003465304323023651
- Imbens, G. W., Newey, W., & Ridder, G. (2006). Mean-squared-error calculations for average treatment effects (Working Paper).
- Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523–539. doi: 10.1214/07-STS227
- Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327. doi: 10.1080/01621459.1991.10475035
- Luo, W., Zhu, Y., & Ghosh, D. (2017). On estimating regression-based causal effects using sufficient dimension reduction. Biometrika, 104, 51–65.
- Raghavachari, M. (1965). On the efficiency of the normal scores test relative to the F-test. The Annals of Mathematical Statistics, 36, 1306–1307. doi: 10.1214/aoms/1177700005
- Randles, R. H. (1982). On the asymptotic normality of statistics with estimated parameters. The Annals of Statistics, 10, 462–474. doi: 10.1214/aos/1176345787
- Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
- Wang, D., & Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics, 37, 490–517. doi: 10.1214/07-AOS585
- Wang, Q. (2007). M-estimators based on inverse probability weighted estimating equations with response missing at random. Communications in Statistics: Theory and Methods, 36, 1091–1103. doi: 10.1080/03610920601076917
- Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410. doi: 10.1111/1467-9868.03411
- Zhu, L. X., & Ng, K. W. (1995). Asymptotics of sliced inverse regression. Statistica Sinica, 5, 727–736.
Appendix
The following regularity conditions are assumed for Theorem 2.1, where conditions (1)–(4) are the same as conditions C1–C5 in Wang and Chen (2009) with X replaced by , k=0,1:
is bounded away from 0 and 1.
The propensity function
, the
-density function
and
all have bounded partial derivatives with respect to
up to order
with
, where
is the order of the kernel
.
.
The smoothing bandwidth
satisfies
and
as
.
The kernel
is bounded up to the second-order derivative.
The smoothing bandwidth
satisfies
as
.
Proof of Theorem 2.1
For simplicity, we focus only on the proof for the regression-type estimator with
and show the difference in the first term of the regression estimator between using the true
and the estimated
. Denote
,
,
,
,
as h,
,
,
, B, respectively, and define
in the following proof. Let
; it can be verified that
where
Using a Taylor expansion around
for
and plugging in
, we have
Denote
and
. A simple calculation shows that
and
Therefore,
where
It can be seen that the first term in
will be equal to 0 if
, while the second term in
will be equal to 0 if
. Thus, it leads to
when both
and
hold.
Let and
. We have
Thus,
, which leads to
For
, we also use a Taylor expansion for
:
We then decompose
by conditioning on indices i, j, that is, we define
Since
using a decomposition similar to that for
, we can also show
and
. Theorem 2.1 is proved.