
Quantile treatment effect estimation with dimension reduction

Pages 202-213 | Received 19 Aug 2019, Accepted 20 Nov 2019, Published online: 28 Nov 2019

Abstract

Quantile treatment effects can be important causal estimands in evaluations of biomedical treatments or interventions for health outcomes such as medical cost and utilisation. We consider their estimation in observational studies with many possible covariates, under the assumption that treatment and potential outcomes are independent conditional on all covariates. To obtain valid and efficient treatment effect estimators, we replace the set of all covariates by lower dimensional sets for the estimation of the quantiles of potential outcomes. These lower dimensional sets are obtained using sufficient dimension reduction tools and are outcome specific. We justify our choice from an efficiency point of view. We prove the asymptotic normality of our estimators, and our theory is complemented by simulation results and an application to data from the University of Wisconsin Health Accountable Care Organization.

1. Introduction

Causal evaluation of a treatment or intervention is commonly done by estimating the average treatment effect (ATE). However, for health outcomes such as medical cost and utilisation, the quantile treatment effect (QTE) may be more relevant and informative (Abadie, Angrist, & Imbens, Citation2002; Cattaneo, Citation2010; Chernozhukov & Hansen, Citation2005; Doksum, Citation1974; Firpo, Citation2007; Frölich & Melly, Citation2010, Citation2013; Lehman, Citation1975). As such outcomes tend to be highly skewed to the right, the ATE may not be a proper representative parameter for location. Furthermore, it is often important to learn about distributional impacts beyond the ATE, such as the effects on upper (or lower) quantiles of an outcome, which may be of direct interest to policy makers and other stakeholders of a programme.

Our study of QTE is motivated by the following investigation at the University of Wisconsin (UW) Health System. As of January 1, 2013, the UW Health System became an Accountable Care Organization (ACO), which is a network of doctors, clinics and other health care providers that share financial and medical responsibility for providing coordinated care to patients in hopes of limiting unnecessary spending. One strategy pursued by nearly all ACOs is to manage the care of ‘high-need, high-cost’ patients: those with multiple or complex conditions, often combined with behavioural health problems or socioeconomic challenges. In particular, we were asked to evaluate a particular intervention used by the UW Health System. If the intervention can reduce the upper quantiles of health care utilisation quantified by medical cost, then the next step is to significantly enhance the nurse team so that the intervention can be extended to a wider population. In essence, we need to estimate QTEs, particularly at an upper level.

To define the QTE, we begin with some notation. Let T be a binary treatment indicator, X be a p-dimensional vector of pretreatment covariates, and Y_0 and Y_1 be the potential outcomes under treatments T = 0 and T = 1, respectively. Since only one treatment is applied, either Y_1 or Y_0 is observed, but not both; i.e. what we observe is Z = TY_1 + (1 − T)Y_0. For a fixed τ ∈ (0, 1), the 100τ% QTE is defined as θ = q_{1,τ} − q_{0,τ}, where q_{k,τ} is the τth quantile of Y_k, k = 0, 1; e.g. τ = 0.5, 0.25, and 0.75 give the difference of medians, lower quartiles, and upper quartiles, respectively. We focus on the estimation of θ based on a random sample {(Z_i, X_i, T_i): i = 1, …, n} of n observations from (Z, X, T).
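As a concrete illustration of the estimand (our sketch, not from the paper; the data-generating model below is hypothetical), suppose we could observe both potential outcomes; the QTE at any level τ is then just a difference of two quantiles:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes: Y1 is Y0 shifted up by 2, so the
# QTE theta = q_{1,tau} - q_{0,tau} equals 2 at every level tau.
y0 = rng.normal(loc=0.0, scale=1.0, size=n)
y1 = y0 + 2.0

for tau in (0.25, 0.50, 0.75):
    theta = np.quantile(y1, tau) - np.quantile(y0, tau)
    print(f"tau={tau:.2f}  QTE={theta:.3f}")  # 2.000 at each tau
```

In practice only Z = TY_1 + (1 − T)Y_0 is observed, which is what necessitates the estimators developed in Section 2.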

Because we observe only Z, θ is generally not identifiable without further conditions. Throughout we make the following assumption, believed to be reasonable in many applications (Rosenbaum & Rubin, Citation1983): T ⊥ (Y_0, Y_1) | X, i.e. T and the vector of potential outcomes (Y_0, Y_1) are independent conditional on X. This is similar to the ignorable missingness assumption when we treat T as a missingness indicator and the unobserved Y_0 or Y_1 as a missing value. Under this assumption, two types of consistent estimators of the QTE θ have been proposed in the causal inference literature or the closely related missing data literature. One type is derived through regression on (T = k, X) for k = 0, 1 (Cattaneo, Citation2010; Chen, Wan, & Zhou, Citation2015; Cheng & Chu, Citation1996; Zhou, Wan, & Wang, Citation2008), and the other is based on inverse propensity weighting with propensity score P(T = 1 | X) (Firpo, Citation2007). A review is given by Cattaneo, Drukker, and Holland (Citation2013). Since parametric methods rely on correct model specifications, nonparametric estimation of the regression functions or the propensity is often preferred and is therefore considered in what follows.

In our ACO data, however, the dimension p of X is high, and nonparametric estimation of the regression or propensity function using, for example, the kernel method is asymptotically inefficient when Y_k is related to only a lower dimensional function of X. Unnecessarily using a high dimensional X may also affect kernel estimation numerically. Our main task is to study covariate dimension reduction to facilitate stable and efficient estimation of the QTE.

If inverse propensity weighting is applied, it seems natural that covariate dimension reduction should find a linear function S_T of X with the smallest dimension such that T ⊥ X | S_T. Unfortunately, Hahn (Citation1998) indicated that in the estimation of the ATE, using such an S_T provides no improvement in estimation efficiency over using the entire X. Because the outcome (Y_1, Y_0) is involved in the estimation of the ATE, Hahn (Citation2004) suggested finding a linear function S_{Y_0,Y_1} of X with the smallest possible dimension such that (Y_0, Y_1) ⊥ X | S_{Y_0,Y_1}, which also implies T ⊥ (Y_0, Y_1) | S_{Y_0,Y_1}. The resulting ATE estimator is asymptotically more efficient than the estimator using the entire X unless S_{Y_0,Y_1} = X. De Luna, Waernbaum, and Richardson (Citation2011) further considered an S_min which removes the components of S_{Y_0,Y_1} that are unrelated to T; this S_min is the smallest dimensional S ⊆ S_{Y_0,Y_1} satisfying T ⊥ S_{Y_0,Y_1} | S, which also implies T ⊥ (Y_0, Y_1) | S_min. However, it is proved in the Appendix that the asymptotic variance using S_min is larger than that using S_{Y_0,Y_1} unless S_min = S_{Y_0,Y_1}; see also Brookhart et al. (Citation2006) and Shortreed and Ertefaie (Citation2017).

Note that the estimation of θ = q_{1,τ} − q_{0,τ} can be done by estimating q_{0,τ} and q_{1,τ} separately and then taking the difference. If a linear function S_{Y_k} of X satisfies Y_k ⊥ X | S_{Y_k} and has the smallest dimension, then S_{Y_k} has a dimension no larger than that of the S_{Y_0,Y_1} defined in Hahn (Citation2004), k = 0, 1. Hence, our approach alleviates the curse of dimensionality further and produces an asymptotically more efficient estimator of θ.

In applications, S_{Y_0} and S_{Y_1} have to be estimated using the observed data. We adopt existing nonparametric sufficient dimension reduction methods (Cook & Weisberg, Citation1991; Li, Citation1991; Xia, Tong, Li, & Zhu, Citation2002) to construct estimators Ŝ_{Y_k} of S_{Y_k}. We establish the asymptotic normality of our estimator of θ based on Ŝ_{Y_0} and Ŝ_{Y_1}, and compare its efficiency with an asymptotic efficiency bound. We also compare the performance of various estimators in simulation studies and apply our method to the medical cost data from the UW Health System.

2. Methods

Without dimension reduction, three types of nonparametric estimators of θ have been proposed in the literature. The inverse propensity weighting (IPW) method (Firpo, Citation2007) is a weighted version of the procedure in Koenker and Bassett (Citation1978) for the quantile estimation problem.

The regression (REG) method (Cattaneo, Citation2010; Chen et al., Citation2015) estimates the function m_k(x, t) = E{ρ(Y_k, t) | X = x} = E{ρ(Z, t) | T = k, X = x} by m̂_k(x, t) using a nonparametric method and the data under T = k, for k = 0, 1 separately, where ρ(s, t) = (s − t)(τ − 1{s ≤ t}) is the check function (Koenker & Bassett, Citation1978) and 1{·} is the indicator function. Finally, Cattaneo et al. (Citation2013) and Chen et al. (Citation2015) combined IPW and REG to obtain the so-called augmented inverse propensity weighting (AIPW) estimator.
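To make the role of the check function concrete, the following sketch (ours, not the authors' code) verifies numerically that minimising the empirical check loss over t recovers the τth sample quantile:

```python
import numpy as np

def check_loss(s, t, tau):
    """Koenker-Bassett check function rho(s, t) = (s - t)(tau - 1{s <= t})."""
    return (s - t) * (tau - (s <= t))

rng = np.random.default_rng(1)
z = rng.exponential(scale=1.0, size=5000)  # right-skewed, like cost data
tau = 0.75

# Minimise the total check loss over a fine grid of candidate values t.
grid = np.linspace(z.min(), z.max(), 4000)
losses = [check_loss(z, t, tau).sum() for t in grid]
t_hat = grid[int(np.argmin(losses))]

print(t_hat, np.quantile(z, tau))  # the two values nearly coincide
```

This is exactly why replacing ρ(Z_i, t) by weighted or regression-smoothed versions, as in (1)–(3) below, yields quantile estimators.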

For each k, let S_{Y_k} = B_k′X with Y_k ⊥ X | S_{Y_k}, where B_k′ denotes the transpose of a p × d_k deterministic matrix B_k with the smallest possible d_k, k = 0, 1. As a consequence of Theorem 2.1 stated below, estimators using the S_{Y_k} as covariates are asymptotically more efficient than those using X as the covariate set when d_k < p (if d_k = p, then S_{Y_k} = X). In the estimation of the ATE, Hahn (Citation2004) recommended replacing X by S_{Y_0,Y_1}, but the dimension of S_{Y_0,Y_1} is no smaller than that of S_{Y_k}, which leads to an efficiency loss as a consequence of Theorem 2.1.

In applications, S_{Y_k} has to be estimated by Ŝ_{Y_k} = B̂_k′X, and we adopt a nonparametric sufficient dimension reduction method to construct B̂_k (Cook & Weisberg, Citation1991; Li, Citation1991; Ma & Zhu, Citation2012; Xia et al., Citation2002). Since the conditional distribution of Y_k given X is the same as that of Z given (X, T = k), we estimate each S_{Y_k} separately using the observed data (Z_i, X_i) in group T = k. To estimate the dimensions of S_{Y_0} and S_{Y_1}, we adopt consistent criteria such as the BIC-type criterion introduced by Zhu, Zhu, and Feng (Citation2010) and bootstrap-based criteria.
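For readers unfamiliar with sufficient dimension reduction, here is a minimal sketch of sliced inverse regression (Li, Citation1991) — one of the cited options, not the authors' exact implementation — recovering a single index direction:

```python
import numpy as np

def sir_directions(x, y, n_slices=10, d=1):
    """Sliced inverse regression: estimate d directions B so that,
    approximately, y depends on x only through B'x.  A sketch, not the
    authors' implementation."""
    n, p = x.shape
    mu = x.mean(axis=0)
    sigma = np.cov(x, rowvar=False)
    # Standardise: z = Sigma^{-1/2} (x - mu)
    evals, evecs = np.linalg.eigh(sigma)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - mu) @ inv_sqrt
    # Slice the sample on y and average z within each slice
    order = np.argsort(y)
    m = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        zbar = z[idx].mean(axis=0)
        m += len(idx) / n * np.outer(zbar, zbar)
    # Leading eigenvectors of M, mapped back to the original scale of x
    _, vecs = np.linalg.eigh(m)
    return inv_sqrt @ vecs[:, -d:]   # columns span the estimated subspace

# Toy check: y depends on x only through the single index b'x
rng = np.random.default_rng(2)
x = rng.normal(size=(5000, 7))
b = np.array([3.0, 6.0, 3.0, 0.0, 0.0, 0.0, 0.0])
y = x @ b + rng.normal(size=5000)
b_hat = sir_directions(x, y).ravel()
corr = abs(np.corrcoef(x @ b_hat, x @ b)[0, 1])
print(round(corr, 3))   # close to 1: the direction is recovered
```

In the paper's setting, this step would be run twice, once within each treatment group, to produce B̂_0 and B̂_1.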

Let Ŝ_{Y_k,i} = B̂_k′X_i, i = 1, …, n, k = 0, 1. In our IPW method, we estimate the propensity π_k(s_k) = P(T = k | S_{Y_k} = s_k) by π̂_k(s_k) using a nonparametric method for k = 0, 1 separately. The IPW estimator of θ is θ̂_IPW = q̂^IPW_{1,τ} − q̂^IPW_{0,τ}, where

(1)  q̂^IPW_{k,τ} = argmin_t Σ_{i=1}^n T_i^{(k)} ρ(Z_i, t)/π̂_k(Ŝ_{Y_k,i}),  k = 0, 1,

and T_i^{(1)} = T_i, T_i^{(0)} = 1 − T_i.
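A toy illustration of (1) with a scalar S (our sketch; for clarity the propensity is taken as known, whereas the paper estimates it nonparametrically): minimising the weighted check loss reduces to a weighted sample quantile with weights T_i/π_1(S_i).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
s = rng.normal(size=n)                # scalar reduced covariate S_{Y1}
pi1 = 1.0 / (1.0 + np.exp(-s))        # propensity P(T = 1 | S), known here
t = rng.binomial(1, pi1)
y1 = s + rng.normal(size=n)           # potential outcome under treatment
# Only the treated reveal Y1; confounding: treated tend to have larger S.

# IPW estimate of q_{1,tau}: the weighted check-loss minimiser is a
# weighted sample quantile with weights T_i / pi_1(S_i).
tau = 0.75
zs = y1[t == 1]
w = 1.0 / pi1[t == 1]
order = np.argsort(zs)
cum_w = np.cumsum(w[order]) / w.sum()
q_ipw = zs[order][np.searchsorted(cum_w, tau)]

q_naive = np.quantile(zs, tau)        # ignores confounding, biased upward
q_true = np.quantile(y1, tau)         # oracle: all of Y1 visible
print(q_ipw, q_naive, q_true)
```

Here the reweighting removes the upward bias of the naive treated-sample quantile, which is the essence of (1).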

The REG method estimates m_k(s_k, t) = E{ρ(Y_k, t) | S_{Y_k} = s_k} by m̂_k(s_k, t) using a nonparametric method for k = 0, 1 separately, and estimates θ by θ̂_REG = q̂^REG_{1,τ} − q̂^REG_{0,τ}, where

(2)  q̂^REG_{k,τ} = argmin_t Σ_{i=1}^n m̂_k(Ŝ_{Y_k,i}, t),  k = 0, 1.

We can combine IPW and REG to obtain our AIPW estimator θ̂_AIPW = q̂^AIPW_{1,τ} − q̂^AIPW_{0,τ}, where

(3)  q̂^AIPW_{k,τ} = argmin_t Σ_{i=1}^n [ T_i^{(k)} ρ(Z_i, t)/π̂_k(Ŝ_{Y_k,i}) − {T_i^{(k)} − π̂_k(Ŝ_{Y_k,i})}/π̂_k(Ŝ_{Y_k,i}) · m̂_k(Ŝ_{Y_k,i}, t) ],  k = 0, 1.

To estimate m_k(s_k, t) and π_k(s_k) in (1)–(3), we use nonparametric kernel estimators (Silverman, Citation1986):

m̂_k(s_k, t) = Σ_{i=1}^n T_i^{(k)} ρ(Z_i, t) K_{H_k}(Ŝ_{Y_k,i} − s_k) / Σ_{i=1}^n T_i^{(k)} K_{H_k}(Ŝ_{Y_k,i} − s_k),
π̂_k(s_k) = Σ_{i=1}^n T_i^{(k)} K_{H_k}(Ŝ_{Y_k,i} − s_k) / Σ_{i=1}^n K_{H_k}(Ŝ_{Y_k,i} − s_k),  k = 0, 1,

where K_{H_k}(s_k) = det(H_k^{−1}) K_k(H_k^{−1}s_k), K_k(·) is a d_k-dimensional kernel function, d_k is the dimension of Ŝ_{Y_k}, and H_k is the bandwidth matrix. When Ŝ_{Y_k} is standardised, we take H_k = h_{kn} I_{d_k} with scalar bandwidth h_{kn} and identity matrix I_{d_k} (Hardle, Muller, Sperlich, & Werwatz, Citation2004). As in Hu, Follmann, and Wang (Citation2014), the kernel estimators are computed using the r_k-th order Gaussian product kernel with standardised covariates. The bandwidth used here is h_{kn} = C n^{−2/(2r_k + d_k)}, where r_k is the order of K_k, k = 0, 1. To determine the constant C we adopt J-fold cross-validation, i.e. we select the C that minimises Σ_{j=1}^J (θ̂ − θ̂_{(j)})², where J is the total number of folds and θ̂_{(j)} is the estimator of θ computed with all data except those in the jth fold, j = 1, …, J. We use J = 10 in our simulations in Section 3.
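The kernel-smoothing step above can be sketched as follows for π̂ with a scalar S and a Gaussian kernel (our illustration; the bandwidth constant C = 1 is an arbitrary choice, not the paper's cross-validated value):

```python
import numpy as np

def nw_propensity(s_eval, s_obs, t_obs, h):
    """Nadaraya-Watson estimate of pi_1(s) = P(T = 1 | S = s) with a
    Gaussian kernel, K_h(u) = h^{-1} K(u/h); scalar-S sketch of pi-hat."""
    u = (s_eval[:, None] - s_obs[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (k * t_obs).sum(axis=1) / k.sum(axis=1)

rng = np.random.default_rng(4)
n = 50_000
s = rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-s))
t = rng.binomial(1, pi_true)

h = n ** (-1.0 / 5.0)        # rate-style bandwidth with constant C = 1
grid = np.array([-1.0, 0.0, 1.0])
pi_hat = nw_propensity(grid, s, t, h)
print(pi_hat)                # approximately [0.27, 0.50, 0.73]
```

The same weighted-average construction, with ρ(Z_i, t) in place of T_i in the numerator and restricted to group T = k, gives m̂_k.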

The following theorem establishes the asymptotic normality of the estimators in (1)–(3) and assesses their efficiency.

Theorem 2.1

Assume the conditions stated in the Appendix. Let θ̂(S_0, S_1) be one of θ̂_IPW, θ̂_REG, and θ̂_AIPW in (1)–(3) with Ŝ_{Y_k} replaced by S_k = B_k′X satisfying Y_k ⊥ X | S_k, k = 0, 1, and let θ̂(Ŝ_0, Ŝ_1) be the same estimator with S_k replaced by its estimator Ŝ_k = B̂_k′X, where √n vec(B̂_k − B_k) = n^{−1/2} Σ_{i=1}^n ψ_k(X_i, Z_i, T_i) + o_p(1) for some functions ψ_k with E{ψ_k(X, Z, T)} = 0, k = 0, 1, vec(M) is the column vector whose components are the elements of a matrix M, and o_p(1) denotes a quantity converging to 0 in probability. Then we have the following conclusions.

(i) √n{θ̂(S_0, S_1) − θ} is asymptotically normal with mean 0 and variance

(4)  V_{S_0,S_1} = var{E(g_1(Y_1)|S_1) − E(g_0(Y_0)|S_0)} + Σ_{k=0,1} E[var(g_k(Y_k)|S_k)/P(T = k|S_k)],

where g_k(Y_k) = (1{Y_k ≤ q_{k,τ}} − τ)/f_k(q_{k,τ}) and f_k is the p.d.f. of Y_k, k = 0, 1.

(ii) √n{θ̂(Ŝ_0, Ŝ_1) − θ} is asymptotically normal with mean 0 and variance

(5)  V*_{S_0,S_1} = V_{S_0,S_1} + var{Σ_{k=0,1} c_k′ψ_k(X, Z, T)} + 2 cov{Σ_{k=0,1} c_k′ψ_k(X, Z, T), S(X, Z, T)},

where c_k = vec( E[ {cov(X, T | S_k)/π_k(S_k)} ∂E(g_k(Y_k)|S_k)/∂S_k′ ] ) and

S(X, Z, T) = Σ_{k=0,1} (−1)^{k−1} [ {T^{(k)}/π_k(S_k)}{g_k(Y_k) − E(g_k(Y_k)|S_k)} + E(g_k(Y_k)|S_k) ].

Theorem 2.1(i) justifies our choice S_k = S_{Y_k}. V_{S_0,S_1} in (A1) is in fact the semiparametric efficiency bound for estimating θ, following the ideas in Bickel, Klaassen, Ritov, and Wellner (Citation1993), Hahn (Citation1998) and Firpo (Citation2007); details can be found in Lemma A.1 in the Appendix. In practice, however, the IPW estimator may not be efficient enough, as it does not fully extract the information contained in the auxiliary variables, whereas the REG and AIPW estimators use all observed covariates to improve estimation efficiency.

By (A1) and Jensen's inequality, among all linear functions (S_0, S_1) satisfying Y_k ⊥ X | S_k, k = 0, 1, V_{S_0,S_1} is minimised when each S_k has the smallest possible dimension, i.e. S_k = S_{Y_k}, k = 0, 1. In particular, this applies to S_0 = S_1 = S_{Y_0,Y_1} proposed in Hahn (Citation2004), since the dimension of S_{Y_0,Y_1} is no smaller than that of S_{Y_k}.

The sum of the last two terms on the right-hand side of (5) quantifies the price we may pay for estimating B_k by B̂_k. There is an efficiency loss due to estimating S_{Y_k} by Ŝ_{Y_k} when this sum is positive, while it is possible that this sum is negative so that we have an efficiency gain. If we further include the covariates related to T, i.e. consider S_{Y_k,T}, the smallest possible dimensional S_k satisfying T ⊥ X | S_k and Y_k ⊥ X | S_k, k = 0, 1, then cov(X, T | S_k) = 0 and c_k = 0; hence θ̂(Ŝ_{Y_0,T}, Ŝ_{Y_1,T}) and θ̂(S_{Y_0,T}, S_{Y_1,T}) are asymptotically equivalent. However, it is generally not a good idea to use (Ŝ_{Y_0,T}, Ŝ_{Y_1,T}), because each S_{Y_k,T} has a dimension no smaller than that of S_{Y_k}, and therefore both θ̂(Ŝ_{Y_0,T}, Ŝ_{Y_1,T}) and θ̂(S_{Y_0,T}, S_{Y_1,T}) are less efficient than θ̂(S_{Y_0}, S_{Y_1}) according to Theorem 2.1. Although θ̂(Ŝ_{Y_0}, Ŝ_{Y_1}) may be less efficient than θ̂(S_{Y_0}, S_{Y_1}) due to the estimation of S_{Y_k}, it may still be more efficient than θ̂(Ŝ_{Y_0,T}, Ŝ_{Y_1,T}). Some simulation results are given in Section 3.

In Theorem 2.1, the condition √n vec(B̂_k − B_k) = n^{−1/2} Σ_{i=1}^n ψ_k(X_i, Z_i, T_i) + o_p(1) with E{ψ_k(X, Z, T)} = 0 is satisfied for B̂_k obtained using some sufficient dimension reduction methods (Hsing & Carroll, Citation1992; Zhu & Ng, Citation1995).

3. Simulation

We investigate the finite-sample performance of the three estimators, θ̂_IPW, θ̂_REG, and θ̂_AIPW, with four choices of the linear functions (S_0, S_1): (1) S_k = S_{Y_k}, k = 0, 1; (2) S_0 = S_1 = S_{Y_0,Y_1}; (3) S_k = S_{Y_k,T}, k = 0, 1; and (4) S_0 = S_1 = S_T. For each choice of (S_0, S_1), we consider estimators using the true (S_0, S_1) as well as (Ŝ_0, Ŝ_1) obtained by sufficient dimension reduction. Thus, we consider a total of 3 × 4 × 2 = 24 cases. In each case, we consider the estimation of the QTEs with τ = 25%, 50%, and 75%, under two sample sizes, n = 200 and n = 1000.

In the first simulation, X = (X_1, …, X_7)′ with independent N(0, 1) components, P(T = 1|X) = e^{2X_4}{1 + e^{2X_4}}^{−1}, Y_0 = 3X_1 + 6X_2 + 3X_3 + ε_0, and Y_1 = 10 + 3X_1 + 6X_2 + 3X_3 + 3X_4 + ε_1, where the ε_k are independent N(0, 1) and are independent of X. The outcome models are linear in X, the treatment model is logistic, and the log conditional treatment odds is linear in X. Under this model, S_{Y_0}, S_{Y_1}, and S_T are all one-dimensional, while S_{Y_0,Y_1} = S_{Y_1,T} = S_{Y_0,T} is two-dimensional.
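This design can be reproduced as follows (a sketch of the stated data-generating process, not the authors' simulation code):

```python
import numpy as np

def simulate_one(n, rng):
    """Data-generating process of the first simulation, as stated above."""
    x = rng.normal(size=(n, 7))
    pi = 1.0 / (1.0 + np.exp(-2.0 * x[:, 3]))              # logistic in X4
    t = rng.binomial(1, pi)
    eps0, eps1 = rng.normal(size=n), rng.normal(size=n)
    y0 = 3 * x[:, 0] + 6 * x[:, 1] + 3 * x[:, 2] + eps0
    y1 = 10 + 3 * x[:, 0] + 6 * x[:, 1] + 3 * x[:, 2] + 3 * x[:, 3] + eps1
    z = np.where(t == 1, y1, y0)                           # observed outcome
    return x, t, z, y0, y1

rng = np.random.default_rng(5)
x, t, z, y0, y1 = simulate_one(20_000, rng)
# Oracle check using the potential outcomes themselves: the median
# treatment effect is the intercept shift of 10.
print(np.median(y1) - np.median(y0))   # close to 10
```

The estimators in Section 2, of course, only see (Z_i, X_i, T_i).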

In the second simulation, X = (X_1, …, X_7)′ with independent N(0, 1) components, P(T = 1|X) = e^{2X_5 + 0.7X_6² − 0.5X_7²}{1 + e^{2X_5 + 0.7X_6² − 0.5X_7²}}^{−1}, Y_0 = 3(X_1 + X_2 + 2X_3 + 2X_4) + 1.5X_6² + ε_0, and Y_1 = 12 + 3(X_1 + X_2 + 2X_3 + X_4 + X_5) + 1.5X_7² + ε_1, where the ε_k are independent N(0, 1) and are independent of X. The outcome models are nonlinear in X, the treatment model is logistic, and the log conditional treatment odds is nonlinear in X. Under this setting, each S_{Y_k} is two-dimensional, S_T is three-dimensional, while S_{Y_0,Y_1}, S_{Y_1,T}, and S_{Y_0,T} are four-dimensional and not the same.

Based on 1000 simulation runs, we calculate the simulated relative bias (RB) and standard deviation (SD) in each scenario. The results for the two simulations are given in Tables 1 and 2, respectively. The following conclusions can be drawn from the results in these tables.

  1. When the true (S0,S1) is used, (SY0,SY1) leads to the best performance overall, followed by SY0,Y1, (SY0,T,SY1,T), and ST, in agreement with our asymptotic results discussed in Section 2 and proved in the Appendix.

  2. When the estimated (Ŝ_{Y_0}, Ŝ_{Y_1}) is used, the resulting estimators of θ are in general less efficient than those based on the true (S_{Y_0}, S_{Y_1}), but they are still better than the estimators based on the other choices of (S_0, S_1), regardless of whether the true or estimated (S_0, S_1) is used.

  3. The performances of estimators using the true (SY0,T,SY1,T) and (SˆY0,T,SˆY1,T) are quite similar when n = 1000, in agreement with the asymptotic results in Theorem 2.1 and our discussion after Theorem 2.1. They are worse than those using (SˆY0,SˆY1).

  4. Consistent with the asymptotic theory, the performance of estimators using ST is the worst, and the efficiency loss is substantial in most cases. Note that using estimated ST is actually better than using the true ST.

  5. Regarding the three different estimation methods, θˆREG and θˆAIPW have very comparable performances and are recommended in practice.

Table 1. Relative bias and standard deviation for simulation 1 with true or estimated S0 and S1.

Table 2. Relative bias and standard deviation for simulation 2 with true or estimated S0 and S1.

4. Real data analysis

As mentioned in the introduction, the University of Wisconsin Health System became an Accountable Care Organization (ACO) and has implemented a Complex Care Management (CCM) programme since January 1, 2013. In particular, a team of nurses takes responsibility for coordinating and implementing complex patients' care plans. The CCM programme is very intensive in time and resources, and it is therefore important to evaluate its specific value.

We demonstrate the proposed estimation methods on a data set resulting from the University of Wisconsin Health ACO study, where the primary outcome Z is the annualised payment amount in thousands of dollars. The data set consists of 894 patients, with 186 in the CCM group (T = 1) and 708 not in the CCM group (T = 0).

Two issues with this data set actually motivated our study. First, the distribution of annualised payment is right-skewed, as shown by the box plots in Figure 1 for all patients and for the two groups. The overall median, mean, 75% quantile, and maximum of the observed payments are about 13, 31, 41, and 376 thousand dollars, respectively. This suggests the need for estimating quantile treatment effects. Second, the data set contains three discrete and ninety-four continuous covariates, including Medicare status, baseline payments, and other baseline characteristics of patients. Thus, dimension reduction is needed in nonparametric kernel estimation.

Figure 1. Boxplots of observed annualised payment amount (in thousands) for overall, CCM group T = 1, and non-CCM group T = 0.


For sufficient dimension reduction, we adopt the semiparametric directional regression method proposed by Ma and Zhu (Citation2012). After sufficient dimension reduction, the estimated S_T has 7 dimensions, S_{Y_0} has 5 dimensions, S_{Y_1} has 8 dimensions, and S_{Y_0,Y_1} has 13 dimensions.

Results for three choices of (S_0, S_1) considered in the simulations are shown in Table 3 for estimating the ATE and the QTEs with τ = 25%, 50%, and 75%. Standard errors (SE) for all estimates are calculated using the bootstrap with 200 replications.

Table 3. Estimates and standard errors (SE) for the University of Wisconsin Health ACO data.

From Table 3, the 25% and 50% QTEs are not significant under any method. When S_{Y_k} or S_{Y_0,Y_1} is used, the 75% QTE is significantly less than 0, and in terms of SE, the estimate using S_{Y_k}, k = 0, 1, is more efficient than the estimate using S_{Y_0,Y_1}. However, the estimate of the 75% QTE using S_T is insignificant, owing to the large variation induced by using S_T.

Since the 75% QTE is significantly negative, the result indicates that receiving the CCM intervention effectively helps reduce medical payments for high-cost patients. However, the CCM intervention is not as useful for low-cost or median-cost patients, as the 25% and 50% QTEs are not significant. These results may be useful for decision making in the ACO.

For comparison, we also include estimates of the ATE and their SEs. The results in the last block of Table 3 show that the ATE is not significant under any method. It is interesting that the estimates of the ATE are all negative whereas the estimates of the 50% QTE are all positive, although neither is significant; this may be caused by the fact that the distribution of annualised payment is right-skewed.

This example shows the usefulness of assessing QTEs at different quantile levels. If we only estimated the ATE, no useful conclusion could be drawn in this example. Even if we examined the 50% QTE instead of the ATE because of the skewness, we still could not obtain any useful conclusion.

Acknowledgments

We are grateful to the editor, the associate editor, and two anonymous referees for their insightful comments and suggestions, which have led to significant improvements.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

Our research was supported by the National Natural Science Foundation of China (11871287, 11831008), the Natural Science Foundation of Tianjin (18JCYBJC41100), the Fundamental Research Funds for the Central Universities, the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, the Chinese 111 Project (B14019), and the U.S. National Science Foundation (DMS-1612873 and DMS-1914411). This research was also partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1409-21219).

Notes on contributors

Ying Zhang

Ying Zhang is a Ph.D. candidate, Department of Statistics, University of Wisconsin-Madison.

Lei Wang

Dr Lei Wang holds a Ph.D. in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.

Menggang Yu

Dr Menggang Yu holds a Ph.D. in biostatistics from the University of Michigan. He is now a professor of biostatistics at the University of Wisconsin-Madison. Besides developing statistical methodology related to cancer research and clinical trials, Dr Yu is also very interested in health services research.

Jun Shao

Dr Jun Shao holds a Ph.D. in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

References

  • Abadie, A., Angrist, J., & Imbens, G. W. (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70, 91–117. doi: 10.1111/1468-0262.00270
  • Bickel, P. J., Klaassen, C. J., Ritov, Y., & Wellner, J. (1993). Efficient and adaptive inference in semiparametric models. Baltimore: Johns Hopkins University Press.
  • Brookhart, M., Schneeweiss, S., Rothman, K., Glynn, R., Avorn, J., & Sturmer, T. (2006). Variable selection for propensity score models. American Journal of Epidemiology, 163, 1149–1156. doi: 10.1093/aje/kwj149
  • Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155, 138–154. doi: 10.1016/j.jeconom.2009.09.023
  • Cattaneo, M. D., Drukker, D. M., & Holland, A. D. (2013). Estimation of multivalued treatment effects under conditional independence. The Stata Journal, 13, 407–450. doi: 10.1177/1536867X1301300301
  • Chen, X., Wan, A. T. K., & Zhou, Y. (2015). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110, 723–741. doi: 10.1080/01621459.2014.928219
  • Cheng, P. E., & Chu, C. (1996). Kernel estimation of distribution functions and quantiles with missing data. Statistica Sinica, 6, 63–78.
  • Chernozhukov, V., & Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73, 245–261. doi: 10.1111/j.1468-0262.2005.00570.x
  • Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 328–332.
  • De Luna, X., Waernbaum, I., & Richardson, T. S. (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, 98, 861–875. doi: 10.1093/biomet/asr041
  • Doksum, K. (1974). Empirical probability plots and statistical inference for nonlinear models in the two-sample case. The Annals of Statistics, 2, 267–277. doi: 10.1214/aos/1176342662
  • Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75, 259–276. doi: 10.1111/j.1468-0262.2007.00738.x
  • Frölich, M., & Melly, B. (2010). Estimation of quantile treatment effects with Stata. The Stata Journal, 10, 423–457. doi: 10.1177/1536867X1001000309
  • Frölich, M., & Melly, B. (2013). Unconditional quantile treatment effects under endogeneity. Journal of Business and Economic Statistics, 31, 346–357. doi: 10.1080/07350015.2013.803869
  • Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331. doi: 10.2307/2998560
  • Hahn, J. (2004). Functional restriction and efficiency in causal inference. Review of Economics and Statistics, 86, 73–76. doi: 10.1162/003465304323023688
  • Hardle, W., Muller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and semiparametric models. Heidelberg: Springer-Verlag.
  • Hsing, T., & Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20, 1040–1061. doi: 10.1214/aos/1176348669
  • Hu, Z., Follmann, D. A., & Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika, 101, 613–624. doi: 10.1093/biomet/asu022
  • Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50. doi: 10.2307/1913643
  • Lehman, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco: Holden-Day.
  • Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327. doi: 10.1080/01621459.1991.10475035
  • Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107, 168–179.
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
  • Shortreed, S. M., & Ertefaie, A. (2017). Outcome-adaptive Lasso: Variable selection for causal inference. Biometrics, 73, 1111–1122. doi: 10.1111/biom.12679
  • Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410. doi: 10.1111/1467-9868.03411
  • Zhou, Y., Wan, A. T. K., & Wang, X. (2008). Estimating equations inference with missing data. Journal of the American Statistical Association, 103, 1187–1199. doi: 10.1198/016214508000000535
  • Zhu, L. X., & Ng, K. W. (1995). Asymptotics of sliced inverse regression. Statistica Sinica, 5, 727–736.
  • Zhu, L. P., Zhu, L. X., & Feng, Z. H. (2010). Dimension reduction in regressions through cumulative slicing estimation. Journal of the American Statistical Association, 105, 1455–1466. doi: 10.1198/jasa.2010.tm09666

Appendices

Appendix 1. Semiparametric efficiency bound of estimating θ with Sk

Throughout the Appendix, S, S_k, S_{Y_k}, S_{Y_0,Y_1}, S_T, and S_min are linear functions of X; i.e. S = B′X with B a p × d matrix, S_k = B_k′X with B_k a p × d_k matrix, and so on.

Lemma A.1

Assume T ⊥ (Y_0, Y_1) | X and Y_k ⊥ X | S_k, and that the distribution of Y_k has density f_k with f_k(q_{k,τ}) > 0, k = 0, 1. A lower bound for the asymptotic variance of any asymptotically normal estimator of θ = q_{1,τ} − q_{0,τ} is given by

(A1)  V_{S_0,S_1} = var{E(g_1(Y_1)|S_1) − E(g_0(Y_0)|S_0)} + Σ_{k=0,1} E[var(g_k(Y_k)|S_k)/P(T = k|S_k)],

where g_k(Y_k) = (1{Y_k ≤ q_{k,τ}} − τ)/f_k(q_{k,τ}), k = 0, 1. If Y_k ⊥ X | S_k, Y_k ⊥ X | S̃_k, and L(S_k) ⊆ L(S̃_k), k = 0, 1, then V_{S_0,S_1} ≤ V_{S̃_0,S̃_1}, where L(S) denotes the linear space generated by the columns of B for S = B′X.

Proof

Proof of Lemma A.1

Our derivation of the efficiency bound mimics the proof in Firpo (Citation2007), which is a direct application of the semiparametric efficiency theory of Bickel et al. (Citation1993). Following the proof of Firpo (Citation2007), one may easily see that knowing T|X = T|S_T does not change the semiparametric efficiency bound, similarly to the ATE case in Hahn (Citation1998). In our proof of Lemma A.1, one only needs to carefully keep S_1 and S_0 separate in the derivation; the construction of the efficient influence function is more involved algebraically. We only provide a sketch of the proof for the case S_k = S_{Y_k} here.

The density of (Y_0, Y_1, T, X) at (y_0, y_1, k, x) is

q(y_0, y_1, k, x) = g(y_0, y_1|x) π(x)^k {1 − π(x)}^{1−k} f(x),

where g(y_0, y_1|x) denotes the conditional density of (Y_0, Y_1) given X, f(x) the marginal density of X, and π(x) = P(T = 1|X = x). The density of (Z, T, X) at (z, k, x) is then

q(z, k, x) = {g_1(z|x)π(x)}^k {g_0(z|x)(1 − π(x))}^{1−k} f(x) = {h_1(z|s_{y_1})π(x)}^k {h_0(z|s_{y_0})(1 − π(x))}^{1−k} f(x),

where g_1(·|x) = ∫ g(y_0, ·|x) dy_0 and g_0(·|x) = ∫ g(·, y_1|x) dy_1. The second equality holds because, by the definition of S_{Y_1} and S_{Y_0}, there exist functions h_1 and h_0 such that g_1(·|x) = h_1(·|s_{y_1}) and g_0(·|x) = h_0(·|s_{y_0}). For a regular parametric submodel with parameter ω,

q_ω(z, k, x) = {h_1(z|s_{y_1}, ω)π(x, ω)}^k {h_0(z|s_{y_0}, ω)(1 − π(x, ω))}^{1−k} f(x, ω).

The score function of this parametric submodel is

s(z, k, x|ω) = k s_1(z|s_{y_1}, ω) + (1 − k) s_0(z|s_{y_0}, ω) + {k − π(x, ω)} {∂π(x, ω)/∂ω} / [π(x, ω){1 − π(x, ω)}] + d(x, ω),

where d(x, ω) = f(x, ω)^{−1} ∂f(x, ω)/∂ω and s_j(z|s_{y_j}, ω) = h_j(z|s_{y_j}, ω)^{−1} ∂h_j(z|s_{y_j}, ω)/∂ω, j = 0, 1. Therefore the tangent space equals

T = { k s_1(z|s_{y_1}) + (1 − k) s_0(z|s_{y_0}) + a(x){k − π(x)} + d(x) : (s_0, s_1, d, a) satisfies ∫ s_j(z|s_{y_j}) h_j(z|s_{y_j}) dz = 0, ∫ d(x) f(x) dx = 0, and a(x) is unrestricted }.

For the parametric submodel under consideration, q_{k,τ}(ω), the τth quantile of Y_k, k = 0, 1, satisfies

0 = E_ω(1{Y_k ≤ q_{k,τ}(ω)} − τ) = ∫∫ (1{z ≤ q_{k,τ}(ω)} − τ) h_k(z|s_{y_k}, ω) dz f(x, ω) dx.

Let θ(ω) = q_{1,τ}(ω) − q_{0,τ}(ω), and recall g_k(Y_k) = (1{Y_k ≤ q_{k,τ}} − τ)/f_k(q_{k,τ}), k = 0, 1. By an application of Leibnitz's rule,

∂θ(ω)/∂ω = ∫∫ g_1(z) s_1(z|s_{y_1}, ω) h_1(z|s_{y_1}, ω) f(x, ω) dz dx − ∫∫ g_0(z) s_0(z|s_{y_0}, ω) h_0(z|s_{y_0}, ω) f(x, ω) dz dx + ∫ E_ω{g_1(Z) − g_0(Z)|X = x} d(x, ω) f(x, ω) dx.

Let

F(Z, T, X) = T{g_1(Z) − E(g_1(Z)|S_{Y_1})}/P(T = 1|S_{Y_1}) − (1 − T){g_0(Z) − E(g_0(Z)|S_{Y_0})}/{1 − P(T = 1|S_{Y_0})} + E{g_1(Z) − g_0(Z)|X},

and let ω_0 denote the true parameter value, i.e. θ = θ(ω_0). Then

(A2)  E{F(Z, T, X) s(Z, T, X|ω_0)} = E[ T{g_1(Z) − E(g_1(Z)|S_{Y_1})}/P(T = 1|S_{Y_1}) · s(Z, T, X|ω_0) ] − E[ (1 − T){g_0(Z) − E(g_0(Z)|S_{Y_0})}/{1 − P(T = 1|S_{Y_0})} · s(Z, T, X|ω_0) ] + E[ E{g_1(Z) − g_0(Z)|X} s(Z, T, X|ω_0) ].

For the three terms in (A2), after some algebra, we have, respectively,

E[ T{g_1(Z) − E(g_1(Z)|S_{Y_1})}/P(T = 1|S_{Y_1}) · s(Z, T, X|ω_0) ] = E[{g_1(Y_1) − E(g_1(Y_1)|S_{Y_1})} s_1(Y_1|S_{Y_1}, ω_0)],
E[ (1 − T){g_0(Z) − E(g_0(Z)|S_{Y_0})}/{1 − P(T = 1|S_{Y_0})} · s(Z, T, X|ω_0) ] = E[{g_0(Y_0) − E(g_0(Y_0)|S_{Y_0})} s_0(Y_0|S_{Y_0}, ω_0)],
E[ E{g_1(Z) − g_0(Z)|X} s(Z, T, X|ω_0) ] = E[ E{g_1(Y_1) − g_0(Y_0)|X} d(X, ω_0) ].

Therefore E{F(Z, T, X) s(Z, T, X|ω_0)} = ∂θ(ω_0)/∂ω. The efficiency bound is the expected square of the projection of F onto T. Because F ∈ T, the projection of F onto T is F itself, and the conclusion follows.

For the second part of Lemma A.1, suppose S_0 and S_1 satisfy L(S_0) ⊇ L(S_{Y_0}) and L(S_1) ⊇ L(S_{Y_1}). Since Y_k ⊥ X | S_k for both choices,

V_{S_{Y_0},S_{Y_1}} = var{E(g_1(Y_1)|X) − E(g_0(Y_0)|X)} + E[var(g_1(Y_1)|X)/P(T = 1|S_{Y_1})] + E[var(g_0(Y_0)|X)/P(T = 0|S_{Y_0})],
V_{S_0,S_1} = var{E(g_1(Y_1)|X) − E(g_0(Y_0)|X)} + E[var(g_1(Y_1)|X)/P(T = 1|S_1)] + E[var(g_0(Y_0)|X)/P(T = 0|S_0)].

We only need to prove E[var(g_1(Y_1)|X)/P(T = 1|S_{Y_1})] ≤ E[var(g_1(Y_1)|X)/P(T = 1|S_1)]; the k = 0 case is analogous. By Jensen's inequality,

1/E{E(T|S_1)|S_{Y_1}} ≤ E[{E(T|S_1)}^{−1} | S_{Y_1}].

Since var(g_1(Y_1)|X) = var(g_1(Y_1)|S_{Y_1}) is a function of S_{Y_1}, the conclusion follows from

E[var(g_1(Y_1)|X)/P(T = 1|S_{Y_1})] = E[ var(g_1(Y_1)|X)/E{E(T|S_1)|S_{Y_1}} ] ≤ E[ var(g_1(Y_1)|X) E[{E(T|S_1)}^{−1}|S_{Y_1}] ] = E[ E{var(g_1(Y_1)|X)/E(T|S_1) | S_{Y_1}} ] = E[var(g_1(Y_1)|X)/P(T = 1|S_1)].

Appendix 2

Conditions for Theorem 2.1

(C1)

$Y_k$ is a continuous random variable and, for any fixed $\tau\in(0,1)$, there exists a unique $q_{k,\tau}$ such that $P(Y_k\le q_{k,\tau})=\tau$ for $k=0,1$.

(C2)

$\pi_k(S_k)$ is bounded away from 0 and 1.

(C3)

$S_k$ has compact support for $k=0,1$.

(C4)

The function $\pi_k(S_k)$, the density function $f(S_k)$ and $E(1\{Y_k\le q_{k,\tau}\}\mid S_k)$ all have bounded partial derivatives with respect to $S_k$ up to order $r_k$, and $f(S_k)\pi_k(S_k)$ is bounded away from 0.

(C5)

The kernel $K_k$ is bounded, with bounded derivatives up to second order.

(C6)

The smoothing bandwidth $h_{kn}$ satisfies $nh_{kn}^2\to\infty$, $nh_{kn}^{d_k}\to\infty$ and $\sqrt{n}\,h_{kn}^{r_k}\to0$ as $n\to\infty$. Here $r_k$ is the order of the kernel $K_k$.
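For instance, with $d_k=1$ and a second-order kernel ($r_k=2$), any bandwidth $h_{kn}=n^{-a}$ with $1/4<a<1/2$ satisfies (C6). The snippet below (our own illustration; the exponent $a=1/3$ is just one admissible choice) checks the three rate conditions numerically:

```python
import numpy as np

a, d, r = 1 / 3, 1, 2            # illustrative: h_n = n^(-1/3), d_k = 1, second-order kernel
n = np.logspace(3, 9, 7)         # n = 1e3, 1e4, ..., 1e9
h = n ** (-a)

nh2 = n * h**2                   # n h^2: must diverge
nhd = n * h**d                   # n h^{d_k}: must diverge
rate = np.sqrt(n) * h**r         # sqrt(n) h^{r_k}: must vanish

assert np.all(np.diff(nh2) > 0) and nh2[-1] > 1e2
assert np.all(np.diff(nhd) > 0)
assert np.all(np.diff(rate) < 0) and rate[-1] < 1e-1
```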

Appendix 3

Proof of Theorem 2.1

Proof


(i) In the case that $S_k=X$, Firpo (Citation2007) proved the asymptotics of $\hat\theta_{\mathrm{IPW}}$ using the kernel method, and Chen et al. (Citation2015) proved the asymptotics of $\hat\theta_{\mathrm{REG}}$ using the kernel method. Following the proofs in their papers and substituting $S_k$ for $X$, $\sqrt{n}\{\hat\theta(S_0,S_1)-\theta\}$ is asymptotically equivalent to
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^n\Big[\frac{T_ig_1(Z_i)}{\pi_1(S_{1i})}-\frac{E\{g_1(Y_1)\mid S_{1i}\}\{T_i-\pi_1(S_{1i})\}}{\pi_1(S_{1i})}\Big]
-\frac{1}{\sqrt{n}}\sum_{i=1}^n\Big[\frac{(1-T_i)g_0(Z_i)}{\pi_0(S_{0i})}-\frac{E\{g_0(Y_0)\mid S_{0i}\}\{(1-T_i)-\pi_0(S_{0i})\}}{\pi_0(S_{0i})}\Big]+o_p(1),\tag{A3}
\]
where $S_{ki}$ is the $i$th observation of $S_k$ and $\pi_k(S_{ki})=P(T=k\mid S_k=S_{ki})$ for $k=0,1$. By direct but tedious calculation, the covariance of the two summation terms in (A3) is $-E\{g_0(Y_0)\}E\{g_1(Y_1)\}+E\{E(g_0(Y_0)\mid S_0)E(g_1(Y_1)\mid S_1)\}$. Their corresponding variances are
\[
\mathrm{Var}\Big[\frac{Tg_1(Z)}{\pi_1(S_1)}-\frac{E(g_1(Y_1)\mid S_1)}{\pi_1(S_1)}\{T-\pi_1(S_1)\}\Big]
=\mathrm{Var}\{E(g_1(Y_1)\mid S_1)\}+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_1)}{\pi_1(S_1)},
\]
\[
\mathrm{Var}\Big[\frac{(1-T)g_0(Z)}{\pi_0(S_0)}-\frac{E(g_0(Y_0)\mid S_0)}{\pi_0(S_0)}\{(1-T)-\pi_0(S_0)\}\Big]
=\mathrm{Var}\{E(g_0(Y_0)\mid S_0)\}+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_0)}{\pi_0(S_0)}.
\]
Thus the asymptotic variance of $\hat\theta(S_0,S_1)$ is
\[
\mathrm{Var}\{E(g_1(Y_1)\mid S_1)-E(g_0(Y_0)\mid S_0)\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_1)}{\pi_1(S_1)}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_0)}{\pi_0(S_0)}.
\]
(ii) Here we only give the proof for the regression-type estimator $\hat\theta_{\mathrm{REG}}$ with $d_0=d_1=1$, and we only derive the difference in $\hat q_{1,\tau}$ between using the true $B_k$ and the estimated $B_k$; the proof for $\hat q_{0,\tau}$ is similar. For simplicity, in the following proof we write $S$, $B$, $h$, $K$, $g(\cdot)$, $\pi(\cdot)$ for $S_1$, $B_1$, $h_{1n}$, $K_1$, $g_1(\cdot)$, $\pi_1(\cdot)$, respectively, and define $K_h(\cdot)=h^{-1}K(\cdot/h)$.
Let $\Delta_{ij}=K_h(\hat B^\top X_j-\hat B^\top X_i)-K_h(B^\top X_j-B^\top X_i)$. It can be verified that
\[
\frac{1}{n}\sum_{i=1}^n\Big[\frac{\sum_{j=1}^nT_jg(Z_j)K_h(\hat B^\top X_j-\hat B^\top X_i)}{\sum_{j=1}^nT_jK_h(\hat B^\top X_j-\hat B^\top X_i)}-\frac{\sum_{j=1}^nT_jg(Z_j)K_h(B^\top X_j-B^\top X_i)}{\sum_{j=1}^nT_jK_h(B^\top X_j-B^\top X_i)}\Big]
=\frac{1}{n}\sum_{i=1}^n\Big[\frac{\sum_{j=1}^nT_jg(Z_j)K_h(S_j-S_i)+\sum_{j=1}^nT_jg(Z_j)\Delta_{ij}}{\sum_{j=1}^nT_jK_h(S_j-S_i)+\sum_{j=1}^nT_j\Delta_{ij}}-\frac{\sum_{j=1}^nT_jg(Z_j)K_h(S_j-S_i)}{\sum_{j=1}^nT_jK_h(S_j-S_i)}\Big]
\equiv A_1+A_2+A_3,
\]
where
\[
A_1=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\frac{T_jg(Z_j)\Delta_{ij}-T_jE(g(Z_i)\mid S_i)\Delta_{ij}}{\pi(S_i)f(S_i)},
\]
\[
A_2=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nT_jg(Z_j)\Delta_{ij}\Big[\frac{1}{n^{-1}\sum_{l=1}^nT_lK_h(S_l-S_i)+n^{-1}\sum_{l=1}^nT_l\Delta_{il}}-\frac{1}{\pi(S_i)f(S_i)}\Big],
\]
\[
A_3=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nT_j\Delta_{ij}\Big[\frac{E(g(Z_i)\mid S_i)}{\pi(S_i)f(S_i)}-\frac{\sum_{l=1}^nT_lg(Z_l)K_h(S_l-S_i)\big/\sum_{l=1}^nT_lK_h(S_l-S_i)}{n^{-1}\sum_{l=1}^nT_lK_h(S_l-S_i)+n^{-1}\sum_{l=1}^nT_l\Delta_{il}}\Big].
\]
Since $\Delta_{ij}=K_h(\hat B^\top X_j-\hat B^\top X_i)-K_h(B^\top X_j-B^\top X_i)$, a Taylor expansion of $\Delta_{ij}$ around $B^\top X_j-B^\top X_i$, plugged into $A_1$, gives
\[
A_1=\frac{(\hat B-B)^\top}{n^2}\sum_{i=1}^n\sum_{j=1}^n\frac{T_j\{g(Z_j)-E(g(Z_i)\mid S_i)\}}{\pi(S_i)f(S_i)}\,
\frac{1}{h}K'\Big(\frac{B^\top X_j-B^\top X_i}{h}\Big)\frac{X_j-X_i}{h}+o_p(n^{-1/2})
\equiv\frac{(\hat B-B)^\top}{n^2}\sum_{i=1}^n\sum_{j=1}^nQ_{ij}+o_p(n^{-1/2}).
\]
Denote $A_{11}=\sum_{i}\sum_{j}Q_{ij}/n^2$ and $\breve A_{11}=\sum_{i}\sum_{j}E(Q_{ij}\mid X_i,g(Z_i),T_i)/n^2$. Note that
\[
E\Big\{\frac{1}{h}T_jK'\Big(\frac{S_j-s_i}{h}\Big)\frac{X_j-x_i}{h}\,\Big|\,(X_i,Z_i,T_i)=(x_i,z_i,t_i)\Big\}
=E\Big[\frac{1}{h^2}K'\Big(\frac{S_j-s_i}{h}\Big)E\{T_j(X_j-x_i)\mid S_j\}\Big]
=\frac{\partial}{\partial t}\big[E\{T(X-x_i)\mid S=t\}f(t)\big]\Big|_{t=s_i}+o_p(1)
=\frac{\partial E(TX\mid S=t)}{\partial t}\Big|_{t=s_i}f(s_i)+E(TX\mid S=s_i)f'(s_i)-x_i\pi'(s_i)f(s_i)-x_i\pi(s_i)f'(s_i)+o_p(1),
\]
and
\[
E\Big\{\frac{1}{h}T_jg(Z_j)K'\Big(\frac{S_j-s_i}{h}\Big)\frac{X_j-x_i}{h}\,\Big|\,(X_i,Z_i,T_i)=(x_i,z_i,t_i)\Big\}
=E\Big[\frac{1}{h^2}K'\Big(\frac{S_j-s_i}{h}\Big)E\{T_jg(Z_j)(X_j-x_i)\mid S_j\}\Big]
=\frac{\partial}{\partial t}\big[E\{Tg(Z)(X-x_i)\mid S=t\}f(t)\big]\Big|_{t=s_i}+o(1)
=\frac{\partial E(Tg(Z)X\mid S=t)}{\partial t}\Big|_{t=s_i}f(s_i)+E(Tg(Z)X\mid S=s_i)f'(s_i)
-x_i\pi'(s_i)E(g(Z)\mid S=s_i)f(s_i)-x_i\pi(s_i)\frac{\partial E(g(Z)\mid S=t)}{\partial t}\Big|_{t=s_i}f(s_i)-x_i\pi(s_i)E(g(Z)\mid S=s_i)f'(s_i)+o_p(1).
\]
Therefore,
\[
\breve A_{11}=\frac{1}{n}\sum_{i=1}^n\frac{1}{\pi(s_i)f(s_i)}\Big[\mathrm{cov}(TX,g(Z)\mid S=s_i)f'(s_i)
+\frac{\partial\,\mathrm{cov}(TX,g(Z)\mid S=t)}{\partial t}\Big|_{t=s_i}f(s_i)
+\{E(TX\mid S=s_i)-x_i\pi(s_i)\}\frac{\partial E(g(Z)\mid S=t)}{\partial t}\Big|_{t=s_i}f(s_i)\Big]+o_p(1)
=(c_1)_{p\times1}+o_p(1),
\]
where
\[
c_1=E\Big[\frac{1}{\pi(S)f(S)}\frac{\partial\{\mathrm{cov}(TX,g(Z)\mid S)f(S)\}}{\partial S}\Big]
+E\Big[\frac{\mathrm{cov}(X,T\mid S)}{\pi(S)}\frac{\partial E(g(Z)\mid S)}{\partial S}\Big].
\]
It can be seen that the first term in $c_1$ equals 0 if $Y_1\perp X\mid S$, while the second term in $c_1$ equals 0 if $T\perp X\mid S$.
Thus, when both $Y_1\perp X\mid S$ and $T\perp X\mid S$ hold, we have $c_1=0$. Let $A_{11j}=(1/n)\sum_{i=1}^nQ_{ij}$ and $\breve A_{11j}=(1/n)\sum_{i=1}^nE(Q_{ij}\mid X_i,g(Z_i),T_i)$. We have
\[
E(A_{11}-\breve A_{11})^2
=\frac{1}{n^2}\sum_{j=1}^nE(A_{11j}-\breve A_{11j})^2
+\frac{1}{n^2}\sum_{j\ne k}E(A_{11j}-\breve A_{11j})E(A_{11k}-\breve A_{11k})
=\frac{1}{n}E(A_{11j}-\breve A_{11j})^2
=\frac{1}{n}\{E(A_{11j}^2)-E(\breve A_{11j}^2)\}
\le\frac{1}{n}E(A_{11j}^2)=o(1).
\]
Thus we have $A_{11}=c_1+o_p(1)$, which leads to $\sqrt{n}A_1=c_1^\top\{\sqrt{n}(\hat B-B)\}+o_p(1)$. For $A_2$, we also use a Taylor expansion of $\Delta_{ij}$:
\[
A_2=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\frac{T_jg(Z_j)}{h}K'\Big(\frac{B^\top X_j-B^\top X_i}{h}\Big)(\hat B-B)^\top\frac{X_j-X_i}{h}
\Big[\frac{1}{n^{-1}\sum_{l=1}^nT_lK_h(S_l-S_i)+n^{-1}\sum_{l=1}^nT_l\Delta_{il}}-\frac{1}{\pi(S_i)f(S_i)}\Big]+o_p(n^{-1/2}).
\]
We then decompose $A_2$ by conditioning on indices $i$, $j$; that is, we define
\[
\breve A_2=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\frac{T_jg(Z_j)}{h}K'\Big(\frac{B^\top X_j-B^\top X_i}{h}\Big)(\hat B-B)^\top\frac{X_j-X_i}{h}\,
E\Big[\frac{1}{n^{-1}\sum_{l=1}^nT_lK_h(S_l-S_i)+n^{-1}\sum_{l=1}^nT_l\Delta_{il}}-\frac{1}{\pi(S_i)f(S_i)}\,\Big|\,X_i,g(Z_i),T_i\Big].
\]
Since
\[
E\Big\{\frac{1}{n}\sum_{l=1}^nT_lK_h(S_l-S_i)\,\Big|\,S_i\Big\}=\pi(S_i)f(S_i)+o_p(1),\qquad
E\Big\{\frac{1}{n}\sum_{l=1}^nT_lg(Z_l)K_h(S_l-S_i)\,\Big|\,S_i\Big\}=\pi(S_i)E(g(Z_i)\mid S_i)f(S_i)+o_p(1),
\]
using a similar decomposition method as for $A_1$, we can also show $\sqrt{n}A_2\stackrel{p}{\to}0$ and $\sqrt{n}A_3\stackrel{p}{\to}0$. Thus we have proved that
\[
\frac{1}{n}\sum_{i=1}^n\hat E\{g(Y_{1i})\mid\hat S_i\}-\frac{1}{n}\sum_{i=1}^n\hat E\{g(Y_{1i})\mid S_i\}
=\frac{1}{n}\sum_{i=1}^n\Big[\frac{\sum_{j=1}^nT_jg(Z_j)K_h(\hat S_j-\hat S_i)}{\sum_{j=1}^nT_jK_h(\hat S_j-\hat S_i)}-\frac{\sum_{j=1}^nT_jg(Z_j)K_h(S_j-S_i)}{\sum_{j=1}^nT_jK_h(S_j-S_i)}\Big]
=c_1^\top(\hat B-B)+o_p(1/\sqrt{n}).
\]
Note that the REG estimator of $q_{1,\tau}$ based on the estimated $S$ is
\[
\hat q_{1,\tau}=\arg\min_t\frac{1}{n}\sum_{i=1}^n\hat E\big\{(Y_{1i}-t)(\tau-1\{Y_{1i}\le t\})\,\big|\,\hat S_i\big\}
=\arg\min_t\sum_{i=1}^n\Big[\hat E\big\{(1\{Y_{1i}\le q_{1,\tau}\}-\tau)(t-q_{1,\tau})\,\big|\,\hat S_i\big\}
+\hat E\big\{(Y_{1i}-t)(1\{Y_{1i}\le q_{1,\tau}\}-1\{Y_{1i}\le t\})\,\big|\,\hat S_i\big\}\Big].
\]
Let $u=\sqrt{n}(t-q_{1,\tau})$ and $\hat u=\sqrt{n}(\hat q_{1,\tau}-q_{1,\tau})$; the optimisation becomes
\[
\hat u=\arg\min_u\sum_{i=1}^n\frac{u}{\sqrt{n}}\hat E\big\{(1\{Y_{1i}\le q_{1,\tau}\}-\tau)\,\big|\,\hat S_i\big\}
+\sum_{i=1}^n\hat E\big\{\big(Y_{1i}-(q_{1,\tau}+u/\sqrt{n})\big)\big(1\{Y_{1i}\le q_{1,\tau}\}-1\{Y_{1i}\le q_{1,\tau}+u/\sqrt{n}\}\big)\,\big|\,\hat S_i\big\}.
\]
Similarly to the proof in Firpo (Citation2007), one may check that the second term equals $\{f_1(q_{1,\tau})/2\}u^2+o_p(1)$. Hence we have
\[
\hat u=\sqrt{n}(\hat q_{1,\tau}-q_{1,\tau})
=-\frac{1}{\sqrt{n}f_1(q_{1,\tau})}\sum_{i=1}^n\hat E\big\{(1\{Y_{1i}\le q_{1,\tau}\}-\tau)\,\big|\,\hat S_i\big\}
=\frac{1}{\sqrt{n}}\sum_{i=1}^n\hat E\{g(Y_{1i})\mid\hat S_i\}
=\frac{1}{\sqrt{n}}\sum_{i=1}^n\hat E\{g(Y_{1i})\mid S_i\}+c_1^\top\{\sqrt{n}(\hat B-B)\}+o_p(1)
=\frac{1}{\sqrt{n}}\sum_{i=1}^n\Big[\frac{T_ig_1(Z_i)}{\pi_1(S_{1i})}-\frac{E\{g_1(Y_1)\mid S_{1i}\}\{T_i-\pi_1(S_{1i})\}}{\pi_1(S_{1i})}\Big]+c_1^\top\{\sqrt{n}(\hat B-B)\}+o_p(1).
\]
The last equality follows from the proof in Chen et al. (Citation2015), which gives the linearisation of the REG estimator using the true $S$. Repeating the above procedure for $\hat q_{0,\tau}$ and plugging in the linearisation of $\sqrt{n}(\hat B-B)$, we obtain the linearisation of $\hat\theta(S_0,S_1)-\theta$; hence Theorem 2.1 is proved.
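For intuition, the regression-type construction analysed in part (ii) — kernel-smooth the indicator $1\{Y_1\le q\}$ over $S$ among the treated, average $\hat E(1\{Y_1\le q\}\mid S_i)$ over all $i$, and invert in $q$ — can be sketched in a few lines. This is our own illustrative sketch (Gaussian kernel, grid inversion, toy data), not the authors' implementation:

```python
import numpy as np

def reg_quantile(y, t, s, tau, h, grid):
    """Sketch of a REG-type estimate of q_{1,tau}: Nadaraya-Watson smoothing
    of 1{Y <= q} on S among the treated, then inversion of the averaged
    smoothed CDF over an increasing grid of candidate quantiles."""
    # weight matrix: entry (i, j) proportional to T_j K_h(S_j - S_i)
    w = np.exp(-0.5 * ((s[:, None] - s[None, :]) / h) ** 2) * t[None, :]
    w /= w.sum(axis=1, keepdims=True)            # normalize each row
    for q in grid:
        smoothed_cdf = w @ (y <= q).astype(float)  # \hat E(1{Y_1 <= q} | S_i)
        if smoothed_cdf.mean() >= tau:
            return q
    return grid[-1]

# toy design: S ~ U(0,1), randomized T, Y(1) = S + noise, so q_{1,0.5} = 0.5
rng = np.random.default_rng(2)
n = 500
s = rng.uniform(0, 1, n)
t = rng.binomial(1, 0.5, n)
y = s + rng.normal(0, 0.25, n)
q_hat = reg_quantile(y, t, s, tau=0.5, h=0.1, grid=np.linspace(-0.5, 1.5, 41))
```

Only the treated observations receive positive kernel weight, mirroring the $T_jK_h(S_j-S_i)$ terms in the proof; the grid inversion stands in for the check-function minimisation.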

Appendix 4

Asymptotic variance comparisons between using $S_{\min}$ and using $S_{Y_0,Y_1}$

We first prove that the asymptotic variance using $S_T$ is larger than that using $X$. Following the proof below, one may also easily prove that the asymptotic variance using $S_{\min}$ of De Luna et al. (Citation2011) is larger than $V_{S_{Y_0,Y_1}}$ unless $S_{\min}=S_{Y_0,Y_1}$, by replacing the original covariate set $X$ with $S_{Y_0,Y_1}$ and replacing $S_T$ with $S_{\min}$. Adapting the proof of Theorem 2.1, we find that for any $S$ satisfying $T\perp(Y_0,Y_1)\mid S$, the asymptotic variance of $\hat\theta$ using $(S,S)$ is
\[
\mathrm{Var}\{E(g_1(Y_1)\mid S)-E(g_0(Y_0)\mid S)\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S)}{\pi_1(S)}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S)}{\pi_0(S)}.
\]
Since $S_T$ satisfies $T\perp X\mid S_T$ and we also have $T\perp(Y_0,Y_1)\mid X$, it follows that $T\perp(Y_0,Y_1)\mid S_T$. Therefore the asymptotic variance of $\hat\theta$ using $S_T$ is
\[
V_{S_T}=\mathrm{Var}\{E(g_1(Y_1)\mid S_T)-E(g_0(Y_0)\mid S_T)\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_T)}{\pi_1(S_T)}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_T)}{\pi_0(S_T)},
\]
and the asymptotic variance of $\hat\theta$ using $X$ is
\[
V_X=\mathrm{Var}\{E(g_1(Y_1)\mid X)-E(g_0(Y_0)\mid X)\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid X)}{\pi_1(X)}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid X)}{\pi_0(X)}.
\]
Therefore
\begin{align}
V_X-V_{S_T}&=E\big[\{\pi_1^{-1}(X)-1\}\mathrm{Var}(g_1(Y_1)\mid X)\big]-E\big[\{\pi_1^{-1}(S_T)-1\}\mathrm{Var}(g_1(Y_1)\mid S_T)\big]\tag{A4}\\
&\quad+E\big[\{\pi_0^{-1}(X)-1\}\mathrm{Var}(g_0(Y_0)\mid X)\big]-E\big[\{\pi_0^{-1}(S_T)-1\}\mathrm{Var}(g_0(Y_0)\mid S_T)\big]\tag{A5}\\
&\quad+2E\{E(g_1(Y_1)\mid S_T)E(g_0(Y_0)\mid S_T)\}-2E\{E(g_1(Y_1)\mid X)E(g_0(Y_0)\mid X)\}.\tag{A6}
\end{align}
Let
\[
a_1(S_T)=\sqrt{\pi_1^{-1}(X)-1}=\sqrt{\pi_1^{-1}(S_T)-1},\qquad
a_0(S_T)=\sqrt{\pi_0^{-1}(X)-1}=\sqrt{\pi_0^{-1}(S_T)-1}.
\]
The expression (A4) equals
\[
E\{\mathrm{Var}(a_1g_1(Y_1)\mid X)\}-E\{\mathrm{Var}(a_1g_1(Y_1)\mid S_T)\}
=-\mathrm{Var}\{E(a_1g_1(Y_1)\mid X)\}+\mathrm{Var}\{E(a_1g_1(Y_1)\mid S_T)\}
=-E\big[\mathrm{Var}\{E(a_1g_1(Y_1)\mid X)\mid S_T\}\big].
\]
Similarly, the expression (A5) equals $-E[\mathrm{Var}\{E(a_0g_0(Y_0)\mid X)\mid S_T\}]$, and the expression (A6) equals $-2E[\mathrm{cov}\{E(g_0(Y_0)\mid X),E(g_1(Y_1)\mid X)\mid S_T\}]$. Since $a_0a_1=1$, we therefore have
\[
V_X-V_{S_T}=-E\big[\mathrm{Var}\{a_1E(g_1(Y_1)\mid X)+a_0E(g_0(Y_0)\mid X)\mid S_T\}\big]\le0,
\]
which completes the proof.
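The driving fact in this comparison is the law of total variance: conditioning on the richer set $X$ can only reduce the expected conditional variance relative to conditioning on the coarser $S_T$, since $E\{\mathrm{Var}(U\mid X)\}-E\{\mathrm{Var}(U\mid S_T)\}=-E[\mathrm{Var}\{E(U\mid X)\mid S_T\}]\le0$. A Monte Carlo illustration of this step (entirely our own toy design, with a discrete $X=(X_1,X_2)$ and $X_1$ playing the role of $S_T$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x1 = rng.integers(0, 4, n)                  # coarse covariate (role of S_T)
x2 = rng.integers(0, 4, n)                  # extra information contained in X
u = x1 + 0.5 * x2 + rng.normal(0, 1, n)     # a generic outcome-side quantity U

def e_cond_var(u, groups):
    """Monte Carlo E{Var(U | groups)}: probability-weighted within-group variance."""
    total = 0.0
    for k in np.unique(groups):
        m = groups == k
        total += m.mean() * u[m].var()
    return total

v_full = e_cond_var(u, x1 * 4 + x2)         # condition on all of X = (X1, X2)
v_coarse = e_cond_var(u, x1)                # condition on S_T = X1 only
assert v_full <= v_coarse                   # finer conditioning, smaller E{Var(U|.)}
```

Here the gap `v_coarse - v_full` estimates $E[\mathrm{Var}\{E(U\mid X)\mid S_T\}]$, the nonnegative quantity that appears (with a minus sign) in the expressions for (A4)-(A6).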

See Figure A1 for different choices of $S_k$, and Figure A2 for comparisons of the efficiency of the estimator $\hat\theta$ based on different $S_k$, $k=0,1$.

Appendix 5

Asymptotic variance comparisons between using $S_{Y_0,Y_1}$ and using $S_{Y_k,T}$

In this section, we prove that the asymptotic variance of $\hat\theta(S_{Y_0,Y_1},S_{Y_0,Y_1})$ is smaller than the asymptotic variance of $\hat\theta(S_{Y_0,T},S_{Y_1,T})$, which in turn is smaller than that of $\hat\theta(S_T,S_T)$. From the proof of Theorem 2.1, for all $S_k$ satisfying $T\perp Y_k\mid S_k$, the asymptotic variance of $\hat\theta(S_0,S_1)$ is
\[
\mathrm{Var}\{E(g_1(Y_1)\mid S_1)-E(g_0(Y_0)\mid S_0)\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_1)}{\pi_1(S_1)}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_0)}{\pi_0(S_0)}.
\]
For $\hat\theta(S_{Y_0,Y_1},S_{Y_0,Y_1})$ and $\hat\theta(S_{Y_0,T},S_{Y_1,T})$, the asymptotic variances $V_{S_{Y_0,Y_1}}$ and $V_{S_{Y_0,T},S_{Y_1,T}}$ are
\[
V_{S_{Y_0,Y_1}}
=\mathrm{Var}\{E(g_1(Y_1)\mid S_{Y_0,Y_1})-E(g_0(Y_0)\mid S_{Y_0,Y_1})\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_0,Y_1})}{\pi_1(S_{Y_0,Y_1})}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_{Y_0,Y_1})}{\pi_0(S_{Y_0,Y_1})}
=\mathrm{Var}\{E(g_1(Y_1)\mid S_{Y_1})-E(g_0(Y_0)\mid S_{Y_0})\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_{Y_0,Y_1})}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_{Y_0})}{\pi_0(S_{Y_0,Y_1})},
\]
\[
V_{S_{Y_0,T},S_{Y_1,T}}
=\mathrm{Var}\{E(g_1(Y_1)\mid S_{Y_1,T})-E(g_0(Y_0)\mid S_{Y_0,T})\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1,T})}{\pi_1(S_{Y_1,T})}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_{Y_0,T})}{\pi_0(S_{Y_0,T})}
=\mathrm{Var}\{E(g_1(Y_1)\mid S_{Y_1})-E(g_0(Y_0)\mid S_{Y_0})\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_{Y_1,T})}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_{Y_0})}{\pi_0(S_{Y_0,T})}
=\mathrm{Var}\{E(g_1(Y_1)\mid S_{Y_1})-E(g_0(Y_0)\mid S_{Y_0})\}
+E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_T)}
+E\frac{\mathrm{Var}(g_0(Y_0)\mid S_{Y_0})}{\pi_0(S_T)}.
\]
Thus we only need to prove that
\[
E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_T)}\ge E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_{Y_0,Y_1})}.
\]
By Jensen's inequality,
\[
E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_T)}
=E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(X)}
=E\Big[E\Big\{\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(X)}\,\Big|\,S_{Y_0,Y_1}\Big\}\Big]
=E\Big[\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})\,E\Big\{\frac{1}{\pi_1(X)}\,\Big|\,S_{Y_0,Y_1}\Big\}\Big]
\ge E\Big[\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{E\{\pi_1(X)\mid S_{Y_0,Y_1}\}}\Big]
=E\frac{\mathrm{Var}(g_1(Y_1)\mid S_{Y_1})}{\pi_1(S_{Y_0,Y_1})}.
\]
Hence $\hat\theta(S_{Y_0,Y_1},S_{Y_0,Y_1})$ is asymptotically more efficient than $\hat\theta(S_{Y_0,T},S_{Y_1,T})$. Note that from Lemma A.1 we have $V_{S_{Y_0,T},S_{Y_1,T}}\le V_{X,X}$, i.e. $\hat\theta(S_{Y_0,T},S_{Y_1,T})$ is more efficient than $\hat\theta(X,X)$. The other result then follows since $\hat\theta(X,X)$ is more efficient than $\hat\theta(S_T,S_T)$, which is proved in Appendix 4.

Figure A1. Five choices of $S_k$ in the space of all linear combinations of $X$. For $(S_{Y_0},S_{Y_1})$ and $(S_{Y_0,T},S_{Y_1,T})$, the first row is $S_0$ for estimating $Y_0$ characteristics, and the second row is $S_1$ for estimating $Y_1$ characteristics.

Figure A2. Relative efficiencies of estimators. Solid arrow from A to B means that A is more asymptotically efficient than B. Dashed arrow from A to B means that empirically A is more efficient than B.
