Quantile treatment effect estimation with dimension reduction

Ying Zhang, Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

Lei Wang (corresponding author), School of Statistics and Data Science & LPMC, Nankai University, Tianjin, People's Republic of China

Menggang Yu, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA

Jun Shao, Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

Abstract

Quantile treatment effects can be important causal estimands in the evaluation of biomedical treatments or interventions for health outcomes such as medical cost and utilisation. We consider their estimation in observational studies with many possible covariates under the assumption that treatment and potential outcomes are independent conditional on all covariates. To obtain valid and efficient treatment effect estimators, we replace the set of all covariates by lower dimensional sets for estimation of the quantiles of potential outcomes. These lower dimensional sets are obtained using sufficient dimension reduction tools and are outcome specific. We justify our choice from an efficiency point of view. We prove the asymptotic normality of our estimators, and our theory is complemented by simulation results and an application to data from the University of Wisconsin Health Accountable Care Organization.

1. Introduction

Causal evaluation of a treatment or intervention is commonly done by estimating the average treatment effect (ATE). However, for health outcomes such as medical cost and utilisation, the quantile treatment effect (QTE) may be more relevant and informative (Abadie, Angrist, & Imbens, 2002; Cattaneo, 2010; Chernozhukov & Hansen, 2005; Doksum, 1974; Firpo, 2007; Frölich & Melly, 2010, 2013; Lehman, 1975). As such outcomes tend to be highly skewed to the right, the ATE may not be a suitable location parameter. Furthermore, it is often important to learn about distributional impacts beyond the ATE, such as the effects on upper (or lower) quantiles of an outcome, which may be of direct interest to policy makers and other stakeholders of a programme.

Our study of QTE is motivated by the following investigation at the University of Wisconsin (UW) Health System. As of January 1, 2013, the UW Health System became an Accountable Care Organization (ACO), which is a network of doctors, clinics and other health care providers that share financial and medical responsibility for providing coordinated care to patients in hopes of limiting unnecessary spending. One strategy pursued by nearly all ACOs is to manage the care of ‘high-need, high-cost’ patients: those with multiple or complex conditions, often combined with behavioural health problems or socioeconomic challenges. We were asked to evaluate a particular intervention used by the UW Health System. If the intervention can reduce the upper quantiles of health care utilisation, quantified by medical cost, then the next step is to significantly enhance the nurse team so that the intervention can be extended to a wider population. In essence, we need to estimate QTEs, particularly at an upper level.

To define QTE, we begin with some notation. Let T be a binary treatment indicator, X be a p-dimensional vector of pretreatment covariates, and $Y_{0}$ and $Y_{1}$ be the potential outcomes under treatments T = 0 and T = 1, respectively. Since only one treatment is applied, either $Y_{1}$ or $Y_{0}$ is observed, but not both, i.e. what we observe is $Z = T Y_{1} + (1 - T) Y_{0}$. With a fixed $τ \in (0, 1)$, the $100τ\%$ QTE is defined as $θ = q_{1, τ} - q_{0, τ}$, where $q_{k, τ}$ is the τth quantile of $Y_{k}$, k = 0, 1; e.g. $τ = 0.5$, 0.25, and 0.75 give the difference of medians, lower quartiles, and upper quartiles, respectively. We focus on the estimation of θ based on a random sample $\{(Z_{i}, X_{i}, T_{i}) : i = 1, \dots, n\}$ of n observations from $(Z, X, T)$.
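As a small illustration of the definition above, the following sketch computes θ at several levels τ for a toy case in which both potential outcomes are known for every unit. That is possible only in simulation, never in observed data. The helper names and the toy values are hypothetical and are not taken from the study.

```python
import math

def empirical_quantile(values, tau):
    """Smallest order statistic whose empirical CDF is at least tau."""
    ordered = sorted(values)
    n = len(ordered)
    # 1-based rank ceil(n * tau), clipped to the valid range [1, n]
    rank = min(n, max(1, math.ceil(n * tau)))
    return ordered[rank - 1]

def oracle_qte(y0, y1, tau):
    """theta = q_{1,tau} - q_{0,tau}, using both potential-outcome samples."""
    return empirical_quantile(y1, tau) - empirical_quantile(y0, tau)

# Hypothetical right-skewed toy outcomes (not data from the paper):
# the treatment only changes the largest value.
y0 = [1, 2, 3, 4, 5, 6, 7, 40]
y1 = [1, 2, 3, 4, 5, 6, 7, 20]

qte_by_tau = {tau: oracle_qte(y0, y1, tau) for tau in (0.25, 0.5, 0.75, 0.9)}
mean_difference = sum(y1) / len(y1) - sum(y0) / len(y0)
```

In this toy case the effect is zero at the quartiles and the median but large in the upper tail, which a single mean difference cannot reveal.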

Because we only observe Z, θ is generally not identifiable without further conditions. Throughout, we make the following assumption, which is believed to be reasonable in many applications (Rosenbaum & Rubin, 1983): $T ⊥ (Y_{0}, Y_{1}) | X$, i.e. T and the vector of potential outcomes $(Y_{0}, Y_{1})$ are independent conditional on X. This is similar to the ignorable missingness assumption when we treat T as a missingness indicator and the unobserved $Y_{0}$ or $Y_{1}$ as a missing value. Under this assumption, two types of consistent estimators of the QTE θ have been proposed in the causal inference literature and in the closely related missing data literature. One type is derived through regression on $(T = k, X)$ for k = 0, 1 (Cattaneo, 2010; Chen, Wan, & Zhou, 2015; Cheng & Chu, 1996; Zhou, Wan, & Wang, 2008), and the other type is based on inverse propensity weighting with propensity score $P (T = 1 | X)$ (Firpo, 2007). A review is given by Cattaneo, Drukker, and Holland (2013). Since parametric methods rely on correct model specification, nonparametric estimation of the regression functions or the propensity score is often preferred and is therefore considered in what follows.

In our ACO data, however, the dimension p of X is high, and nonparametric estimation of the regression or propensity function using, for example, the kernel method is asymptotically inefficient when $Y_{k}$ is related to only a lower dimensional function of X. Unnecessarily using a high dimensional X may also make kernel estimation numerically unstable. Our main task is to study covariate dimension reduction that facilitates stable and efficient estimation of QTE.

If inverse propensity weighting is applied, it seems natural for covariate dimension reduction to seek a linear function $S_{T}$ of X with the smallest dimension such that $T ⊥ X | S_{T}$. Unfortunately, Hahn (1998) indicated that in the estimation of ATE, using such an $S_{T}$ provides no improvement in estimation efficiency over using the entire X. Because the outcome $(Y_{1}, Y_{0})$ is involved in the estimation of ATE, Hahn (2004) suggested finding a linear function $S_{Y_{0}, Y_{1}}$ of X with the smallest possible dimension such that $(Y_{0}, Y_{1}) ⊥ X | S_{Y_{0}, Y_{1}}$, which also implies $T ⊥ (Y_{0}, Y_{1}) | S_{Y_{0}, Y_{1}}$. The resulting ATE estimator is asymptotically more efficient than the estimator using the entire X unless $S_{Y_{0}, Y_{1}} = X$. De Luna, Waernbaum, and Richardson (2011) further considered an $S_{\min}$ which removes the components in $S_{Y_{0}, Y_{1}}$ that are unrelated to T. This $S_{\min}$ is the smallest dimensional $S \subseteq S_{Y_{0}, Y_{1}}$ that satisfies $T ⊥ S_{Y_{0}, Y_{1}} | S$, which also implies $T ⊥ (Y_{0}, Y_{1}) | S_{\min}$. However, it is proved in the Appendix that the asymptotic variance using $S_{\min}$ is larger than that using $S_{Y_{0}, Y_{1}}$ unless $S_{\min} = S_{Y_{0}, Y_{1}}$; see also Brookhart et al. (2006) and Shortreed and Ertefaie (2017).

Note that the estimation of $θ = q_{1, τ} - q_{0, τ}$ can be done by estimating $q_{0, τ}$ and $q_{1, τ}$ separately and then taking the difference. If a linear function $S_{Y_{k}}$ of X satisfies $Y_{k} ⊥ X | S_{Y_{k}}$ and has the smallest dimension, then $S_{Y_{k}}$ has a dimension no larger than that of $S_{Y_{0}, Y_{1}}$ defined in Hahn (2004), k = 0, 1. Hence, our approach does more to alleviate the curse of dimensionality and produces an asymptotically more efficient estimator of θ.

In applications, $S_{Y_{0}}$ and $S_{Y_{1}}$ have to be estimated using observed data. We adopt the existing nonparametric sufficient dimension reduction methods (Cook & Weisberg, 1991; Li, 1991; Xia, Tong, Li, & Zhu, 2002) to construct estimators ${\hat{S}}_{Y_{k}}$ of $S_{Y_{k}}$. We establish the asymptotic normality of our estimator of θ based on ${\hat{S}}_{Y_{0}}$ and ${\hat{S}}_{Y_{1}}$, and compare its efficiency with an asymptotic efficiency bound. We also compare the performances of various estimators in simulation studies and apply our method to the medical cost data from the UW Health System.

2. Methods

Without dimension reduction, three types of nonparametric estimators for θ have been proposed in the literature. The inverse propensity weighting (IPW) method (Firpo, 2007) is a weighted version of the procedure in Koenker and Bassett (1978) for the quantile estimation problem.

The regression (REG) method (Cattaneo, 2010; Chen et al., 2015) estimates the function $m_{k} (x, t) = E \{ρ (Y_{k}, t) | X = x\} = E \{ρ (Z, t) | T = k, X = x\}$ by ${\hat{m}}_{k} (x, t)$ using a nonparametric method and data under T = k for k = 0, 1 separately, where $ρ (s, t) = (s - t) (τ - 1 \{s \leq t\})$ is the check function (Koenker & Bassett, 1978) and $1 \{\cdot\}$ is the indicator function. Finally, Cattaneo et al. (2013) and Chen et al. (2015) combined IPW and REG to obtain the so-called augmented inverse propensity weighting (AIPW) estimator.
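The role of the check function can be seen in a minimal sketch: minimising the summed check loss over t recovers a sample quantile. The code below is an unweighted illustration with hypothetical helper names and toy values. It is not the estimator used in the paper.

```python
def check_loss(s, t, tau):
    """rho(s, t) = (s - t) * (tau - 1{s <= t})."""
    indicator = 1.0 if s <= t else 0.0
    return (s - t) * (tau - indicator)

def quantile_by_check_loss(values, tau):
    """Minimise sum_i rho(values_i, t) over t by searching the data points."""
    # The objective is convex and piecewise linear in t with kinks at the
    # data points, so a minimiser can always be found among them.
    def objective(t):
        return sum(check_loss(s, t, tau) for s in values)
    return min(sorted(values), key=objective)

sample = [1, 2, 3, 4, 100]  # hypothetical values with one large outlier
median = quantile_by_check_loss(sample, 0.5)
lower_quartile = quantile_by_check_loss(sample, 0.25)
```

The outlier at 100 moves the sample mean to 22 but leaves both quantiles unchanged, which is the robustness that motivates quantile-based estimands for skewed cost data.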

For each k, let $S_{Y_{k}} = B_{k}^{⊤} X$ with $Y_{k} ⊥ X | S_{Y_{k}}$, where $B_{k}^{⊤}$ denotes the transpose of a $p \times d_{k}$ deterministic matrix with the smallest possible $d_{k}$, k = 0, 1. As a consequence of Theorem 2.1 stated below, estimators using $S_{Y_{k}}$ as covariate sets are asymptotically more efficient than those using X as the covariate set when $d_{k} < p$ (if $d_{k} = p$, then $S_{Y_{k}} = X$). In the estimation of ATE, Hahn (2004) recommended replacing X by $S_{Y_{0}, Y_{1}}$, but the dimension of $S_{Y_{0}, Y_{1}}$ is no smaller than that of $S_{Y_{k}}$, which leads to an efficiency loss as a consequence of Theorem 2.1.

In applications, $S_{Y_{k}}$ has to be estimated by ${\hat{S}}_{Y_{k}} = {\hat{B}}_{k}^{⊤} X$, and we adopt a nonparametric sufficient dimension reduction method to construct ${\hat{B}}_{k}$ (Cook & Weisberg, 1991; Li, 1991; Ma & Zhu, 2012; Xia et al., 2002). Since the distribution of $Y_{k} | X$ is the same as that of $Z | X, T = k$, we separately estimate $S_{Y_{k}}$ using the observed data $(Z_{i}, X_{i})$ in group T = k. To estimate the dimensions of $S_{Y_{0}}$ and $S_{Y_{1}}$, we adopt consistent criteria such as the BIC-type criteria introduced by Zhu, Zhu, and Feng (2010) and bootstrap-based criteria.

Let ${\hat{S}}_{Y_{k}, i} = {\hat{B}}_{k}^{⊤} X_{i}$, $i = 1, \dots, n$, k = 0, 1. In our IPW method, we estimate the propensity $π_{k} (s_{k}) = P (T = k | S_{Y_{k}} = s_{k})$ by ${\hat{π}}_{k} (s_{k})$ using a nonparametric method for $k = 0, 1$ separately. The IPW estimator of θ is ${\hat{θ}}_{IPW} = {\hat{q}}_{1, τ}^{IPW} - {\hat{q}}_{0, τ}^{IPW}$, where

$ {\hat{q}}_{k, τ}^{IPW} = \arg\min_{t} \sum_{i = 1}^{n} \frac{T_{i}^{(k)} ρ (Z_{i}, t)}{{\hat{π}}_{k} ({\hat{S}}_{Y_{k}, i})}, \quad k = 0, 1, $ (1)

and $T_{i}^{(1)} = T_{i}$, $T_{i}^{(0)} = 1 - T_{i}$.
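The minimisation in (1) can be sketched as a weighted sample quantile: the weighted check loss is minimised at the first observation where the weighted empirical distribution function reaches τ. In the sketch below the propensities are simply supplied as numbers, which is an assumption for illustration. In the method described here they would come from a nonparametric fit on the reduced covariates. All names and values are hypothetical.

```python
def ipw_quantile(z, group_indicator, propensity, tau):
    """Weighted quantile minimising sum_i T_i^{(k)} rho(Z_i, t) / pi_i over t."""
    # Units outside group k get weight zero; the rest get weight 1 / pi_i.
    weighted = sorted(
        (zi, 1.0 / pi)
        for zi, ti, pi in zip(z, group_indicator, propensity)
        if ti == 1
    )
    total = sum(w for _, w in weighted)
    cumulative = 0.0
    for zi, w in weighted:
        cumulative += w
        # first point where the weighted empirical CDF reaches tau
        if cumulative >= tau * total:
            return zi
    return weighted[-1][0]

# Hypothetical toy data: observed outcome, treatment, and P(T = 1 | S).
z = [5, 7, 9, 20, 1, 2, 3, 4]
t1 = [1, 1, 1, 1, 0, 0, 0, 0]
pi1 = [0.5, 0.5, 0.25, 0.5, 0.5, 0.5, 0.75, 0.5]

t0 = [1 - ti for ti in t1]    # T^{(0)} = 1 - T
pi0 = [1.0 - p for p in pi1]  # P(T = 0 | S) = 1 - P(T = 1 | S)

q1_hat = ipw_quantile(z, t1, pi1, 0.5)
q0_hat = ipw_quantile(z, t0, pi0, 0.5)
theta_ipw = q1_hat - q0_hat
```

Units with a small propensity of receiving their observed treatment are up-weighted, so the treated unit with outcome 9 and propensity 0.25 pulls the weighted median above the unweighted one.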

The REG method estimates $m_{k} (s_{k}, t) = E \{ρ (Y_{k}, t) | S_{Y_{k}} = s_{k}\}$ by ${\hat{m}}_{k} (s_{k}, t)$ using a nonparametric method for k = 0, 1 separately, and estimates θ by ${\hat{θ}}_{REG} = {\hat{q}}_{1, τ}^{REG} - {\hat{q}}_{0, τ}^{REG}$, where

$ {\hat{q}}_{k, τ}^{REG} = \arg\min_{t} \sum_{i = 1}^{n} {\hat{m}}_{k} ({\hat{S}}_{Y_{k}, i}, t), \quad k = 0, 1. $ (2)

We can combine IPW and REG to obtain our AIPW estimator, ${\hat{θ}}_{AIPW} = {\hat{q}}_{1, τ}^{AIPW} - {\hat{q}}_{0, τ}^{AIPW}$, where

$ {\hat{q}}_{k, τ}^{AIPW} = \arg\min_{t} \sum_{i = 1}^{n} \left[ \frac{T_{i}^{(k)} ρ (Z_{i}, t)}{{\hat{π}}_{k} ({\hat{S}}_{Y_{k}, i})} - \frac{T_{i}^{(k)} - {\hat{π}}_{k} ({\hat{S}}_{Y_{k}, i})}{{\hat{π}}_{k} ({\hat{S}}_{Y_{k}, i})} {\hat{m}}_{k} ({\hat{S}}_{Y_{k}, i}, t) \right], \quad k = 0, 1. $ (3)

To estimate $m_{k} (s_{k}, t)$ and $π_{k} (s_{k})$ in (1)–(3), we use the nonparametric kernel estimators (Silverman, 1986):

$ {\hat{m}}_{k} (s_{k}, t) = \frac{\sum_{i = 1}^{n} T_{i}^{(k)} ρ (Z_{i}, t) K_{H_{k}} ({\hat{S}}_{Y_{k}, i} - s_{k})}{\sum_{i = 1}^{n} T_{i}^{(k)} K_{H_{k}} ({\hat{S}}_{Y_{k}, i} - s_{k})}, \qquad {\hat{π}}_{k} (s_{k}) = \frac{\sum_{i = 1}^{n} T_{i}^{(k)} K_{H_{k}} ({\hat{S}}_{Y_{k}, i} - s_{k})}{\sum_{i = 1}^{n} K_{H_{k}} ({\hat{S}}_{Y_{k}, i} - s_{k})}, \quad k = 0, 1, $

where $K_{H_{k}} (s_{k}) = \det (H_{k}^{- 1}) K_{k} (H_{k}^{- 1} s_{k})$, $K_{k} (\cdot)$ is a $d_{k}$-dimensional kernel function, $d_{k}$ is the dimension of ${\hat{S}}_{Y_{k}}$, and $H_{k}$ is the bandwidth matrix. When ${\hat{S}}_{Y_{k}}$ is standardised, we consider $H_{k} = h_{k n} I_{d_{k}}$ with scalar bandwidth $h_{k n}$ and identity matrix $I_{d_{k}}$ (Hardle, Muller, Sperlich, & Werwatz, 2004). As in Hu, Follmann, and Wang (2014), the nonparametric kernel estimators are computed using the rth order Gaussian product kernel with standardised covariates. The bandwidth used here is $h_{k n} = C n^{- 2 / (2 r_{k} + d_{k})}$, where $r_{k}$ is the order of $K_{k}$, k = 0, 1. To determine the constant C, we adopt J-fold cross validation, i.e. we select the C that minimises

$ \sum_{j = 1}^{J} (\hat{θ} - {\hat{θ}}_{- j})^{2}, $

where J is the total number of folds and ${\hat{θ}}_{- j}$ is the estimator of θ computed from all data except those in the jth fold, $j = 1, \dots, J$. We use J = 10 in our simulations in Section 3.
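The two kernel estimators above can be sketched for the simplest case of a scalar reduced covariate ($d_{k} = 1$) and a second-order Gaussian kernel. Both choices are simplifying assumptions made here for illustration, as are the function names and toy values. The normalising factor $\det (H_{k}^{-1})$ cancels in each ratio and is therefore omitted.

```python
import math

def gaussian_kernel(u):
    """Second-order Gaussian kernel (standard normal density)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def bandwidth(n, c, r, d):
    """h = C * n^{-2 / (2r + d)}, the rate quoted in the text."""
    return c * n ** (-2.0 / (2.0 * r + d))

def propensity_hat(s_obs, group_indicator, s0, h):
    """Kernel estimate of P(T = k | S = s0) for a scalar index S."""
    k_vals = [gaussian_kernel((si - s0) / h) for si in s_obs]
    numerator = sum(kv * ti for kv, ti in zip(k_vals, group_indicator))
    return numerator / sum(k_vals)

def check_regression_hat(s_obs, z, group_indicator, s0, t, tau, h):
    """Kernel estimate of E{rho(Y_k, t) | S = s0} using units in group k only."""
    k_vals = [
        gaussian_kernel((si - s0) / h) * ti
        for si, ti in zip(s_obs, group_indicator)
    ]
    losses = [(zi - t) * (tau - (1.0 if zi <= t else 0.0)) for zi in z]
    return sum(kv * li for kv, li in zip(k_vals, losses)) / sum(k_vals)

# Hypothetical toy inputs: index values, outcomes, and group membership.
s_obs = [-1.0, 0.0, 1.0]
z = [2.0, 5.0, 4.0]
t_k = [1, 0, 1]

pi_at_zero = propensity_hat(s_obs, t_k, s0=0.0, h=1.0)
m_at_zero = check_regression_hat(s_obs, z, t_k, s0=0.0, t=3.0, tau=0.5, h=1.0)
h_example = bandwidth(n=1000, c=1.0, r=2, d=1)
```

With very small bandwidths all kernel weights can underflow to zero and the ratios become undefined, so a practical implementation would need to guard against an empty denominator.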

The following theorem establishes the asymptotic normality of the estimators in (1)–(3) and assesses their efficiency.

Theorem 2.1

Assume the conditions stated in the Appendix. Let $\hat{θ} (S_{0}, S_{1})$ be one of ${\hat{θ}}_{IPW}$, ${\hat{θ}}_{REG}$, and ${\hat{θ}}_{AIPW}$ in (1)–(3) with ${\hat{S}}_{Y_{k}}$ replaced by $S_{k} = B_{k}^{⊤} X$ satisfying $Y_{k} ⊥ X | S_{k}$, k = 0, 1, and let $\hat{θ} ({\hat{S}}_{0}, {\hat{S}}_{1})$ be the same estimator with $S_{k}$ replaced by its estimator ${\hat{S}}_{k} = {\hat{B}}_{k}^{⊤} X$, where $\sqrt{n} \, \mathrm{vec} ({\hat{B}}_{k} - B_{k}) = n^{- 1 / 2} \sum_{i = 1}^{n} ψ_{k} (X_{i}, Z_{i}, T_{i}) + o_{p} (1)$ for some functions $ψ_{k}$ with $E (ψ_{k} (X, Z, T)) = 0$, k = 0, 1, $\mathrm{vec} (M)$ is a column vector whose components are the elements of a matrix M, and $o_{p} (1)$ denotes a quantity converging to 0 in probability. Then we have the following conclusions.

(i) $\sqrt{n} \{\hat{θ} (S_{0}, S_{1}) - θ\}$ is asymptotically normal with mean 0 and variance

$ V_{S_{0}, S_{1}}^{*} = \mathrm{var} \{E (g_{1} (Y_{1}) | S_{1}) - E (g_{0} (Y_{0}) | S_{0})\} + \sum_{k = 0, 1} E \left\{\frac{\mathrm{var} (g_{k} (Y_{k}) | S_{k})}{P (T = k | S_{k})}\right\}, $ (4)

where $g_{k} (Y_{k}) = - (1 \{Y_{k} \leq q_{k, τ}\} - τ) / f_{k} (q_{k, τ})$ and $f_{k}$ is the p.d.f. of $Y_{k}$, $k = 0, 1$.

(ii) $\sqrt{n} \{\hat{θ} ({\hat{S}}_{0}, {\hat{S}}_{1}) - θ\}$ is asymptotically normal with mean 0 and variance

$ V_{S_{0}, S_{1}} = V_{S_{0}, S_{1}}^{*} + \mathrm{var} \left\{\sum_{k = 0, 1} c_{k}^{⊤} ψ_{k} (X, Z, T)\right\} + 2 \, \mathrm{cov} \left\{\sum_{k = 0, 1} c_{k}^{⊤} ψ_{k} (X, Z, T), S (X, Z, T)\right\}, $ (5)

where

$ c_{k} = - \mathrm{vec} \left(E \left[\frac{\mathrm{cov} (X, T | S_{k})}{π_{k} (S_{k})} \left\{\frac{\partial E (g_{k} (Y_{k}) | S_{k})}{\partial S_{k}}\right\}^{⊤}\right]\right) $

and

$ S (X, Z, T) = \sum_{k = 0, 1} (- 1)^{(k - 1)} \left[\frac{T^{(k)}}{π_{k} (S_{k})} \{g_{k} (Y_{k}) - E (g_{k} (Y_{k}) | S_{k})\} + E (g_{k} (Y_{k}) | S_{k})\right]. $

Theorem 2.1(i) justifies our choice of $S_{k} = S_{Y_{k}}$. $V_{S_{0}, S_{1}}^{*}$ in (4), restated as (A1) in the Appendix, is in fact the semiparametric efficiency bound for estimating θ, following the ideas in Bickel, Klaassen, Ritov, and Wellner (1993), Hahn (1998) and Firpo (2007). Details can be found in Lemma A.1 in the Appendix. In practice, however, the IPW estimator may not be efficient enough, as it does not fully extract the information contained in the auxiliary variables, whereas the REG and AIPW estimators use all observed covariates to improve estimation efficiency.

By (4) and Jensen's inequality, among all linear functions $(S_{0}, S_{1})$ satisfying $Y_{k} ⊥ X | S_{k}$, k = 0, 1, $V_{S_{0}, S_{1}}^{*}$ is minimised when $S_{k}$ has the smallest possible dimension, i.e. $S_{k} = S_{Y_{k}}$, k = 0, 1. In particular, this applies to $S_{0} = S_{1} = S_{Y_{0}, Y_{1}}$ proposed in Hahn (2004), since the dimension of $S_{Y_{0}, Y_{1}}$ is no smaller than that of $S_{Y_{k}}$.
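The Jensen step can be sketched as follows. This is our reading of the argument, not the proof given in the Appendix. Assume, as is usual for a central subspace, that $S_{Y_{k}}$ is a function of any other linear reduction $S_{k}$ with $Y_{k} ⊥ X | S_{k}$. Then $E (g_{k} (Y_{k}) | S_{k}) = E (g_{k} (Y_{k}) | S_{Y_{k}})$ and $\mathrm{var} (g_{k} (Y_{k}) | S_{k}) = \mathrm{var} (g_{k} (Y_{k}) | S_{Y_{k}})$, so the first term of $V_{S_{0}, S_{1}}^{*}$ does not depend on the choice of reduction and only the propensity in the denominator changes. Writing $v_{k} = \mathrm{var} (g_{k} (Y_{k}) | S_{Y_{k}})$,

$\begin{aligned} E \left\{\frac{v_{k}}{P (T = k | S_{Y_{k}})}\right\} & = E \left\{\frac{v_{k}}{E [P (T = k | S_{k}) | S_{Y_{k}}]}\right\} \\ & \leq E \left\{v_{k} \, E \left[\frac{1}{P (T = k | S_{k})} \Big| S_{Y_{k}}\right]\right\} = E \left\{\frac{v_{k}}{P (T = k | S_{k})}\right\}. \end{aligned}$

The first equality is the tower property, the inequality is Jensen's inequality for the convex map $x \mapsto 1 / x$ on $(0, 1]$, and the last equality holds because $v_{k} \geq 0$ is a function of $S_{Y_{k}}$.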

The sum of the last two terms on the right hand side of (5) quantifies the price we may pay for estimating $B_{k}$ by ${\hat{B}}_{k}$. There is an efficiency loss due to estimating $S_{Y_{k}}$ by ${\hat{S}}_{Y_{k}}$ when this sum is positive, while it is possible that this sum is negative, so that we have an efficiency gain. If we further include the covariates related to T, i.e. consider $S_{Y_{k}, T}$ being the smallest possible dimensional $S_{k}$ satisfying $T ⊥ X | S_{k}$ and $Y_{k} ⊥ X | S_{k}$, k = 0, 1, then $\mathrm{cov} (X, T | S_{k}) = 0$ and $c_{k} = 0$; hence, $\hat{θ} ({\hat{S}}_{Y_{0}, T}, {\hat{S}}_{Y_{1}, T})$ and $\hat{θ} (S_{Y_{0}, T}, S_{Y_{1}, T})$ are asymptotically equivalent. However, it is generally not a good idea to use $({\hat{S}}_{Y_{0}, T}, {\hat{S}}_{Y_{1}, T})$, because each $S_{Y_{k}, T}$ has a dimension no smaller than that of $S_{Y_{k}}$ and therefore both $\hat{θ} ({\hat{S}}_{Y_{0}, T}, {\hat{S}}_{Y_{1}, T})$ and $\hat{θ} (S_{Y_{0}, T}, S_{Y_{1}, T})$ are less efficient than $\hat{θ} (S_{Y_{0}}, S_{Y_{1}})$ according to Theorem 2.1. Although $\hat{θ} ({\hat{S}}_{Y_{0}}, {\hat{S}}_{Y_{1}})$ may be less efficient than $\hat{θ} (S_{Y_{0}}, S_{Y_{1}})$ due to the estimation of $S_{Y_{k}}$, it may still be more efficient than $\hat{θ} ({\hat{S}}_{Y_{0}, T}, {\hat{S}}_{Y_{1}, T})$. Some simulation results are given in Section 3.

In Theorem 2.1, the condition $\sqrt{n} \, \mathrm{vec} ({\hat{B}}_{k} - B_{k}) = n^{- 1 / 2} \sum_{i = 1}^{n} ψ_{k} (X_{i}, Z_{i}, T_{i}) + o_{p} (1)$ with $E (ψ_{k} (X, Z, T)) = 0$ is satisfied for ${\hat{B}}_{k}$ obtained using some sufficient dimension reduction methods (Hsing & Carroll, 1992; Zhu & Ng, 1995).

3. Simulation

We investigate the finite-sample performance of three estimators, ${\hat{θ}}_{IPW}$, ${\hat{θ}}_{REG}$, and ${\hat{θ}}_{AIPW}$, with four choices of linear functions $(S_{0}, S_{1})$: (1) $S_{k} = S_{Y_{k}}$, k = 0, 1, (2) $S_{0} = S_{1} = S_{Y_{0}, Y_{1}}$, (3) $S_{k} = S_{Y_{k}, T}$, k = 0, 1, and (4) $S_{0} = S_{1} = S_{T}$. For each choice of $(S_{0}, S_{1})$, we consider estimators using the true $(S_{0}, S_{1})$ as well as $({\hat{S}}_{0}, {\hat{S}}_{1})$ obtained by sufficient dimension reduction. Thus, we consider a total of $3 \times 4 \times 2 = 24$ cases. In each case, we consider the estimation of the QTEs with $τ = 25\%$, $50\%$, and $75\%$, under two different sample sizes, n = 200 and n = 1000.

In the first simulation, $X = (X_{1}, \dots, X_{7})^{⊤}$ with independent $N (0, 1)$ components, $P (T = 1 | X) = \exp (2 X_{4}) \{1 + \exp (2 X_{4})\}^{- 1}$, $Y_{0} = 3 X_{1} + 6 X_{2} + 3 X_{3} + ϵ_{0}$, and $Y_{1} = 10 + 3 X_{1} + 6 X_{2} + 3 X_{3} + 3 X_{4} + ϵ_{1}$, where the $ϵ_{k}$'s are independent $N (0, 1)$ and are independent of X. The outcome models are linear in X, the treatment model is logistic, and the log-conditional treatment odds is linear in X. Under this model, $S_{Y_{0}}$, $S_{Y_{1}}$, and $S_{T}$ are all one-dimensional, while $S_{Y_{0}, Y_{1}} = S_{Y_{1}, T} = S_{Y_{0}, T}$ is two-dimensional.
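The first design can be sketched in code as below. The function names and the seed are hypothetical. The closed form for the true QTE is our own derivation from the stated model, not a figure from the paper: $Y_{0} \sim N (0, 55)$ and $Y_{1} \sim N (10, 64)$, so $θ = 10 + (8 - \sqrt{55}) z_{τ}$ with $z_{τ}$ the standard normal quantile.

```python
import math
import random
from statistics import NormalDist

def simulate_design_one(n, rng):
    """Draw n observations (Z, X, T) from the first simulation design."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(7)]
        # logistic treatment model depending on X_4 only (index 3)
        p_treat = 1.0 / (1.0 + math.exp(-2.0 * x[3]))
        t = 1 if rng.random() < p_treat else 0
        shared = 3.0 * x[0] + 6.0 * x[1] + 3.0 * x[2]
        y0 = shared + rng.gauss(0.0, 1.0)
        y1 = 10.0 + shared + 3.0 * x[3] + rng.gauss(0.0, 1.0)
        z = t * y1 + (1 - t) * y0  # only one potential outcome is observed
        rows.append((z, x, t))
    return rows

def true_qte_design_one(tau):
    """theta = q_{1,tau} - q_{0,tau} with Y0 ~ N(0, 55) and Y1 ~ N(10, 64)."""
    z_tau = NormalDist().inv_cdf(tau)
    return 10.0 + (math.sqrt(64.0) - math.sqrt(55.0)) * z_tau

rng = random.Random(2024)  # arbitrary seed for reproducibility
data = simulate_design_one(2000, rng)
treated_share = sum(t for _, _, t in data) / len(data)
targets = {tau: true_qte_design_one(tau) for tau in (0.25, 0.5, 0.75)}
```

Because the two potential outcomes have slightly different variances, the target θ varies with τ even in this linear design.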

In the second simulation, $X = (X_{1}, \dots, X_{7})^{⊤}$ with independent $N (0, 1)$ components, $P (T = 1 | X) = \exp (- 2 X_{5} + 0.7 X_{6}^{2} - 0.5 X_{7}^{2}) \{1 + \exp (- 2 X_{5} + 0.7 X_{6}^{2} - 0.5 X_{7}^{2})\}^{- 1}$, $Y_{0} = 3 (X_{1} + X_{2} + 2 X_{3} + 2 X_{4}) + 1.5 X_{6}^{2} + ϵ_{0}$, and $Y_{1} = 12 + 3 (X_{1} + X_{2} + 2 X_{3} + X_{4} + X_{5}) + 1.5 X_{7}^{2} + ϵ_{1}$, where the $ϵ_{k}$'s are independent $N (0, 1)$ and are independent of X. The outcome models are nonlinear in X, the treatment model is logistic, and the log-conditional treatment odds is nonlinear in X. Under this setting, each $S_{Y_{k}}$ is two-dimensional, $S_{T}$ is three-dimensional, while $S_{Y_{0}, Y_{1}}$, $S_{Y_{1}, T}$, and $S_{Y_{0}, T}$ are four-dimensional and not the same.

Based on 1000 simulation runs, we calculate the simulated relative bias (RB) and standard deviation (SD) in each scenario. The results for the two simulations are given in Tables 1 and 2, respectively. The following conclusions can be obtained from the simulation results in Tables 1 and 2.

When the true $(S_{0}, S_{1})$ is used, $(S_{Y_{0}}, S_{Y_{1}})$ leads to the best performance overall, followed by $S_{Y_{0}, Y_{1}}$ , $(S_{Y_{0}, T}, S_{Y_{1}, T})$ , and $S_{T}$ , in agreement with our asymptotic results discussed in Section 2 and proved in the Appendix.
When estimator $({\hat{S}}_{Y_{0}}, {\hat{S}}_{Y_{1}})$ is used, the resulting estimators of θ are in general less efficient than those based on the true $(S_{Y_{0}}, S_{Y_{1}})$ , but they are still better than the estimators based on other choices of $(S_{0}, S_{1})$ regardless of whether the true or estimated $(S_{0}, S_{1})$ is used.
The performances of estimators using the true $(S_{Y_{0}, T}, S_{Y_{1}, T})$ and $({\hat{S}}_{Y_{0}, T}, {\hat{S}}_{Y_{1}, T})$ are quite similar when n = 1000, in agreement with the asymptotic results in Theorem 2.1 and our discussion after Theorem 2.1. They are worse than those using $({\hat{S}}_{Y_{0}}, {\hat{S}}_{Y_{1}})$ .
Consistent with the asymptotic theory, the performance of estimators using $S_{T}$ is the worst, and the efficiency loss is substantial in most cases. Note that using estimated $S_{T}$ is actually better than using the true $S_{T}$ .
Regarding the three different estimation methods, ${\hat{θ}}_{REG}$ and ${\hat{θ}}_{AIPW}$ have very comparable performances and are recommended in practice.
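The two summary measures used in this section can be computed from a set of replicate estimates as sketched below. The text does not spell out the formulas, so the definitions here are assumptions: relative bias is taken as the mean estimate minus the true value, divided by the true value, and the standard deviation uses the n - 1 divisor. The replicate values are hypothetical.

```python
from statistics import mean, stdev

def relative_bias_and_sd(estimates, truth):
    """Return (RB, SD) of replicate estimates around a known true value."""
    rb = (mean(estimates) - truth) / truth
    sd = stdev(estimates)  # sample standard deviation (n - 1 divisor)
    return rb, sd

# Hypothetical replicate estimates of a QTE whose true value is 10
replicates = [9.0, 10.0, 11.0, 10.0]
rb, sd = relative_bias_and_sd(replicates, truth=10.0)
```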

Table 1. Relative bias and standard deviation for simulation 1 with true or estimated $S_{0}$ and $S_{1}$ .

Table 2. Relative bias and standard deviation for simulation 2 with true or estimated $S_{0}$ and $S_{1}$ .

4. Real data analysis

As mentioned in the introduction, the University of Wisconsin Health System became an Accountable Care Organization (ACO) and implemented a Complex Care Management (CCM) programme on January 1, 2013. In particular, a team of nurses takes responsibility for coordinating and implementing the care plans of complex patients. The CCM programme is very intensive in time and resources, and it is therefore important to evaluate its specific value.

We demonstrate the proposed estimation methods on a data set from the University of Wisconsin Health ACO study, where the primary outcome Z is the annualised payment amount in thousands of dollars. The data set consists of 894 patients, with 186 in the CCM group (T = 1) and 708 not in the CCM group (T = 0).

Two issues with this data set motivated our study. First, the distribution of annualised payment is right-skewed, as shown by the box plots in Figure 1 for all patients and for the two groups. The overall median, mean, 75% quantile, and maximum of observed payment are about 13, 31, 41, and 376 thousand dollars, respectively. This suggests the need for estimating quantile treatment effects. Second, the data set contains three discrete and ninety-four continuous covariates, including Medicare status, baseline payments, and other baseline characteristics of patients. Thus, dimension reduction is needed in nonparametric kernel estimation.

Figure 1. Boxplots of observed annualised payment amount (in thousands) for overall, CCM group T = 1, and non-CCM group T = 0.

For sufficient dimension reduction, we adopt the semiparametric directional regression method proposed by Ma and Zhu (2012). After sufficient dimension reduction, $S_{T}$ has 7 dimensions, $S_{Y_{0}}$ has 5 dimensions, $S_{Y_{1}}$ has 8 dimensions, and $S_{Y_{0}, Y_{1}}$ has 13 dimensions.

Results for three choices of $(S_{0}, S_{1})$ considered in the simulations are shown in Table 3 for estimating ATE and QTE with $τ = 25\%$, 50%, and 75%. Standard errors (SE) for all estimates are calculated using the bootstrap with 200 replications.
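A bootstrap standard error of the kind reported in Table 3 can be sketched as follows. The text does not describe the resampling scheme, so this is a plain nonparametric bootstrap that resamples whole records with replacement, which is an assumption. The estimator plugged in is a simple difference of group medians used only as a stand-in for the QTE estimators, and all names and toy values are hypothetical.

```python
import random
from statistics import median, stdev

def bootstrap_se(records, estimator, n_boot=200, seed=0):
    """Standard deviation of the estimator over n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(records)
    replicates = []
    for _ in range(n_boot):
        # resample whole records (outcome and treatment together)
        resample = [records[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return stdev(replicates)

def median_difference(records):
    """Stand-in estimator: treated median minus control median of (z, t) pairs."""
    treated = [z for z, t in records if t == 1]
    control = [z for z, t in records if t == 0]
    return median(treated) - median(control)

# Hypothetical toy records (z, t): 20 treated and 20 control units
toy = [(float(i), 1) for i in range(10, 30)] + [(float(i), 0) for i in range(0, 20)]
se_toy = bootstrap_se(toy, median_difference)
```

In a full analysis the dimension reduction and bandwidth selection would presumably be repeated inside each resample so that their variability is reflected in the SE. The sketch leaves that to the supplied estimator.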

Table 3. Estimates and standard errors (SE) for the University of Wisconsin Health ACO data.

From Table 3, the 25% and 50% QTEs are not significant under any method. When $S_{Y_{k}}$ or $S_{Y_{0}, Y_{1}}$ is used, the 75% QTE is significantly less than 0, and in terms of SE, the estimate using $S_{Y_{k}}$, k = 0, 1, is more efficient than the estimate using $S_{Y_{0}, Y_{1}}$. However, the estimate of the 75% QTE using $S_{T}$ has a large standard error, so the result is not significant.

Since the 75% QTE is significantly negative, the result indicates that receiving the CCM intervention helps reduce medical payment for high-cost patients. The CCM intervention is less useful for low-cost or median-cost patients, as the 25% and 50% QTEs are not significant. These results may be useful for decision making in the ACO.

For comparison, we also include estimates of ATE and their SEs. The results in the last block of Table 3 show that ATE is not significant under any method. It is interesting to see that the estimates of ATE are all negative whereas the estimates of the 50% QTE are all positive, although none is significant, which may be caused by the fact that the distribution of annualised payment is right-skewed.

The example shows the usefulness of assessing QTEs with different percentages. If we only estimate ATE, no useful conclusion can be made in this example. Even if we check 50% QTE instead of ATE because of the existing skewness, we still cannot obtain any useful conclusion.

Acknowledgments

We are grateful to the editor, the associate editor, and two anonymous referees for their insightful comments and suggestions, which have led to significant improvements.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

Our research was supported by the National Natural Science Foundation of China (11871287, 11831008), the Natural Science Foundation of Tianjin (18JCYBJC41100), the Fundamental Research Funds for the Central Universities, the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, the Chinese 111 Project (B14019), the U.S. National Science Foundation (DMS-1612873 and DMS-1914411). This research was partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1409-21219).

Notes on contributors

Ying Zhang

Ying Zhang is a Ph.D. candidate, Department of Statistics, University of Wisconsin-Madison.

Lei Wang

Dr Lei Wang holds a Ph.D. in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.

Menggang Yu

Dr Menggang Yu holds a Ph.D. in biostatistics from the University of Michigan. He is now a professor of biostatistics at the University of Wisconsin-Madison. Besides developing statistical methodology related to cancer research and clinical trials, Dr Yu is also very interested in health services research.

Jun Shao

Dr Jun Shao holds a Ph.D. in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

References

Abadie, A., Angrist, J., & Imbens, G. W. (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70, 91–117. doi: 10.1111/1468-0262.00270
Bickel, P. J., Klaassen, C. J., Ritov, Y., & Wellner, J. (1993). Efficient and adaptive inference in semiparametric models. Baltimore: Johns Hopkins University Press.
Brookhart, M., Schneeweiss, S., Rothman, K., Glynn, R., Avorn, J., & Sturmer, T. (2006). Variable selection for propensity score models. American Journal of Epidemiology, 163, 1149–1156. doi: 10.1093/aje/kwj149
Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155, 138–154. doi: 10.1016/j.jeconom.2009.09.023
Cattaneo, M. D., Drukker, D. M., & Holland, A. D. (2013). Estimation of multivalued treatment effects under conditional independence. The Stata Journal, 13, 407–450. doi: 10.1177/1536867X1301300301
Chen, X., Wan, A. T. K., & Zhou, Y. (2015). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110, 723–741. doi: 10.1080/01621459.2014.928219
Cheng, P. E., & Chu, C. (1996). Kernel estimation of distribution functions and quantiles with missing data. Statistica Sinica, 6, 63–78.
Chernozhukov, V., & Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73, 245–261. doi: 10.1111/j.1468-0262.2005.00570.x
Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 328–332.
De Luna, X., Waernbaum, I., & Richardson, T. S. (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, 98, 861–875. doi: 10.1093/biomet/asr041
Doksum, K. (1974). Empirical probability plots and statistical inference for nonlinear models in the two-sample case. The Annals of Statistics, 2, 267–277. doi: 10.1214/aos/1176342662
Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75, 259–276. doi: 10.1111/j.1468-0262.2007.00738.x
Frölich, M., & Melly, B. (2010). Estimation of quantile treatment effects with Stata. The Stata Journal, 10, 423–457. doi: 10.1177/1536867X1001000309
Frölich, M., & Melly, B. (2013). Unconditional quantile treatment effects under endogeneity. Journal of Business and Economic Statistics, 31, 346–357. doi: 10.1080/07350015.2013.803869
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331. doi: 10.2307/2998560
Hahn, J. (2004). Functional restriction and efficiency in causal inference. Review of Economics and Statistics, 86, 73–76. doi: 10.1162/003465304323023688
Hardle, W., Muller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and semiparametric models. Heidelberg: Springer-Verlag.
Hsing, T., & Carroll, R. J. (1992). An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20, 1040–1061. doi: 10.1214/aos/1176348669
Hu, Z., Follmann, D. A., & Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika, 101, 613–624. doi: 10.1093/biomet/asu022
Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50. doi: 10.2307/1913643
Lehman, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco: Holden-Day.
Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327. doi: 10.1080/01621459.1991.10475035
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
Shortreed, S. M., & Ertefaie, A. (2017). Outcome-adaptive Lasso: Variable selection for causal inference. Biometrics, 73, 1111–1122. doi: 10.1111/biom.12679
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410. doi: 10.1111/1467-9868.03411
Zhou, Y., Wan, A. T. K., & Wang, X. (2008). Estimating equations inference with missing data. Journal of the American Statistical Association, 103, 1187–1199. doi: 10.1198/016214508000000535
Zhu, L. X., & Ng, K. W. (1995). Asymptotics of sliced inverse regression. Statistica Sinica, 5, 727–736.
Zhu, L. P., Zhu, L. X., & Feng, Z. H. (2010). Dimension reduction in regressions through cumulative slicing estimation. Journal of the American Statistical Association, 105, 1455–1466. doi: 10.1198/jasa.2010.tm09666

Appendices

Appendix 1. Semiparametric efficiency bound of estimating θ with $S_{k}$

Throughout the Appendix, $S$, $S_{k}$, $S_{Y_{k}}$, $S_{Y_{0}, Y_{1}}$, $S_{T}$ and $S_{\min}$ are linear functions of $X$, i.e. $S = B^{⊤} X$ with $B$ being a $p \times d$ matrix, $S_{k} = B_{k}^{⊤} X$ with $B_{k}$ being a $p \times d_{k}$ matrix, etc.

Lemma A.1

Assume $T ⊥ (Y_{0}, Y_{1}) | X$ and $Y_{k} ⊥ X | S_{k}$, and that the distribution of $Y_{k}$ has density $f_{k}$ with $f_{k}(q_{k, τ}) > 0$, $k = 0, 1$. A lower bound for the asymptotic variance of any asymptotically normal estimator of $θ = q_{1, τ} - q_{0, τ}$ is given by

$\begin{aligned} V_{S_{0}, S_{1}}^{*} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{1}) - E(g_{0}(Y_{0}) | S_{0})\} \\ & \quad + \sum_{k = 0, 1} E\left\{\frac{\mathrm{Var}(g_{k}(Y_{k}) | S_{k})}{P(T = k | S_{k})}\right\}, \end{aligned}$ (A1)

where $g_{k}(Y_{k}) = -(1\{Y_{k} \leq q_{k, τ}\} - τ) / f_{k}(q_{k, τ})$, $k = 0, 1$. If $Y_{k} ⊥ X | S_{k}$, $Y_{k} ⊥ X | S_{k}^{'}$ and $L(S_{k}) \subseteq L(S_{k}^{'})$, $k = 0, 1$, then $V_{S_{0}, S_{1}}^{*} \leq V_{S_{0}^{'}, S_{1}^{'}}^{*}$, where $L(S)$ denotes the linear space generated by the columns of $B$ for $S = B^{⊤} X$.
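The function $g_{k}$ in Lemma A.1 can be checked numerically. Because $1\{Y_{k} \leq q_{k, τ}\}$ is a Bernoulli($τ$) variable, $g_{k}(Y_{k})$ has mean 0 and variance $τ(1 - τ)/f_{k}(q_{k, τ})^{2}$. The snippet below is a minimal sketch under assumptions that are not in the original: a standard normal outcome, $τ = 0.5$, and the hypothetical helper name `influence_g`.

```python
import math
import numpy as np

def influence_g(y, q, tau, f_q):
    """g(y) = -(1{y <= q} - tau) / f(q), the function used in Lemma A.1."""
    return -((y <= q).astype(float) - tau) / f_q

# Assumed toy setting (not from the paper): Y ~ N(0, 1) and tau = 0.5,
# so the quantile is q = 0 and the density there is 1 / sqrt(2 * pi).
tau, q = 0.5, 0.0
f_q = 1.0 / math.sqrt(2.0 * math.pi)

rng = np.random.default_rng(0)
y = rng.standard_normal(200_000)
g = influence_g(y, q, tau, f_q)

sample_mean = g.mean()                    # should be close to 0
sample_var = g.var()                      # should be close to theory_var
theory_var = tau * (1.0 - tau) / f_q**2   # tau(1 - tau) / f(q)^2
```

In this toy setting the theoretical variance reduces to $π/2$ (here $π$ is the mathematical constant, not the propensity score); the variance terms in (A1) are conditional versions of the same quantity.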

Proof of Lemma A.1

Our derivation of the efficiency bound mimics the proof in Firpo (2007), which is a direct application of the semiparametric efficiency theory of Bickel et al. (1993). Following the proof of Firpo (2007), one may easily see that knowing $T | X = T | S_{T}$ does not change the semiparametric efficiency bound, which is similar to the ATE case in Hahn (1998). In our proof of Lemma A.1, one only needs to keep $S_{1}$ and $S_{0}$ carefully separate in the derivation. The construction of the efficient influence function is more involved algebraically, so we only provide a sketch of the proof for the case $S_{k} = S_{Y_{k}}$ here.

The density of $(Y_{0}, Y_{1}, T, X)$ at $(y_{0}, y_{1}, k, x)$ is

$q(y_{0}, y_{1}, k, x) = g(y_{0}, y_{1} | x) π(x)^{k} \{1 - π(x)\}^{1 - k} f(x),$

where $g(y_{0}, y_{1} | x)$ denotes the conditional density of $(Y_{0}, Y_{1})$ given $X$, $f(x)$ denotes the marginal density of $X$ and $π(x) = P(T = 1 | X = x)$. The density of $(Z, T, X)$ at $(z, k, x)$ is then equal to

$\begin{aligned} q(z, k, x) & = \{g_{1}(z | x) π(x)\}^{k} \{g_{0}(z | x)(1 - π(x))\}^{1 - k} f(x) \\ & = \{h_{1}(z | S_{y_{1}}) π(x)\}^{k} \{h_{0}(z | S_{y_{0}})(1 - π(x))\}^{1 - k} f(x), \end{aligned}$

where $g_{1}(\cdot | x) = \int g(y_{0}, \cdot | x) d y_{0}$ and $g_{0}(\cdot | x) = \int g(\cdot, y_{1} | x) d y_{1}$. The second equality holds because, by the definition of $S_{Y_{1}}$ and $S_{Y_{0}}$, there exist functions $h_{1}$ and $h_{0}$ such that $g_{1}(\cdot | x) = h_{1}(\cdot | S_{y_{1}})$ and $g_{0}(\cdot | x) = h_{0}(\cdot | S_{y_{0}})$. For a regular parametric submodel of $q(z, k, x)$ with parameter $ω$,

$q_{ω}(z, k, x) = \{h_{1}(z | S_{y_{1}}, ω) π(x, ω)\}^{k} \{h_{0}(z | S_{y_{0}}, ω)(1 - π(x, ω))\}^{1 - k} f(x, ω).$

The score function of this parametric submodel is

$\begin{aligned} s(z, k, x | ω) & = k s_{1}(z | S_{y_{1}}, ω) + (1 - k) s_{0}(z | S_{y_{0}}, ω) \\ & \quad + \frac{\{k - π(x, ω)\} \frac{\partial}{\partial ω} π(x, ω)}{π(x, ω) \{1 - π(x, ω)\}} + d(x, ω), \end{aligned}$

where $d(x, ω) = f(x, ω)^{-1} \partial f(x, ω) / \partial ω$, $s_{1}(z | S_{y_{1}}, ω) = h_{1}(z | S_{y_{1}}, ω)^{-1} \partial h_{1}(z | S_{y_{1}}, ω) / \partial ω$ and $s_{0}(z | S_{y_{0}}, ω) = h_{0}(z | S_{y_{0}}, ω)^{-1} \partial h_{0}(z | S_{y_{0}}, ω) / \partial ω$. Therefore, the tangent space is equal to

$\begin{aligned} \mathcal{T} = \{ & k s_{1}(z | S_{y_{1}}) + (1 - k) s_{0}(z | S_{y_{0}}) + a(x)(k - π(x)) + d(x) : \\ & (s_{0}, s_{1}, d, a) \text{ satisfies } \int s_{j}(z | S_{y_{j}}) h_{j}(z | S_{y_{j}}) d z = 0, \ j = 0, 1, \\ & \int d(x) f(x) d x = 0, \text{ and } a(x) \text{ is unrestricted}\}. \end{aligned}$

For the parametric submodel with parameter $ω$ under consideration, $q_{k, τ}(ω)$, the $τ$-th quantile of $Y_{k}$, $k = 0, 1$, satisfies

$0 = E_{ω}(1\{Y_{k} \leq q_{k, τ}(ω)\} - τ) = \int \int (1\{z \leq q_{k, τ}(ω)\} - τ) h_{k}(z | S_{Y_{k}}, ω) d z f(x, ω) d x.$

Let $θ(ω) = q_{1, τ}(ω) - q_{0, τ}(ω)$, and recall that $g_{k}(Y_{k}) = -(1\{Y_{k} \leq q_{k, τ}\} - τ) / f_{k}(q_{k, τ})$, $k = 0, 1$. By an application of Leibniz's rule,

$\begin{aligned} \frac{\partial θ(ω)}{\partial ω} & = \int \int g_{1}(z) s_{1}(z | S_{y_{1}}, ω) h_{1}(z | S_{y_{1}}, ω) f(x, ω) d z d x \\ & \quad + \int E_{ω}(g_{1}(z) - g_{0}(z) | X = x) d(x, ω) f(x, ω) d x \\ & \quad - \int \int g_{0}(z) s_{0}(z | S_{y_{0}}, ω) h_{0}(z | S_{y_{0}}, ω) f(x, ω) d z d x. \end{aligned}$

Let

$\begin{aligned} F(Z, T, X) & = \frac{T \{g_{1}(Z) - E(g_{1}(Z) | S_{Y_{1}})\}}{P(T = 1 | S_{Y_{1}})} - \frac{(1 - T) \{g_{0}(Z) - E(g_{0}(Z) | S_{Y_{0}})\}}{1 - P(T = 1 | S_{Y_{0}})} \\ & \quad + E(g_{1}(Z) - g_{0}(Z) | X), \end{aligned}$

and let the true value of the parameter be $ω = ω_{0}$, i.e. $θ = θ(ω_{0})$. Then we have

$\begin{aligned} E\{F(Z, T, X) s(Z, T, X | ω_{0})\} & = E\left[\frac{T \{g_{1}(Z) - E(g_{1}(Z) | S_{Y_{1}})\}}{P(T = 1 | S_{Y_{1}})} s(Z, T, X | ω_{0})\right] \\ & \quad - E\left[\frac{(1 - T) \{g_{0}(Z) - E(g_{0}(Z) | S_{Y_{0}})\}}{1 - P(T = 1 | S_{Y_{0}})} s(Z, T, X | ω_{0})\right] \\ & \quad + E[E(g_{1}(Z) - g_{0}(Z) | X) s(Z, T, X | ω_{0})]. \end{aligned}$ (A2)

For the three terms in (A2), after some algebra, we have, respectively,

$\begin{aligned} E\left[\frac{T \{g_{1}(Z) - E(g_{1}(Z) | S_{Y_{1}})\}}{P(T = 1 | S_{Y_{1}})} s(Z, T, X | ω_{0})\right] & = E[\{g_{1}(Y_{1}) - E(g_{1}(Y_{1}) | S_{Y_{1}})\} s_{1}(Y_{1} | S_{Y_{1}}, ω_{0})], \\ E\left[\frac{(1 - T) \{g_{0}(Z) - E(g_{0}(Z) | S_{Y_{0}})\}}{1 - P(T = 1 | S_{Y_{0}})} s(Z, T, X | ω_{0})\right] & = E[\{g_{0}(Y_{0}) - E(g_{0}(Y_{0}) | S_{Y_{0}})\} s_{0}(Y_{0} | S_{Y_{0}}, ω_{0})], \\ E[E(g_{1}(Z) - g_{0}(Z) | X) s(Z, T, X | ω_{0})] & = E\{E(g_{1}(Y_{1}) - g_{0}(Y_{0}) | X) d(X, ω_{0})\}. \end{aligned}$

Therefore, $E\{F(Z, T, X) s(Z, T, X | ω_{0})\} = \partial θ(ω_{0}) / \partial ω$. The efficiency bound is the expected square of the projection of $F$ on $\mathcal{T}$. Because $F \in \mathcal{T}$, the projection of $F$ on $\mathcal{T}$ is $F$ itself. The conclusion follows.

For the second part of Lemma A.1, suppose $S_{0}$ and $S_{1}$ satisfy $L(S_{0}) \supseteq L(S_{Y_{0}})$ and $L(S_{1}) \supseteq L(S_{Y_{1}})$. We have

$\begin{aligned} V_{S_{Y_{1}}, S_{Y_{0}}}^{*} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | X) - E(g_{0}(Y_{0}) | X)\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{P(T = 1 | S_{Y_{1}})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | X)}{P(T = 0 | S_{Y_{0}})}\right\}, \\ V_{S_{1}, S_{0}}^{*} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | X) - E(g_{0}(Y_{0}) | X)\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{P(T = 1 | S_{1})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | X)}{P(T = 0 | S_{0})}\right\}, \end{aligned}$

so we only need to prove

$E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{P(T = 1 | S_{Y_{1}})}\right\} \leq E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{P(T = 1 | S_{1})}\right\};$

the argument for $k = 0$ is identical. By Jensen's inequality, we have

$\frac{1}{E\{E(T | S_{1}) | S_{Y_{1}}\}} \leq E\left\{\frac{1}{E(T | S_{1})} \Big| S_{Y_{1}}\right\}.$

Thus the conclusion follows from the inequality below:

$\begin{aligned} E\left[\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{P(T = 1 | S_{Y_{1}})}\right] & = E\left[\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{E\{E(T | S_{1}) | S_{Y_{1}}\}}\right] \\ & \leq E\left[\mathrm{Var}(g_{1}(Y_{1}) | X) E\left\{\frac{1}{E(T | S_{1})} \Big| S_{Y_{1}}\right\}\right] \\ & = E\left[E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{E(T | S_{1})} \Big| S_{Y_{1}}\right\}\right] \\ & = E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{P(T = 1 | S_{1})}\right\}. \end{aligned}$
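The Jensen step used for the second part of Lemma A.1 can be illustrated numerically: within each level of a coarser summary, the reciprocal of the average propensity is no larger than the average of the reciprocal propensities. The snippet below is a rough sketch on simulated data; the arrays `fine_score` and `coarse_cell` are hypothetical stand-ins for $E(T | S_{1})$ and for the value of $S_{Y_{1}}$, and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical stand-ins (assumptions for illustration only):
fine_score = rng.uniform(0.1, 0.9, size=n)   # plays the role of E(T | S_1)
coarse_cell = rng.integers(0, 5, size=n)     # plays the role of the value of S_{Y_1}

gaps = []
for c in range(5):
    p = fine_score[coarse_cell == c]
    lhs = 1.0 / p.mean()      # 1 / E{E(T | S_1) | S_{Y_1} = c}
    rhs = (1.0 / p).mean()    # E{1 / E(T | S_1) | S_{Y_1} = c}
    gaps.append(rhs - lhs)    # non-negative by Jensen's inequality
```

Every entry of `gaps` is non-negative, which is the cell-by-cell form of the inequality; the gap is zero only when the propensity is constant within a cell.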

Appendix 2

Conditions for Theorem 2.1

(C1)	$Y_{k}$ is a continuous random variable and, for any fixed $τ \in (0, 1)$, there exists a unique $q_{k, τ}$ such that $P(Y_{k} \leq q_{k, τ}) = τ$, for $k = 0, 1$.
(C2)	$π_{k}(S_{k})$ is bounded away from 0 and 1.
(C3)	$S_{k}$ has compact support for $k = 0, 1$.
(C4)	The function $π_{k}(S_{k})$, the density function $f(S_{k})$ and $E(1\{Y_{k} \leq q_{k, τ}\} | S_{k})$ all have bounded partial derivatives with respect to $S_{k}$ up to order $r_{k}$, and $f(S_{k}) π_{k}(S_{k})$ is bounded away from 0.
(C5)	The kernel $K_{k}$ and its derivatives up to second order are bounded.
(C6)	The smoothing bandwidth $h_{k n}$ satisfies $n h_{k n}^{2} \to \infty$, $n h_{k n}^{d_{k}} \to \infty$ and $\sqrt{n} h_{k n}^{r_{k}} \to 0$ as $n \to \infty$, where $r_{k}$ is the order of the kernel $K_{k}$.
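To see what (C6) demands in practice, suppose the bandwidth decays polynomially, $h_{k n} = n^{-a}$ (an assumption made here for illustration; the conditions above do not require this form). The three rate conditions then become $a < 1/2$, $a < 1/d_{k}$ and $a > 1/(2 r_{k})$. The hypothetical helper below returns the admissible open interval for $a$, or `None` when the kernel order is too low.

```python
def bandwidth_exponent_range(d, r):
    """Open interval of exponents a such that h = n**(-a) meets (C6).

    n * h**2 -> infinity      requires a < 1/2
    n * h**d -> infinity      requires a < 1/d
    sqrt(n) * h**r -> 0       requires a > 1/(2r)
    Returns (lower, upper), or None if the interval is empty.
    """
    lower = 1.0 / (2.0 * r)
    upper = min(0.5, 1.0 / d)
    return (lower, upper) if lower < upper else None

one_dim_second_order = bandwidth_exponent_range(d=1, r=2)    # non-empty interval
four_dim_second_order = bandwidth_exponent_range(d=4, r=2)   # empty: needs higher order
four_dim_fourth_order = bandwidth_exponent_range(d=4, r=4)   # non-empty again
```

Under this assumed form, the interval is non-empty exactly when $r_{k} > \max(1, d_{k}/2)$, which is one way to read why reducing the dimension $d_{k}$ relaxes the requirement on the kernel order.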

Appendix 3

Proof of Theorem 2.1

(i) In the case $S_{k} = X$, Firpo (2007) proved the asymptotics of ${\hat{θ}}_{I P W}$ and Chen et al. (2015) proved the asymptotics of ${\hat{θ}}_{R E G}$, both using kernel methods. Following the proofs in their papers and substituting $S_{k}$ for $X$, $\sqrt{n}(\hat{θ}(S_{0}, S_{1}) - θ)$ is asymptotically equivalent to

$\begin{aligned} & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \left[\frac{T_{i} g_{1}(Z_{i})}{π_{1}(S_{1 i})} - \frac{E(g_{1}(Y_{1}) | S_{1 i}) \{T_{i} - π_{1}(S_{1 i})\}}{π_{1}(S_{1 i})}\right] \\ & \quad - \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \left[\frac{(1 - T_{i}) g_{0}(Z_{i})}{π_{0}(S_{0 i})} - \frac{E(g_{0}(Y_{0}) | S_{0 i}) \{(1 - T_{i}) - π_{0}(S_{0 i})\}}{π_{0}(S_{0 i})}\right] + o_{p}(1), \end{aligned}$ (A3)

where $S_{k i}$ is the $i$th observation of $S_{k}$ and $π_{k}(S_{k i}) = P(T = k | S_{k} = S_{k i})$ for $k = 0, 1$. By direct but tedious calculation, the covariance of the two summation terms in (A3) is $- E(g_{0}(Y_{0})) E(g_{1}(Y_{1})) + E\{E(g_{0}(Y_{0}) | S_{0}) E(g_{1}(Y_{1}) | S_{1})\}$. Their corresponding variances are

$\begin{aligned} & \mathrm{Var}\left[\frac{T g_{1}(Z)}{π_{1}(S_{1})} - \frac{E(g_{1}(Y_{1}) | S_{1})}{π_{1}(S_{1})} \{T - π_{1}(S_{1})\}\right] = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{1})\} + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{1})}{π_{1}(S_{1})}\right\}, \\ & \mathrm{Var}\left[\frac{(1 - T) g_{0}(Z)}{π_{0}(S_{0})} - \frac{E(g_{0}(Y_{0}) | S_{0})}{π_{0}(S_{0})} \{(1 - T) - π_{0}(S_{0})\}\right] = \mathrm{Var}\{E(g_{0}(Y_{0}) | S_{0})\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{0})}{π_{0}(S_{0})}\right\}. \end{aligned}$

Thus the asymptotic variance of $\hat{θ}(S_{0}, S_{1})$ is

$\mathrm{Var}\{E(g_{1}(Y_{1}) | S_{1}) - E(g_{0}(Y_{0}) | S_{0})\} + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{1})}{π_{1}(S_{1})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{0})}{π_{0}(S_{0})}\right\}.$

(ii) Here we only give the proof for the regression type estimator ${\hat{θ}}_{R E G}$ with $d_{0} = d_{1} = 1$. We only derive the difference in ${\hat{q}}_{1, τ}$ between using the true $B_{k}$ and the estimated $B_{k}$; the proof for ${\hat{q}}_{0, τ}$ is similar. For simplicity, we denote $S_{1}$, $B_{1}$, $h_{1 n}$, $K_{1}$, $g_{1}(\cdot)$, $π_{1}(\cdot)$ by $S$, $B$, $h$, $K$, $g(\cdot)$, $π(\cdot)$ respectively, and define $K_{h}(\cdot) = h^{-1} K(\cdot / h)$ in the following proof. Let $Δ_{i j} = K_{h}({\hat{B}}^{⊤} X_{j} - {\hat{B}}^{⊤} X_{i}) - K_{h}(B^{⊤} X_{j} - B^{⊤} X_{i})$. It can be verified that

$\begin{aligned} & \frac{1}{n} \sum_{i = 1}^{n} \left\{\frac{\sum_{j = 1}^{n} T_{j} g(Z_{j}) K_{h}({\hat{B}}^{⊤} X_{j} - {\hat{B}}^{⊤} X_{i})}{\sum_{j = 1}^{n} T_{j} K_{h}({\hat{B}}^{⊤} X_{j} - {\hat{B}}^{⊤} X_{i})} - \frac{\sum_{j = 1}^{n} T_{j} g(Z_{j}) K_{h}(B^{⊤} X_{j} - B^{⊤} X_{i})}{\sum_{j = 1}^{n} T_{j} K_{h}(B^{⊤} X_{j} - B^{⊤} X_{i})}\right\} \\ & = \frac{1}{n} \sum_{i = 1}^{n} \left\{\frac{\sum_{j = 1}^{n} T_{j} g(Z_{j}) K_{h}(S_{j} - S_{i}) + \sum_{j = 1}^{n} T_{j} g(Z_{j}) Δ_{i j}}{\sum_{j = 1}^{n} T_{j} K_{h}(S_{j} - S_{i}) + \sum_{j = 1}^{n} T_{j} Δ_{i j}} - \frac{\sum_{j = 1}^{n} T_{j} g(Z_{j}) K_{h}(S_{j} - S_{i})}{\sum_{j = 1}^{n} T_{j} K_{h}(S_{j} - S_{i})}\right\} \\ & \equiv A_{1} + A_{2} + A_{3}, \end{aligned}$

where

$\begin{aligned} A_{1} & = \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \left\{\frac{T_{j} g(Z_{j}) Δ_{i j}}{π(S_{i}) f(S_{i})} - \frac{T_{j} E(g(Z_{i}) | S_{i}) Δ_{i j}}{π(S_{i}) f(S_{i})}\right\}, \\ A_{2} & = - \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \left\{\frac{T_{j} g(Z_{j}) Δ_{i j}}{π(S_{i}) f(S_{i})} - \frac{T_{j} g(Z_{j}) Δ_{i j}}{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) + \frac{1}{n} \sum_{l = 1}^{n} T_{l} Δ_{i l}}\right\}, \\ A_{3} & = \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} T_{j} Δ_{i j} \left\{\frac{E(g(Z_{i}) | S_{i})}{π(S_{i}) f(S_{i})} - \frac{E(g(Z_{i}) | S_{i})}{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) + \frac{1}{n} \sum_{l = 1}^{n} T_{l} Δ_{i l}}\right. \\ & \qquad \left. + \frac{E(g(Z_{i}) | S_{i}) - \frac{\sum_{l = 1}^{n} T_{l} g(Z_{l}) K_{h}(S_{l} - S_{i})}{\sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i})}}{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) + \frac{1}{n} \sum_{l = 1}^{n} T_{l} Δ_{i l}}\right\}. \end{aligned}$

Using a Taylor expansion of $Δ_{i j}$ around $B^{⊤} X_{j} - B^{⊤} X_{i}$ and plugging it into $A_{1}$, we have

$\begin{aligned} A_{1} & = \frac{(\hat{B} - B)^{⊤}}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \left\{\frac{T_{j} \{g(Z_{j}) - E(g(Z_{i}) | S_{i})\}}{π(S_{i}) f(S_{i})} \cdot \frac{1}{h} K^{'}\left(\frac{B^{⊤} X_{j} - B^{⊤} X_{i}}{h}\right) \frac{X_{j} - X_{i}}{h}\right\} + o_{p}(n^{-1/2}) \\ & \equiv \frac{(\hat{B} - B)^{⊤}}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} Q_{i j} + o_{p}(n^{-1/2}). \end{aligned}$

Denote $A_{11} = \sum_{i = 1}^{n} \sum_{j = 1}^{n} Q_{i j} / n^{2}$ and $\breve{A}_{11} = \sum_{i = 1}^{n} \sum_{j = 1}^{n} E(Q_{i j} | X_{i}, g(Z_{i}), T_{i}) / n^{2}$. Note that, by integration by parts and the product rule,

$\begin{aligned} & E\left\{\frac{1}{h} T_{j} K^{'}\left(\frac{S_{j} - S_{i}}{h}\right) \frac{X_{j} - X_{i}}{h} \Big| (X_{i}, Z_{i}, T_{i}) = (x_{i}, z_{i}, t_{i})\right\} \\ & = E\left[E\left\{\frac{1}{h} T_{j} K^{'}\left(\frac{S_{j} - s_{i}}{h}\right) \frac{X_{j} - x_{i}}{h} \Big| S_{j}\right\}\right] \\ & = E\left[\frac{1}{h^{2}} K^{'}\left(\frac{S_{j} - s_{i}}{h}\right) E\{T_{j}(X_{j} - x_{i}) | S_{j}\}\right] \\ & = - \frac{\partial [E\{T(X - x_{i}) | S = t\} f(t)]}{\partial t}\Big|_{t = s_{i}} + o_{p}(1) \\ & = - E(T X | S = s_{i}) f^{'}(s_{i}) - \frac{\partial E(T X | S = t)}{\partial t}\Big|_{t = s_{i}} f(s_{i}) + x_{i} π(s_{i}) f^{'}(s_{i}) + x_{i} π^{'}(s_{i}) f(s_{i}) + o_{p}(1), \end{aligned}$

and

$\begin{aligned} & E\left\{\frac{1}{h} T_{j} g(Z_{j}) K^{'}\left(\frac{S_{j} - s_{i}}{h}\right) \frac{X_{j} - x_{i}}{h} \Big| (X_{i}, Z_{i}, T_{i}) = (x_{i}, z_{i}, t_{i})\right\} \\ & = E\left[E\left\{\frac{1}{h} T_{j} g(Z_{j}) K^{'}\left(\frac{S_{j} - s_{i}}{h}\right) \frac{X_{j} - x_{i}}{h} \Big| S_{j}\right\}\right] \\ & = E\left[\frac{1}{h^{2}} K^{'}\left(\frac{S_{j} - s_{i}}{h}\right) E\{T_{j} g(Z_{j})(X_{j} - x_{i}) | S_{j}\}\right] \\ & = - \frac{\partial [E\{T g(Z)(X - x_{i}) | S = t\} f(t)]}{\partial t}\Big|_{t = s_{i}} + o_{p}(1) \\ & = - E(T g(Z) X | S = s_{i}) f^{'}(s_{i}) - \frac{\partial E(T g(Z) X | S = t)}{\partial t}\Big|_{t = s_{i}} f(s_{i}) + x_{i} π(s_{i}) E(g(Z) | S = s_{i}) f^{'}(s_{i}) \\ & \quad + x_{i} π^{'}(s_{i}) E(g(Z) | S = s_{i}) f(s_{i}) + x_{i} π(s_{i}) \frac{\partial E(g(Z) | S = s_{i})}{\partial S} f(s_{i}) + o_{p}(1). \end{aligned}$

Subtracting $E(g(Z) | S = s_{i})$ times the first display from the second and dividing by $π(s_{i}) f(s_{i})$, we obtain

$\begin{aligned} \breve{A}_{11} & = - \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{π(s_{i}) f(s_{i})} \left\{\mathrm{cov}(T X, g(Z) | S = s_{i}) f^{'}(s_{i}) + \frac{\partial E(T g(Z) X | S = t)}{\partial t}\Big|_{t = s_{i}} f(s_{i})\right. \\ & \qquad \left. - \frac{\partial E(T X | S = t)}{\partial t}\Big|_{t = s_{i}} E(g(Z) | S = s_{i}) f(s_{i}) - x_{i} π(s_{i}) \frac{\partial E(g(Z) | S = s_{i})}{\partial S} f(s_{i})\right\} + o_{p}(1) \\ & = - \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{π(s_{i}) f(s_{i})} \left\{\mathrm{cov}(T X, g(Z) | S = s_{i}) f^{'}(s_{i}) + \frac{\partial \mathrm{cov}(T X, g(Z) | S = t)}{\partial t}\Big|_{t = s_{i}} f(s_{i})\right. \\ & \qquad \left. + E(T X | S = s_{i}) \frac{\partial E(g(Z) | S = s_{i})}{\partial S} f(s_{i}) - x_{i} π(s_{i}) \frac{\partial E(g(Z) | S = s_{i})}{\partial S} f(s_{i})\right\} + o_{p}(1) \\ & = (c_{1})_{p \times 1} + o_{p}(1), \end{aligned}$

where

$\begin{aligned} c_{1} & = - E\left\{\frac{\mathrm{cov}(T X, g(Z) | S) f^{'}(S) + \frac{\partial \mathrm{cov}(T X, g(Z) | S)}{\partial S} f(S)}{π(S) f(S)}\right\} - E\left[\frac{\{E(T X | S) - X π(S)\} \frac{\partial E(g(Z) | S)}{\partial S}}{π(S)}\right] \\ & = E\left[\frac{\partial \{π(S)^{-1}\}}{\partial S} \mathrm{cov}(T X, g(Z) | S) - \frac{\mathrm{cov}(X, T | S)}{π(S)} \frac{\partial E(g(Z) | S)}{\partial S}\right]. \end{aligned}$

It can be seen that the first term in $c_{1}$ equals 0 if $Y_{1} ⊥ X | S$, while the second term in $c_{1}$ equals 0 if $T ⊥ X | S$. Thus when both $Y_{1} ⊥ X | S$ and $T ⊥ X | S$ hold, we have $c_{1} = 0$.

Let $A_{11 j} = (1/n) \sum_{i = 1}^{n} Q_{i j}$ and $\breve{A}_{11 j} = (1/n) \sum_{i = 1}^{n} E(Q_{i j} | X_{i}, g(Z_{i}), T_{i})$. We have

$\begin{aligned} E(A_{11} - \breve{A}_{11})^{2} & = \frac{1}{n^{2}} \sum_{j = 1}^{n} E(A_{11 j} - \breve{A}_{11 j})^{2} + \frac{2}{n(n - 1)} \sum_{j \neq k} E(A_{11 j} - \breve{A}_{11 j}) E(A_{11 k} - \breve{A}_{11 k}) \\ & = \frac{1}{n} E(A_{11 j} - \breve{A}_{11 j})^{2} = \frac{1}{n} \{E(A_{11 j}^{2}) - E(\breve{A}_{11 j}^{2})\} \\ & \leq \frac{1}{n} E(A_{11 j}^{2}) = o_{p}(1). \end{aligned}$

Thus we have $A_{11} = c_{1} + o_{p}(1)$, which leads to $\sqrt{n} A_{1} = c_{1}^{⊤} \{\sqrt{n}(\hat{B} - B)\} + o_{p}(1)$. For $A_{2}$, we also use a Taylor expansion for $Δ_{i j}$:

$\begin{aligned} A_{2} & = - \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \left\{\frac{T_{j} g(Z_{j}) Δ_{i j}}{π(S_{i}) f(S_{i})} - \frac{T_{j} g(Z_{j}) Δ_{i j}}{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) + \frac{1}{n} \sum_{l = 1}^{n} T_{l} Δ_{i l}}\right\} \\ & = - \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \left[\frac{T_{j} g(Z_{j})}{h} K^{'}\left(\frac{B^{⊤} X_{j} - B^{⊤} X_{i}}{h}\right) (\hat{B} - B)^{⊤} \frac{X_{j} - X_{i}}{h}\right. \\ & \qquad \left. \times \left\{\frac{1}{π(S_{i}) f(S_{i})} - \frac{1}{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) + \frac{1}{n} \sum_{l = 1}^{n} T_{l} Δ_{i l}}\right\}\right] + o_{p}(n^{-1/2}). \end{aligned}$

We then decompose $A_{2}$ by conditioning on the indices $i$ and $j$; that is, we define

$\begin{aligned} \breve{A}_{2} & = - \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \left[\frac{T_{j} g(Z_{j})}{h} K^{'}\left(\frac{B^{⊤} X_{j} - B^{⊤} X_{i}}{h}\right) (\hat{B} - B)^{⊤} \frac{X_{j} - X_{i}}{h}\right. \\ & \qquad \left. \times E\left\{\frac{1}{π(S_{i}) f(S_{i})} - \frac{1}{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) + \frac{1}{n} \sum_{l = 1}^{n} T_{l} Δ_{i l}} \Big| X_{i}, g(Z_{i}), T_{i}\right\}\right]. \end{aligned}$

Since

$\begin{aligned} E\left\{\frac{1}{n} \sum_{l = 1}^{n} T_{l} K_{h}(S_{l} - S_{i}) \Big| S_{i}\right\} & = π(S_{i}) f(S_{i}) + o_{p}(1), \\ E\left\{\frac{1}{n} \sum_{l = 1}^{n} T_{l} g(Z_{l}) K_{h}(S_{l} - S_{i}) \Big| S_{i}\right\} & = π(S_{i}) E(g(Z_{i}) | S_{i}) f(S_{i}) + o_{p}(1), \end{aligned}$

using a decomposition similar to that for $A_{1}$, we can also show $\sqrt{n} A_{2} \overset{p}{\to} 0$ and $\sqrt{n} A_{3} \overset{p}{\to} 0$. Thus we have proved that

$\begin{aligned} & \frac{1}{n} \sum_{i = 1}^{n} \hat{E}[g(Y_{1 i}) | {\hat{S}}_{i}] - \frac{1}{n} \sum_{i = 1}^{n} \hat{E}[g(Y_{1 i}) | S_{i}] \\ & = \frac{1}{n} \sum_{i = 1}^{n} \left\{\frac{\sum_{j = 1}^{n} T_{j} g(Z_{j}) K_{h}({\hat{S}}_{j} - {\hat{S}}_{i})}{\sum_{j = 1}^{n} T_{j} K_{h}({\hat{S}}_{j} - {\hat{S}}_{i})} - \frac{\sum_{j = 1}^{n} T_{j} g(Z_{j}) K_{h}(S_{j} - S_{i})}{\sum_{j = 1}^{n} T_{j} K_{h}(S_{j} - S_{i})}\right\} \\ & = c_{1}^{⊤}(\hat{B} - B) + o_{p}(1/\sqrt{n}). \end{aligned}$

Note that the REG estimator of $q_{1, τ}$ based on the estimated $S$ is

$\begin{aligned} {\hat{q}}_{1, τ} & = \arg\min_{t} \frac{1}{n} \sum_{i = 1}^{n} \hat{E}\{(Y_{1 i} - t)(τ - 1\{Y_{1 i} \leq t\}) | {\hat{S}}_{i}\} \\ & = \arg\min_{t} \sum_{i = 1}^{n} \left(\hat{E}[(1\{Y_{1 i} \leq q_{1, τ}\} - τ)(t - q_{1, τ}) | {\hat{S}}_{i}] + \hat{E}[(Y_{1 i} - t)(1\{Y_{1 i} \leq q_{1, τ}\} - 1\{Y_{1 i} \leq t\}) | {\hat{S}}_{i}]\right). \end{aligned}$

Let $u = \sqrt{n}(t - q_{1, τ})$ and $\hat{u} = \sqrt{n}({\hat{q}}_{1, τ} - q_{1, τ})$. The optimisation then becomes

$\begin{aligned} \hat{u} = \arg\min_{u} \Big\{ & \sum_{i = 1}^{n} \frac{u}{\sqrt{n}} \hat{E}[(1\{Y_{1 i} \leq q_{1, τ}\} - τ) | {\hat{S}}_{i}] \\ & + \sum_{i = 1}^{n} \hat{E}[(Y_{1 i} - (q_{1, τ} + u/\sqrt{n}))(1\{Y_{1 i} \leq q_{1, τ}\} - 1\{Y_{1 i} \leq q_{1, τ} + u/\sqrt{n}\}) | {\hat{S}}_{i}]\Big\}. \end{aligned}$

Similar to the proof in Firpo (2007), one may check that the second term equals $(f_{1}(q_{1, τ})/2) u^{2} + o_{p}(1)$. Hence we have

$\begin{aligned} \hat{u} & = \sqrt{n}({\hat{q}}_{1, τ} - q_{1, τ}) \\ & = \sqrt{n} \left\{- \frac{1}{n f_{1}(q_{1, τ})} \sum_{i = 1}^{n} \hat{E}[(1\{Y_{1 i} \leq q_{1, τ}\} - τ) | {\hat{S}}_{i}]\right\} \\ & = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \hat{E}[g(Y_{1 i}) | {\hat{S}}_{i}] \\ & = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \hat{E}[g(Y_{1 i}) | S_{i}] + c_{1}^{⊤} \sqrt{n}(\hat{B} - B) + o_{p}(1) \\ & = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \left[\frac{T_{i} g_{1}(Z_{i})}{π_{1}(S_{1 i})} - \frac{E(g_{1}(Y_{1}) | S_{1 i}) \{T_{i} - π_{1}(S_{1 i})\}}{π_{1}(S_{1 i})}\right] + c_{1}^{⊤} \sqrt{n}(\hat{B} - B) + o_{p}(1). \end{aligned}$

The last equality follows from the proof in Chen et al. (2015), which gives the linearisation of the REG estimator using the true $S$. Repeating the above procedure for $q_{0, τ}$ and plugging in the linearisation of $(\hat{B} - B)$, we obtain the linearisation of $\hat{θ}(S_{0}, S_{1}) - θ$; hence Theorem 2.1 is proved.
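The REG estimator in part (ii) minimises an average of kernel-smoothed losses $(Y - t)(τ - 1\{Y \leq t\})$. Because this is the usual check loss, the minimiser can be computed as a weighted quantile of the treated outcomes, with the weight of unit $j$ equal to the sum over $i$ of its kernel regression weight at $S_{i}$. The code below is a minimal sketch of that idea under assumptions that are not from the original: a known scalar index `s`, a Gaussian kernel, a hand-picked bandwidth, simulated data, and the hypothetical function name `reg_quantile`.

```python
import numpy as np

def reg_quantile(z, t, s, tau, h):
    """Kernel-regression (REG-type) estimate of the tau-th quantile of Y_1.

    z: observed outcomes, t: 0/1 treatment indicators, s: scalar index values.
    """
    u = (s[None, :] - s[:, None]) / h          # u[i, j] = (s_j - s_i) / h
    k = np.exp(-0.5 * u**2) * t[None, :]       # Gaussian kernel; only treated units j count
    nw = k / k.sum(axis=1, keepdims=True)      # kernel regression weight of unit j at s_i
    w = nw.sum(axis=0)                         # total weight carried by unit j
    keep = t == 1
    order = np.argsort(z[keep])
    z_sorted, w_sorted = z[keep][order], w[keep][order]
    cum = np.cumsum(w_sorted) / w_sorted.sum()
    idx = min(np.searchsorted(cum, tau), len(z_sorted) - 1)
    return z_sorted[idx]                       # weighted tau-quantile of treated outcomes

# Simulated example (assumed design): Y_1 = S + noise has true median 0,
# and treatment is more likely for large S, so treated units are not representative.
rng = np.random.default_rng(2)
n = 1500
s = rng.uniform(-1.0, 1.0, n)
y1 = s + rng.standard_normal(n)
t = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-s))).astype(int)
z = np.where(t == 1, y1, 0.0)                  # outcomes of untreated units get zero weight
q_hat = reg_quantile(z, t, s, tau=0.5, h=0.3)
```

The normalising constant of the kernel and the factor $h^{-1}$ cancel in the ratio, which is why they are omitted. In the setting of the theorem, `s` would be replaced by the estimated index ${\hat{B}}^{⊤} X$.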

Appendix 4

Asymptotic variance comparisons between using $S_{\min}$ and using $S_{Y_{0}, Y_{1}}$

We first prove that the asymptotic variance using $S_{T}$ is larger than that using $X$. Following the proof below, one may also easily prove that the asymptotic variance using $S_{\min}$ of De Luna et al. (2011) is larger than $V_{S_{Y_{0}, Y_{1}}}^{*}$ unless $S_{\min} = S_{Y_{0}, Y_{1}}$, by replacing the original covariate set $X$ with $S_{Y_{0}, Y_{1}}$ and replacing $S_{T}$ with $S_{\min}$. Adapting the proof of Theorem 2.1, we find that for any $S$ that satisfies $T ⊥ (Y_{0}, Y_{1}) | S$, the asymptotic variance of $\hat{θ}$ using $(S, S)$ is

$\mathrm{Var}\{E(g_{1}(Y_{1}) | S) - E(g_{0}(Y_{0}) | S)\} + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S)}{π_{1}(S)}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S)}{π_{0}(S)}\right\}.$

Since $S_{T}$ satisfies $T ⊥ X | S_{T}$ and $T ⊥ (Y_{0}, Y_{1}) | X$, we have $T ⊥ (Y_{0}, Y_{1}) | S_{T}$. Therefore the asymptotic variance of $\hat{θ}$ using $S_{T}$ is

$\begin{aligned} V_{S_{T}} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{T}) - E(g_{0}(Y_{0}) | S_{T})\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{T})}{π_{1}(S_{T})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{T})}{π_{0}(S_{T})}\right\}, \end{aligned}$

and the asymptotic variance of $\hat{θ}$ using $X$ is

$\begin{aligned} V_{X} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | X) - E(g_{0}(Y_{0}) | X)\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | X)}{π_{1}(X)}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | X)}{π_{0}(X)}\right\}. \end{aligned}$

Therefore $V_{X} - V_{S_{T}}$ is the sum of the following three expressions:

$E[\{π_{1}^{-1}(X) - 1\} \mathrm{Var}(g_{1}(Y_{1}) | X)] - E[\{π_{1}^{-1}(S_{T}) - 1\} \mathrm{Var}(g_{1}(Y_{1}) | S_{T})],$ (A4)

$E[\{π_{0}^{-1}(X) - 1\} \mathrm{Var}(g_{0}(Y_{0}) | X)] - E[\{π_{0}^{-1}(S_{T}) - 1\} \mathrm{Var}(g_{0}(Y_{0}) | S_{T})],$ (A5)

$2 E\{E(g_{1}(Y_{1}) | S_{T}) E(g_{0}(Y_{0}) | S_{T})\} - 2 E\{E(g_{1}(Y_{1}) | X) E(g_{0}(Y_{0}) | X)\}.$ (A6)

Let

$a_{1}(S_{T}) = \sqrt{\frac{1}{π_{1}(X)} - 1} = \sqrt{\frac{1}{π_{1}(S_{T})} - 1}, \quad a_{0}(S_{T}) = \sqrt{\frac{1}{π_{0}(X)} - 1} = \sqrt{\frac{1}{π_{0}(S_{T})} - 1}.$

The expression (A4) equals

$\begin{aligned} E\{\mathrm{Var}(a_{1} g_{1}(Y_{1}) | X) - \mathrm{Var}(a_{1} g_{1}(Y_{1}) | S_{T})\} & = - \mathrm{Var}\{E(a_{1} g_{1}(Y_{1}) | X)\} + \mathrm{Var}\{E(a_{1} g_{1}(Y_{1}) | S_{T})\} \\ & = - E[\mathrm{Var}\{E(a_{1} g_{1}(Y_{1}) | X) | S_{T}\}]. \end{aligned}$

Similarly, the expression (A5) equals $- E[\mathrm{Var}\{E(a_{0} g_{0}(Y_{0}) | X) | S_{T}\}]$, and the expression (A6) equals $- 2 E[\mathrm{cov}\{E(g_{0}(Y_{0}) | X), E(g_{1}(Y_{1}) | X) | S_{T}\}]$. Since $a_{0} a_{1} = 1$,

$V_{X} - V_{S_{T}} = - E(\mathrm{Var}[\{a_{1} E(g_{1}(Y_{1}) | X) + a_{0} E(g_{0}(Y_{0}) | X)\} | S_{T}]) \leq 0,$

which completes the proof.
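The final step of this proof relies on $a_{0} a_{1} = 1$, which uses only $π_{0} = 1 - π_{1}$. As a sketch of the omitted algebra, write $m_{k} = E(g_{k}(Y_{k}) | X)$ (shorthand introduced here, not notation from the original) and suppress the argument $S_{T}$ of $π_{k}$ and $a_{k}$:

$\begin{aligned} a_{1}^{2} a_{0}^{2} & = \frac{1 - π_{1}}{π_{1}} \cdot \frac{1 - π_{0}}{π_{0}} = \frac{π_{0}}{π_{1}} \cdot \frac{π_{1}}{π_{0}} = 1, \\ \mathrm{Var}(a_{1} m_{1} + a_{0} m_{0} | S_{T}) & = \mathrm{Var}(a_{1} m_{1} | S_{T}) + \mathrm{Var}(a_{0} m_{0} | S_{T}) + 2 a_{0} a_{1} \mathrm{cov}(m_{0}, m_{1} | S_{T}) \\ & = \mathrm{Var}(a_{1} m_{1} | S_{T}) + \mathrm{Var}(a_{0} m_{0} | S_{T}) + 2 \mathrm{cov}(m_{0}, m_{1} | S_{T}). \end{aligned}$

Because $a_{k}$ is a function of $S_{T}$, and hence of $X$, we have $E(a_{k} g_{k}(Y_{k}) | X) = a_{k} m_{k}$. Taking expectations in the second display therefore gives, term by term, the negatives of the expressions (A4), (A5) and (A6).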

See Figure for different choices of $S_{k}$ and Figure A1 for the comparisons of efficiency of the estimator $\hat{θ}$ based on different $S_{k}$, $k = 0, 1$.

Appendix 5

Asymptotic variance comparisons between using $S_{Y_{0}, Y_{1}}$ and using $S_{Y_{k}, T}$

In this section, we prove that the asymptotic variance of $\hat{θ}(S_{Y_{0}, Y_{1}}, S_{Y_{0}, Y_{1}})$ is smaller than that of $\hat{θ}(S_{Y_{0}, T}, S_{Y_{1}, T})$, which in turn is smaller than that of $\hat{θ}(S_{T}, S_{T})$. From the proof of Theorem 2.1, for all $S_{k}$ satisfying $T ⊥ Y_{k} | S_{k}$, the asymptotic variance of $\hat{θ}(S_{0}, S_{1})$ is

$\mathrm{Var}\{E(g_{1}(Y_{1}) | S_{1}) - E(g_{0}(Y_{0}) | S_{0})\} + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{1})}{π_{1}(S_{1})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{0})}{π_{0}(S_{0})}\right\}.$

For $\hat{θ}(S_{Y_{0}, Y_{1}}, S_{Y_{0}, Y_{1}})$ and $\hat{θ}(S_{Y_{0}, T}, S_{Y_{1}, T})$, the asymptotic variances $V_{S_{Y_{0}, Y_{1}}}^{*}$ and $V_{S_{Y_{0}, T}, S_{Y_{1}, T}}^{*}$ are

$\begin{aligned} V_{S_{Y_{0}, Y_{1}}}^{*} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{Y_{0}, Y_{1}}) - E(g_{0}(Y_{0}) | S_{Y_{0}, Y_{1}})\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{0}, Y_{1}})}{π_{1}(S_{Y_{0}, Y_{1}})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{Y_{0}, Y_{1}})}{π_{0}(S_{Y_{0}, Y_{1}})}\right\} \\ & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{Y_{1}}) - E(g_{0}(Y_{0}) | S_{Y_{0}})\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{Y_{0}, Y_{1}})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{Y_{0}})}{π_{0}(S_{Y_{0}, Y_{1}})}\right\}, \\ V_{S_{Y_{0}, T}, S_{Y_{1}, T}}^{*} & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{Y_{1}, T}) - E(g_{0}(Y_{0}) | S_{Y_{0}, T})\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}, T})}{π_{1}(S_{Y_{1}, T})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{Y_{0}, T})}{π_{0}(S_{Y_{0}, T})}\right\} \\ & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{Y_{1}}) - E(g_{0}(Y_{0}) | S_{Y_{0}})\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{Y_{1}, T})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{Y_{0}})}{π_{0}(S_{Y_{0}, T})}\right\} \\ & = \mathrm{Var}\{E(g_{1}(Y_{1}) | S_{Y_{1}}) - E(g_{0}(Y_{0}) | S_{Y_{0}})\} \\ & \quad + E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{T})}\right\} + E\left\{\frac{\mathrm{Var}(g_{0}(Y_{0}) | S_{Y_{0}})}{π_{0}(S_{T})}\right\}. \end{aligned}$

Thus we only need to prove

$E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{T})}\right\} \geq E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{Y_{0}, Y_{1}})}\right\}.$

By Jensen's inequality,

$\begin{aligned} E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{T})}\right\} & = E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(X)}\right\} \\ & = E\left\{E\left[\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(X)} \Big| S_{Y_{0}, Y_{1}}\right]\right\} \\ & = E\left\{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}}) E\left[\frac{1}{π_{1}(X)} \Big| S_{Y_{0}, Y_{1}}\right]\right\} \\ & \geq E\left\{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}}) \frac{1}{E[π_{1}(X) | S_{Y_{0}, Y_{1}}]}\right\} \\ & = E\left\{\frac{\mathrm{Var}(g_{1}(Y_{1}) | S_{Y_{1}})}{π_{1}(S_{Y_{0}, Y_{1}})}\right\}. \end{aligned}$

Hence $\hat{θ}(S_{Y_{0}, Y_{1}}, S_{Y_{0}, Y_{1}}) \to \hat{θ}(S_{Y_{0}, T}, S_{Y_{1}, T})$, where, as in Figure A2, an arrow from A to B means that A is asymptotically more efficient than B. Note that from Lemma A.1 we have $V_{S_{Y_{0}, T}, S_{Y_{1}, T}}^{*} \leq V_{X, X}^{*}$, i.e. $\hat{θ}(S_{Y_{0}, T}, S_{Y_{1}, T}) \to \hat{θ}(X, X)$. The remaining result then follows from $\hat{θ}(X, X) \to \hat{θ}(S_{T}, S_{T})$, which is proved in Appendix 4.

Figure A1. Five choices of $S_{k}$ in the space of all linear combinations of X. For $(S_{Y_{0}}, S_{Y_{1}})$ and $(S_{Y_{0}, T}, S_{Y_{1}, T})$ , the first row shows $S_{0}$ for estimating $Y_{0}$ characteristics and the second row shows $S_{1}$ for estimating $Y_{1}$ characteristics.

Figure A2. Relative efficiencies of estimators. Solid arrow from A to B means that A is more asymptotically efficient than B. Dashed arrow from A to B means that empirically A is more efficient than B.
