ABSTRACT
Tang et al. (2003. Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4), 747–764) and Zhao & Shao (2015. Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512), 1577–1590) proposed a pseudo likelihood approach to estimate unknown parameters in a parametric density of a response Y conditional on a covariate vector X, where Y is subject to nonignorable nonresponse, X is always observed, and the propensity of whether or not Y is observed, conditional on Y and X, is completely unspecified. To identify parameters, Zhao & Shao (2015) assumed that X can be decomposed into U and Z, where Z can be excluded from the propensity but is related to Y even conditional on U. The pseudo likelihood involves the estimation of the joint density of U and Z. When this density is estimated nonparametrically, we apply sufficient dimension reduction to reduce the dimension of U for efficient estimation. Consistency and asymptotic normality of the proposed estimators are established. Simulation results are presented to study the finite sample performance of the proposed estimators.
1. Introduction
Missing data or nonresponse is common in various statistical applications such as sample surveys and biomedical studies. Let Y be a univariate response variable subject to nonresponse, X be a vector of covariates that is always observed, and R be the indicator of whether Y is observed. When the propensity P(R = 1 | Y, X) does not depend on Y, missing data are ignorable and there is a rich literature on methodology for handling nonresponse (Little & Rubin, 2014). In many applications, however, Y cannot be excluded from the propensity P(R = 1 | Y, X), and missing data are nonignorable. With nonignorable nonresponse, it is challenging to estimate unknown characteristics in the conditional distribution of Y given X or the unconditional distribution of Y.
Throughout we use p(· | ·) or p(·) as a generic notation for the conditional or unconditional probability density with respect to an appropriate measure. When nonresponse is nonignorable and both p(y | x) and the propensity P(R = 1 | y, x) are nonparametric, p(y | x) is not identifiable (Robins & Ritov, 1997). When both p(y | x) and P(R = 1 | y, x) have parametric forms, maximum likelihood methods have been developed (Baker & Laird, 1988; Greenlees, Reece, & Zieschang, 1982). Since parametric methods are sensitive to model violations, efforts have been made under semiparametric models. Qin, Leung, and Shao (2002) and Wang, Shao, and Kim (2014) imposed a parametric model on the propensity but allowed p(y | x) to be nonparametric. Assuming P(R = 1 | y, x) = P(R = 1 | y), i.e., the entire covariate vector X can be excluded from the propensity, Tang, Little, and Raghunathan (2003) proposed a pseudo likelihood method in which p(x) is nonparametric and p(y | x) is parametric:
(1) p(y | x) = f(y | x; θ), where θ is an unknown parameter vector and f(· | ·; θ) is a conditional density which is known when θ is known.
is a conditional density which is known when θ is known. Zhao and Shao (Citation2015) extended the pseudo likelihood method to the case where part of X can be excluded from the propensity, i.e.,
(2) where
and the covariate Z, termed as instrumental variable, cannot be excluded from
in (Equation1
(1) ). Here is a brief description of what has been done under (Equation1
(1) )–(Equation2
(2) ). Under (Equation1
(1) )–(Equation2
(2) ) and the Bayes formula,
(3) p(z | y, u, R = 1) = p(z | y, u) = f(y | u, z; θ) p(z | u) / ∫ f(y | u, z′; θ) p(z′ | u) dz′.

Let (y_i, u_i, z_i, r_i), i = 1, …, n, be n independent and identically distributed observations from (Y, U, Z, R), where y_i is observed only when r_i = 1. Based on (3), if p(u, z) is known, then we can estimate θ by maximising the following likelihood:
(4) L(θ) = ∏_{i: r_i = 1} f(y_i | u_i, z_i; θ) p(u_i, z_i) / ∫ f(y_i | u_i, z; θ) p(u_i, z) dz.

Usually p(u, z) is unknown. Substituting p(u, z) in (4) by its estimate results in a pseudo likelihood. For example, Zhao and Shao (2015) assumed a parametric model p(u | z; η) for p(u | z) and replaced p(u, z) in (4) by p(u | z; η̂) dF̂(z), where η̂ is an estimator of η based on (u_i, z_i), i = 1, …, n, and F̂ is the empirical distribution of z_i, i = 1, …, n. To avoid model misspecification on p(u | z), Zhao and Shao (2015) also suggested a nonparametric kernel estimator p̂(u | z) to replace p(u | z) in (4).
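The Bayes-formula step behind (3) is worth recording explicitly; under (2) the propensity cancels because Z is excluded from it, and (1) supplies the parametric density. A sketch in our notation:

```latex
\begin{align*}
p(z \mid y, u, R = 1)
  &= \frac{P(R=1 \mid y, u, z)\, p(z \mid y, u)}{P(R=1 \mid y, u)}
   = p(z \mid y, u)
   && \text{by (2)} \\
  &= \frac{p(y \mid u, z)\, p(z \mid u)}{\int p(y \mid u, z')\, p(z' \mid u)\, dz'}
   && \text{by Bayes' formula} \\
  &= \frac{f(y \mid u, z; \theta)\, p(z \mid u)}
          {\int f(y \mid u, z'; \theta)\, p(z' \mid u)\, dz'}
   && \text{by (1).}
\end{align*}
```

Thus the conditional likelihood of the observed z_i given (y_i, u_i, R_i = 1) involves the unknown density of the covariates only through the denominator.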
However, kernel estimation is unstable when the dimension of X = (U, Z) is not small. The purpose of our work is to propose an alternative way to handle p(u, z) in (4), which adopts dimension reduction techniques to improve the resulting pseudo likelihood estimator of θ. Our main idea is described in the next section, along with the proposed pseudo likelihood and estimator of θ. Under typical conditions for kernel estimation, our proposed pseudo likelihood estimator is asymptotically normal with convergence rate n^{−1/2}. We also perform some simulations to examine the finite sample properties of our proposed estimator.
2. Methodology and theory
As pointed out in the previous section, the main contribution of this paper is to estimate p(u, z) in the denominator of (4) in a more efficient way, especially when the covariate U is of high dimension. Often, the dimension of the instrumental variable Z is small, while the covariate U contains many variables (demographic variables, for instance) and its dimension p is not small. A straightforward idea is to split p(u, z) as p(z | u) p(u), instead of p(u | z) p(z) as in Zhao and Shao (2015). Since p(u) does not involve θ, the likelihood in (4) is equivalent to

(5) ∏_{i: r_i = 1} f(y_i | u_i, z_i; θ) p(z_i | u_i) / ∫ f(y_i | u_i, z; θ) p(z | u_i) dz,

and our main task is to estimate the denominator in (5), i.e., the integral

(6) ∫ f(y | u, z; θ) p(z | u) dz = E{f(y | u, Z; θ) | U = u},

for arbitrarily given θ, y and u. Since the real form of p(z | u) is not our main concern, for robustness, we adopt a nonparametric kernel regression, namely Nadaraya-Watson (NW) estimation (Nadaraya, 1964; Watson, 1964), to estimate the conditional expectation in (6). The way we split p(u, z), along with the nonparametric estimation, actually frees us from parameterising or modelling the relationship between Z and U.
Before building our estimators, we introduce K_h as a generic notation for a kernel with an appropriate dimension and bandwidth h; the kernels K_h appearing in different places may be different. In what follows K_h is chosen to be a product kernel of dimension s and order m in the sense that K_h(x) = h^{−s} ∏_{j=1}^{s} k(x_j / h), where x_j is the jth component of the s-dimensional x and k is a bounded and Lipschitz continuous univariate kernel having a compact support and satisfying ∫ k(t) dt = 1, ∫ t^m k(t) dt is finite and nonzero, and ∫ t^l k(t) dt = 0 for all 0 < l < m. The NW estimator of the conditional expectation in (6) is

(7) γ̂(y, u; θ) = Σ_{i=1}^{n} K_h(u_i − u) f(y | u, z_i; θ) / Σ_{i=1}^{n} K_h(u_i − u).

Substituting this estimator into (5) leads to a maximum pseudo likelihood estimator of θ,

(8) θ̂ = argmax_θ ∏_{i: r_i = 1} f(y_i | u_i, z_i; θ) / γ̂(y_i, u_i; θ).

Although p, the dimension of U, does not exert a direct influence on the convergence rate of θ̂, as shown in Theorem 2.1, it is well known in the literature that kernel estimators do not perform well when p is very large. For example, their convergence requires a very large sample size n.
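As a concrete illustration of (6)–(7), the product kernel and the NW weighting can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the names `product_kernel`, `nw_gamma_hat` and `f_cond` (standing for the parametric density f(y | u, z; θ)) are ours, and a second-order Epanechnikov kernel is used for k.

```python
import numpy as np

def product_kernel(x, h):
    """Second-order (m = 2) product kernel K_h(x) = h^{-s} * prod_j k(x_j / h),
    with k the Epanechnikov kernel k(t) = 0.75 (1 - t^2) on [-1, 1]:
    bounded, Lipschitz, compactly supported, and integrating to 1."""
    t = np.atleast_1d(x) / h
    k = np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)
    return np.prod(k) / h ** t.size

def nw_gamma_hat(y, u, theta, U_obs, Z_obs, f_cond, h):
    """NW estimate of E{ f(y | u, Z; theta) | U = u }: a kernel-weighted
    average of f(y | u, z_i; theta) over all n observed pairs (u_i, z_i)."""
    w = np.array([product_kernel(ui - u, h) for ui in U_obs])
    vals = np.array([f_cond(y, u, zi, theta) for zi in Z_obs])
    return np.sum(w * vals) / np.sum(w)
```

Each evaluation averages f(y | u, z_i; θ) over the full covariate sample, with weights concentrating on observations whose u_i is near u.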
Fortunately, there is a well-developed nonparametric method called sufficient dimension reduction (SDR) (Cook & Weisberg, 1991; Li & Wang, 2007; Li, 1991; Ma & Zhu, 2012; Xia, Tong, Li, & Zhu, 2002), which helps to reduce the dimension of predictors by finding a p × d matrix B with the smallest possible d such that Z and U are conditionally independent given B^T U, meaning that Z is only related to U via B^T U, where B^T is the transpose of B. It is common that d < p, and we are able to improve the estimation in (7) and (8) by applying NW estimation directly to Z and B̂^T U, with B̂ being an SDR estimator of B. Starting from sliced inverse regression (SIR) (Li, 1991), research has been done in the literature to develop SDR estimators of B, including sliced average variance estimation (SAVE) (Cook & Weisberg, 1991), directional regression (DR) (Li & Wang, 2007), (conditional) minimum average variance estimation (MAVE) (Xia et al., 2002), the semiparametric approach to dimension reduction (Ma & Zhu, 2012), etc. We adopt the SIR method developed by Li (1991) to estimate B in our simulation studies in Section 3 because it is easy to implement and works well in practice.
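To make the SDR step concrete, here is a minimal sketch of SIR for a univariate Z; the function name, the equal-count slicing, and the default of 10 slices are illustrative choices, not the exact implementation used in Section 3.

```python
import numpy as np

def sir_directions(U, Z, d, n_slices=10):
    """Sliced inverse regression: slice on Z, average the standardized U
    within slices, and take the top d eigenvectors of the between-slice
    covariance, mapped back to the original scale of U."""
    n, p = U.shape
    mu = U.mean(axis=0)
    Sigma = np.cov(U, rowvar=False)
    # Sigma^{-1/2} via its spectral decomposition
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Us = (U - mu) @ Sigma_inv_half
    # Between-slice covariance of the slice means of standardized U
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Z), n_slices):
        m = Us[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    w, v = np.linalg.eigh(M)                   # eigenvalues in ascending order
    return Sigma_inv_half @ v[:, ::-1][:, :d]  # top d directions, p x d
```

The returned p × d matrix spans the estimated directions; the kernel estimation is then carried out with B̂^T U in place of the full U.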
When p(z | u) = p(z | B^T u), the conditional expectation in (6) is equal to E{f(y | u, Z; θ) | B^T U = B^T u}, which can be estimated by a new kernel estimator

γ̃(y, u; θ) = Σ_{i=1}^{n} K_h(B̂^T(u_i − u)) f(y | u, z_i; θ) / Σ_{i=1}^{n} K_h(B̂^T(u_i − u)).

Then, a new maximum pseudo likelihood estimator of θ is

(9) θ̃ = argmax_θ ∏_{i: r_i = 1} f(y_i | u_i, z_i; θ) / γ̃(y_i, u_i; θ).

In the rest of this section we establish asymptotic consistency and normality of our proposed estimators θ̂ in (8) and θ̃ in (9). We first introduce some notation. Let f(y | u, z; θ) be the conditional density in (1) and f′(y | u, z; θ) be its first derivative with respect to θ. For the estimator without SDR, denote

(10) γ(y, u; θ) = E{f(y | u, Z; θ) | U = u},

and the kernel estimator of γ is denoted by

(11) γ̂(y, u; θ) = Σ_{j=1}^{n} K_h(u_j − u) f(y | u, z_j; θ) / Σ_{j=1}^{n} K_h(u_j − u).

Therefore, maximising the pseudo likelihood in (8) is the same as maximising the pseudo log-likelihood

(12) ℓ(θ) = Σ_{i: r_i = 1} {log f(y_i | u_i, z_i; θ) − log γ̂(y_i, u_i; θ)}.

Differentiating ℓ(θ) with respect to θ, we obtain the score function

(13) S(θ) = ∂ℓ(θ)/∂θ.

Thus, θ̂ can also be obtained by solving S(θ) = 0.
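In practice the maximiser has no closed form, so θ̂ is computed numerically. A minimal sketch (our illustrative wrapper around `scipy.optimize.minimize`; `neg_log_lik` would evaluate the negative pseudo log-likelihood, recomputing the kernel estimator at each θ):

```python
import numpy as np
from scipy.optimize import minimize

def max_pseudo_likelihood(neg_log_lik, theta0):
    """Maximise a pseudo log-likelihood by minimising its negative; at an
    interior maximum this is equivalent to solving the score S(theta) = 0."""
    res = minimize(neg_log_lik, np.asarray(theta0, dtype=float), method="BFGS")
    return res.x
```

As a toy check, with neg_log_lik(θ) = ½ Σ_i (y_i − θ)² the routine returns the sample mean of y.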
We use the same notation γ as in (10) for the estimator with SDR,

(14) γ(y, u; θ) = E{f(y | u, Z; θ) | B^T U = B^T u}.

Its kernel estimator is

(15) γ̃(y, u; θ) = Σ_{j=1}^{n} K_h(B̂^T(u_j − u)) f(y | u, z_j; θ) / Σ_{j=1}^{n} K_h(B̂^T(u_j − u)).

The pseudo log-likelihood and score function of θ̃ can be obtained by replacing γ̂ in (12) and (13) by γ̃.
As the structures of the two estimators are alike, we only provide the sufficient conditions for θ̂ in Assumptions 2.1–2.2, where γ is defined as in (10). For θ̃, however, Assumptions 2.1–2.2 need to be modified by replacing (10) by (14), (11) by (15), and p by d. In addition, in order for θ̃ to converge with rate n^{−1/2}, B̂ should also possess some asymptotic properties, as indicated in Assumption 2.3 with γ defined in (14). In fact, the SDR estimator B̂ and the kernel estimators γ̂ and γ̃ all have good asymptotic performance (Bierens, 1987; Li, 1991; Nadaraya, 1964; Watson, 1964).
Throughout, θ_0 denotes the true but unknown value of θ.
Assumption 2.1
There exists a constant c>0 such that . The function
has bounded mth derivative with respect to u. There exists a q>1 such that
as
and
is bounded.
Let ‖·‖_∞ denote the sup-norm. Under Assumption 2.1,
and
, which means that the estimator
converges uniformly to γ at a certain rate, i.e.,
(Hansen, 2008; Newey & McFadden, 1994).
Assumption 2.2
There exists
such that
where
is defined in (12).
for
with small enough
and
and
are bounded as functions of y and u.
For small enough
is continuously differentiable in θ on a neighbourhood of
where
is defined in (13). There exists a
with
such that
for an
.
exists and is nonsingular.
and
as
.
Assumptions 2.1–2.2 together guarantee that θ̂ is consistent for θ_0. As to θ̃, there is an extra step of SDR estimation, requiring Assumption 2.3.
Assumption 2.3
Let for some
where
is the dimension of z, and γ is as defined in (14).
Uniformly in Ω, the mth derivatives of
and
on
are locally Lipschitz-continuous as functions of
.
and each entry in the matrices
are locally Lipschitz-continuous and bounded from above as functions of
.
in probability as
and
where
.
The bandwidth
for
.
Note that B is a matrix except when d = 1, and when it comes to the calculation of derivatives or norms, we treat B as vec(B), the vector formed by concatenating the columns of B. For simplicity, we still use B to represent vec(B), and similarly B̂ to represent vec(B̂).
Theorem 2.1
If Assumptions 2.1–2.2 hold, then, as
in probability and
where
means convergence in distribution, and
If Assumptions 2.1–2.2 hold with (10), (11) and p replaced by (14), (15) and d, respectively, and if Assumption 2.3 also holds, then, as
in probability and
where
It follows from Theorem 2.1 that the asymptotic variance of θ̃ may or may not be smaller than that of θ̂, partly due to the variability of SDR estimation. Owing to SDR, the dimension of kernel estimation is reduced from p to d, which is the main advantage of θ̃ over θ̂, although they have the same convergence rate. A problem with a not so small p is the selection of the bandwidth h for kernel estimation. Specifically, if h = n^{−τ} with a τ > 0, then from Assumptions 2.1 and 2.2(iv), τ must be between 1/(4m) and (1 − 1/q)/p for θ̂, where m is the order of the kernel. When p is not so small, the upper bound (1 − 1/q)/p is quite small, and we may need to increase the kernel order m in order to find such a τ satisfying the constraints. If m > 2, the kernel takes negative values, which may reduce the stability of kernel estimation. For example, if q = 2 in Assumption 2.1, then we must have 1/(4m) < τ < 1/(2p), and we cannot use m = 2 when p > 3; if p = 8, then we must use a kernel of order m = 5. On the other hand, if we reduce U to B^T U with dimension d, the upper bound increases to (1 − 1/q)/d, allowing enough flexibility in choosing τ to be larger than the lower bound 1/(4m), which results in a low-order kernel. If q = 2, then we may use m = 2 when d ≤ 3.
Although both θ̂ and θ̃ have convergence rate n^{−1/2}, θ̃ may still have an edge over θ̂ in finite sample performance because of the smaller dimension and kernel order used in the kernel estimation. This is supported by the simulation results in Section 3.
3. Simulation studies
We study the finite-sample performance of the proposed pseudo likelihood estimators in two simulation studies. Four estimators are compared: the estimator based on the full data as if there were no missing values, the estimator based on complete-case analysis, the maximum pseudo likelihood estimator without dimension reduction (θ̂ defined in (8)), and the maximum pseudo likelihood estimator with dimension reduction (θ̃ defined in (9)). The NW estimates are computed using a Gaussian kernel. The bandwidths are selected by the rule of thumb proposed by Silverman (1986), i.e., h = 1.06 σ̂ n^{−1/5}, where σ̂ is the estimated standard deviation of the corresponding component of U. For SDR, we apply SIR in Li (1991) with a fixed number of slices. All results are based on 1000 simulation replications and sample size n = 400.
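For reference, the bandwidth rule can be sketched as follows; we use the common univariate Gaussian-kernel form h_j = 1.06 σ̂_j n^{−1/5} per coordinate of U, and the exact constant used in the study may differ.

```python
import numpy as np

def silverman_bandwidth(U):
    """Silverman's rule-of-thumb bandwidth, one per coordinate of U:
    h_j = 1.06 * sigma_hat_j * n^(-1/5)."""
    U = np.asarray(U, dtype=float)
    n = U.shape[0]
    return 1.06 * U.std(axis=0, ddof=1) * n ** (-1 / 5)
```

With n = 400 this gives roughly 1.06 σ̂_j × 400^{−0.2} ≈ 0.32 σ̂_j per coordinate.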
3.1. Experiment 1
In the first experiment, , where
and
. Hence, the dimension p equals 6. To evaluate the performance of SDR, two models of
are studied:
and
, where d equals 1 and 2, respectively. The response Y is generated as
, where
. The propensity
, resulting in unconditional response rates of 73% and 74% in the two cases, respectively. Table 1 reports the estimators and their standard deviations.
Table 1. Simulation results of experiment 1.
The simulation results can be summarised as follows. First, θ̃, the estimator with dimension reduction, is nearly unbiased for θ whether d equals 1 or 2. Second, the bias of θ̂, the estimator without dimension reduction, may not be negligible and, in some cases, is comparable to that of the complete-case estimator, which is biased in theory. As we discussed in Section 2, compared with θ̃, the assumptions are harder to satisfy for θ̂. Third, the standard deviations of θ̂ and θ̃ are quite close and, in some cases, θ̃ has a slightly smaller standard deviation than θ̂.
3.2. Experiment 2
In the second experiment, we consider a binary outcome Y. The covariate U is generated in the same way as in the first experiment, with the same two models of Z. The binary Y and the propensity are generated so that the unconditional response rates are 71% or 72% in the two cases. The simulation results are reported in Table 2. The conclusions are quite similar to those for experiment 1, but the bias of θ̂ is more serious: in some cases the biases of θ̂ are even larger than those of the complete-case estimator. Meanwhile, θ̃ is still nearly unbiased.
Table 2. Simulation results of experiment 2.
4. Proofs
Proof of Theorem 2.1
We first prove part (i). Here, γ is defined as in (10) and
is defined as in (11). When
, denote
as
. Note that, in Assumption 2.1, we assume that there exists a constant c>0 such that
. Moreover, the univariate kernel
is bounded; hence, there exists a constant b>0 such that
.
For the consistency of θ̂, we only need to verify the assumptions of Theorem 2(a) in Zhao and Shao (2015). Note that Assumptions 2.1 and 2.2(i), (iv) guarantee that
is consistent. Then,
Hence,
Since
,
in probability as
. Therefore,
Following Theorem 2 in Zhao and Shao (2015), with Assumption 2.2(i),
in probability as
, where
is the true value of θ.
For the asymptotic normality of θ̂, we only need to verify the assumptions of Theorem 8.11 in Newey and McFadden (1994).
Step 1. Let , and
We would like to prove
Note that
(16) which follows from Assumptions 2.1 and 2.2(ii).
Step 2. Let
We would like to prove
and
. For
, the V-statistics method is applied. We use the result of Lemma 8.4 of Newey and McFadden (1994) directly. In our case,
. Then
Let
, then
where
and
Then, following Assumption 2.2(ii) and (iv), we obtain that
Hence,
.
As to , by Chebyshev's inequality, since
,
which follows from Assumptions 2.1 and 2.2(ii).
Step 3. Note that
Let
,
and
Then,
We would like to prove
By Chebyshev's inequality, we only need to prove
So we only need to prove
and
. Let
Then,
Therefore, by Assumption 2.2(ii) and (iv),
. And
Note that,
for some constant C, and
Since when n is large enough, and
is bounded,
Then by the conditional dominated convergence theorem,
By the dominated convergence theorem,
.
Then, combining Steps 1–3 and applying Theorems 8.11 and 8.12 in Newey and McFadden (1994) with Assumption 2.2(iii), we obtain
.
Second, we prove part (ii). Here, γ is defined as in (14) and its kernel estimator γ̃ is defined as in (15). We consider the normality of θ̃. When B̂
is replaced by the true SDR direction B, the proof is exactly the same as in part (i). Then we could easily get that
. Now we only need to consider
Based on Lemma 3 in Ma and Zhu (2012), with Assumption 2.3(i) and (iii),
And, with Assumption 2.3(ii), we can easily prove the normality of
This finishes the proof of part (ii).
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Ji Chen
Ji Chen is a PhD candidate at East China Normal University.
Bingying Xie
Bingying Xie is a statistician at Roche in Shanghai, China.
Jun Shao
Jun Shao is a professor at the University of Wisconsin-Madison.
References
- Baker, S. G., & Laird, N. M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical Association, 83(401), 62–69. doi: 10.1080/01621459.1988.10478565
- Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in econometrics: Fifth world congress (Vol. 1, pp. 99–144).
- Cook, R. D., & Weisberg, S. (1991). Comment. Journal of the American Statistical Association, 86(414), 328–332.
- Greenlees, J. S., Reece, W. S., & Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association, 77(378), 251–261. doi: 10.1080/01621459.1982.10477793
- Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24(3), 726–748. doi: 10.1017/S0266466608080304
- Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. doi: 10.1080/01621459.1991.10475035
- Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. doi: 10.1198/016214507000000536
- Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). New York: John Wiley & Sons.
- Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. doi: 10.1080/01621459.2011.646925
- Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. doi: 10.1137/1109020
- Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. doi: 10.1016/S1573-4412(05)80005-4
- Qin, J., Leung, D., & Shao, J. (2002). Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457), 193–200. doi: 10.1198/016214502753479338
- Robins, J., & Ritov, Y. (1997). Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models. Statistics in Medicine, 16(3), 285–319. doi: 10.1002/(SICI)1097-0258(19970215)16:3<285::AID-SIM535>3.0.CO;2-#
- Silverman, B. W. (1986). Density estimation for statistics and data analysis (Vol. 26). New York: CRC Press.
- Tang, G., Little, R. J., & Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4), 747–764. doi: 10.1093/biomet/90.4.747
- Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3), 1097–1116.
- Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4), 359–372.
- Xia, Y., Tong, H., Li, W., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 363–410. doi: 10.1111/1467-9868.03411
- Zhao, J., & Shao, J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512), 1577–1590. doi: 10.1080/01621459.2014.983234