SHORT COMMUNICATIONS

Pseudo likelihood and dimension reduction for data with nonignorable nonresponse

Ji Chen, Bingying Xie & Jun Shao
Pages 196-205 | Received 23 Apr 2018, Accepted 22 Aug 2018, Published online: 01 Sep 2018

ABSTRACT

Tang et al. (2003. Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4), 747–764) and Zhao & Shao (2015. Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512), 1577–1590) proposed a pseudo likelihood approach to estimate unknown parameters in a parametric density of a response Y conditional on a covariate vector X, where Y is subject to nonignorable nonresponse, X is always observed, and the propensity of whether or not Y is observed, conditional on Y and X, is completely unspecified. To identify the parameters, Zhao & Shao (2015) assumed that X can be decomposed into U and Z, where Z can be excluded from the propensity but is related to Y even conditionally on U. The pseudo likelihood involves the estimation of the joint density of U and Z. When this density is estimated nonparametrically, we apply sufficient dimension reduction to reduce the dimension of U for efficient estimation. Consistency and asymptotic normality of the proposed estimators are established. Simulation results are presented to study the finite sample performance of the proposed estimators.

1. Introduction

Missing data or nonresponse is common in various statistical applications such as sample surveys and biomedical studies. Let Y be a univariate response variable subject to nonresponse, X be a vector of covariates that is always observed, and R be the indicator of whether Y is observed. When the propensity $P(R=1\mid X,Y)$ does not depend on Y, missing data are ignorable, and there is a rich literature on methods for handling nonresponse (Little & Rubin, 2014). In many applications, however, Y cannot be excluded from the propensity and missing data are nonignorable. With nonignorable nonresponse, it is challenging to estimate unknown characteristics in the conditional distribution of Y given X or the unconditional distribution of Y.

Throughout we use $p(\cdot\mid\cdot)$ or $p(\cdot)$ as a generic notation for a conditional or unconditional probability density with respect to an appropriate measure. When nonresponse is nonignorable and both $p(y\mid x)$ and the propensity $P(R=1\mid y,x)$ are nonparametric, $p(y\mid x)$ is not identifiable (Robins & Ritov, 1997). When both have parametric forms, maximum likelihood methods have been developed (Baker & Laird, 1988; Greenlees, Reece, & Zieschang, 1982). Since parametric methods are sensitive to model violations, efforts have been made under semiparametric models. Qin, Leung, and Shao (2002) and Wang, Shao, and Kim (2014) imposed a parametric model on the propensity but allowed $p(y\mid x)$ to be nonparametric. Assuming $P(R=1\mid Y,X)=P(R=1\mid Y)$, i.e., the entire covariate vector X can be excluded from the propensity, Tang, Little, and Raghunathan (2003) proposed a pseudo likelihood method in which $p(x)$ is nonparametric and $p(y\mid x)$ is parametric: (1) $p(y\mid x)=f(y\mid x;\theta)$, where θ is an unknown parameter vector and $f(\cdot\mid x;\theta)$ is a conditional density which is known when θ is known. Zhao and Shao (2015) extended the pseudo likelihood method to the case where part of X can be excluded from the propensity, i.e., (2) $P(R=1\mid Y=y,X=x)=P(R=1\mid Y=y,U=u)$, where $x=(u,z)$ and the covariate Z, termed an instrumental variable, cannot be excluded from $f(y\mid x;\theta)$ in (1). Here is a brief description of what has been done under (1)–(2). Under (1)–(2) and the Bayes formula, (3) $p(z\mid u,y,R=1)=\frac{f(y\mid u,z;\theta)\,p(u,z)}{\int f(y\mid u,z';\theta)\,p(u,z')\,dz'}$. Let $(y_i,u_i,z_i,r_i)$, $i=1,\ldots,n$, be n independent and identically distributed observations from $(Y,U,Z,R)$, where $y_i$ is observed only when $r_i=1$. Based on (3), if $p(u,z)$ is known, then we can estimate θ by maximising the following likelihood (after dropping factors not involving θ): (4) $\ell(\theta)=\prod_{i:\,r_i=1}\frac{f(y_i\mid u_i,z_i;\theta)}{\int f(y_i\mid u_i,z;\theta)\,p(u_i,z)\,dz}$. Usually $p(u,z)$ is unknown. Substituting $p(u,z)$ in (4) by its estimate results in a pseudo likelihood. For example, Zhao and Shao (2015) assumed a parametric model $p(z\mid u;\eta)$ for the conditional density $p(z\mid u)$ and replaced $p(u,z)$ in (4) by $p(z\mid u;\hat\eta)\,\hat p(u)$, where $\hat\eta$ is an estimator of η based on $(u_i,z_i)$, $i=1,\ldots,n$, and $\hat p$ is the empirical distribution of $u_i$, $i=1,\ldots,n$. To avoid model misspecification on $p(z\mid u)$, Zhao and Shao (2015) also suggested a nonparametric kernel estimator to replace $p(u,z)$ in (4).
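To spell out the Bayes-formula step behind (3) (a derivation we add for completeness, using only (1) and (2)): since the propensity in (2) does not involve z, it cancels from the conditional density of Z given (U, Y, R=1),
\[
p(z\mid u,y,R=1)
=\frac{P(R=1\mid y,u)\,f(y\mid u,z;\theta)\,p(u,z)}{\int P(R=1\mid y,u)\,f(y\mid u,z';\theta)\,p(u,z')\,dz'}
=\frac{f(y\mid u,z;\theta)\,p(u,z)}{\int f(y\mid u,z';\theta)\,p(u,z')\,dz'},
\]
which is exactly (3).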

However, kernel estimation is unstable when the dimension of U is not small. The purpose of our work is to propose an alternative way to handle the unknown density in (4), which adopts dimension reduction techniques to improve the resulting pseudo likelihood estimator of θ. Our main idea is described in the next section, along with the proposed pseudo likelihood and estimator of θ. Under typical conditions for kernel estimation, our proposed pseudo likelihood estimator is asymptotically normal with convergence rate $\sqrt{n}$. We also perform some simulations to examine the finite sample properties of our proposed estimator.

2. Methodology and theory

As pointed out in the previous section, the main contribution of this paper is to estimate the unknown density in the denominator of (4) in a more efficient way, especially when the covariate U is of high dimension. Oftentimes the dimension of the instrumental variable Z is small, while the covariate U contains many variables (demographic variables, for instance), and its dimension p is not small. A straightforward idea is to split $p(u,z)$ as $p(z\mid u)\,p(u)$, instead of estimating the joint density $p(u,z)$ as in Zhao and Shao (2015). Since $p(u)$ does not involve θ, the likelihood in (4) is equivalent to (5) $\ell(\theta)=\prod_{i:\,r_i=1}\frac{f(y_i\mid u_i,z_i;\theta)}{\int f(y_i\mid u_i,z;\theta)\,p(z\mid u_i)\,dz}$, and our main task is to estimate the denominator in (5), i.e., the integral (6) $\int f(y\mid u,z;\theta)\,p(z\mid u)\,dz=E\{f(y\mid u,Z;\theta)\mid U=u\}$ for arbitrarily given θ, y and u. Since the real form of $p(z\mid u)$ is not our main concern, for robustness, we adopt a nonparametric kernel regression, called Nadaraya-Watson (NW) estimation (Nadaraya, 1964; Watson, 1964), to estimate the conditional expectation in (6). The way we split $p(u,z)$, along with the nonparametric estimation, actually frees us from parameterising or modelling the relationship between Z and U.

Before building our estimators, we introduce a generic notation K for a kernel with an appropriate dimension and bandwidth h; i.e., the K appearing in different places may be different. In what follows K is chosen to be a product kernel of dimension s and order m in the sense that $K(x)=\prod_{j=1}^{s}\kappa(x_j)$, where $x_j$ is the jth component of the s-dimensional x and κ is a bounded and Lipschitz continuous univariate kernel having a compact support and satisfying $\int\kappa(t)\,dt=1$, $\int t^{m}\kappa(t)\,dt$ is finite and nonzero, and $\int t^{l}\kappa(t)\,dt=0$ for all 0<l<m. The NW estimator of the conditional expectation in (6) is (7) $\hat\gamma(y,u;\theta)=\frac{\sum_{j=1}^{n}K\{(u_j-u)/h\}\,f(y\mid u,z_j;\theta)}{\sum_{j=1}^{n}K\{(u_j-u)/h\}}$. Substituting this estimator into (5) leads to a maximum pseudo likelihood estimator of θ, (8) $\hat\theta=\arg\max_{\theta}\prod_{i:\,r_i=1}\frac{f(y_i\mid u_i,z_i;\theta)}{\hat\gamma(y_i,u_i;\theta)}$. Although p, the dimension of U, does not exert a direct influence on the convergence rate of $\hat\theta$ as shown in Theorem 2.1, it is well known in the literature that kernel estimators do not perform well when p is very large. For example, their convergence requires a very large sample size n.
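To make (7) and (8) concrete, here is a minimal Python sketch. It is an illustration under assumptions of our own: a normal working model f(y | u, z; θ) with θ = (b0, bu, bz, log σ), a Gaussian product kernel as in Section 3, and a scalar instrument Z; none of the function names below come from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def f_cond(y, u, z, theta):
    # Hypothetical parametric density f(y | u, z; theta): N(b0 + u'bu + bz*z, sigma^2)
    theta = np.asarray(theta, float)
    b0, bu, bz, log_sig = theta[0], theta[1:-2], theta[-2], theta[-1]
    return norm.pdf(y, loc=b0 + u @ bu + bz * z, scale=np.exp(log_sig))

def nw_gamma(y, u, U, Z, theta, h):
    # NW estimator (7) of gamma(y, u; theta) = E{ f(y | u, Z; theta) | U = u };
    # the sum runs over all n subjects because U and Z are always observed.
    w = np.exp(-0.5 * np.sum(((U - u) / h) ** 2, axis=1))
    fvals = np.array([f_cond(y, u, zj, theta) for zj in Z])
    return np.sum(w * fvals) / np.sum(w)

def neg_pseudo_loglik(theta, Y, U, Z, R, h):
    # Negative pseudo log-likelihood (12): sum over respondents (r_i = 1) of
    # log f(y_i | u_i, z_i; theta) - log gamma_hat(y_i, u_i; theta).
    total = 0.0
    for i in np.where(R == 1)[0]:
        total -= np.log(f_cond(Y[i], U[i], Z[i], theta))
        total += np.log(nw_gamma(Y[i], U[i], U, Z, theta, h))
    return total

# Usage (data Y, U, Z, R and a starting value theta0 as in Section 3):
# theta_hat = minimize(neg_pseudo_loglik, theta0, args=(Y, U, Z, R, h)).x
```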

Fortunately, there is a well-developed nonparametric method called sufficient dimension reduction (SDR) (Cook & Weisberg, 1991; Li, 1991; Li & Wang, 2007; Ma & Zhu, 2012; Xia, Tong, Li, & Zhu, 2002), which helps to reduce the dimension of predictors by finding a $p\times d$ matrix B with the smallest possible d such that $p(z\mid u)=p(z\mid B^{\top}u)$, meaning that Z is only related to U via $B^{\top}U$, where $B^{\top}$ is the transpose of B. It is common that d<p, and we are able to improve the estimation in (7) and (8) by applying NW estimation directly to Z and $\hat B^{\top}U$, with $\hat B$ being an SDR estimator of B. Starting from sliced inverse regression (SIR) (Li, 1991), many SDR estimators of B have been developed in the literature, including sliced average variance estimation (SAVE) (Cook & Weisberg, 1991), directional regression (DR) (Li & Wang, 2007), (conditional) minimum average variance estimation (MAVE) (Xia et al., 2002), the semiparametric approach to dimension reduction (Ma & Zhu, 2012), etc. We adopt the SIR method developed by Li (1991) to estimate B in our simulation studies in Section 3 because it is easy to implement and works well in practice; a minimal sketch of SIR in this setting is given below.
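The sketch below implements plain SIR for estimating B in p(z | u) = p(z | B'u), with Z treated as the sliced variable; the slice count, the assumption of a univariate Z, and all names are our illustrative choices, not the paper's.

```python
import numpy as np

def sir(U, Z, d, n_slices=10):
    # Returns a p x d estimate B_hat of an SDR basis for Z given U (Li, 1991).
    n, p = U.shape
    L = np.linalg.cholesky(np.cov(U, rowvar=False))
    inv_sqrt = np.linalg.inv(L).T                    # maps U to unit covariance
    Ustd = (U - U.mean(axis=0)) @ inv_sqrt
    # Average the standardized U within slices of the ordered Z values
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Z), n_slices):
        m = Ustd[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading d eigenvectors of M, mapped back to the original scale of U
    _, vec = np.linalg.eigh(M)                       # eigenvalues in ascending order
    return inv_sqrt @ vec[:, ::-1][:, :d]

# Usage: B_hat = sir(U, Z, d=1); NW estimation is then applied to B_hat.T @ u as in (9).
```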

When $p(z\mid u)=p(z\mid B^{\top}u)$, the integral in (6) is equal to $E\{f(y\mid u,Z;\theta)\mid B^{\top}U=B^{\top}u\}$, which can be estimated by a new kernel estimator $\tilde\gamma(y,u;\theta)=\frac{\sum_{j=1}^{n}K\{\hat B^{\top}(u_j-u)/h\}\,f(y\mid u,z_j;\theta)}{\sum_{j=1}^{n}K\{\hat B^{\top}(u_j-u)/h\}}$. Then, a new maximum pseudo likelihood estimator of θ is (9) $\tilde\theta=\arg\max_{\theta}\prod_{i:\,r_i=1}\frac{f(y_i\mid u_i,z_i;\theta)}{\tilde\gamma(y_i,u_i;\theta)}$. In the rest of this section we establish asymptotic consistency and normality of our proposed estimators $\hat\theta$ in (8) and $\tilde\theta$ in (9). We first introduce some notation. Let $f_\theta$ and $\gamma_\theta$ be the first derivatives of f and γ with respect to θ. For the estimator without SDR, denote (10) $\gamma(y,u;\theta)=E\{f(y\mid u,Z;\theta)\mid U=u\}$, and the kernel estimator of γ is denoted by (11) $\hat\gamma(y,u;\theta)$, as given in (7). Therefore, maximising the pseudo likelihood in (8) is the same as maximising the pseudo log-likelihood (12) $\hat\ell(\theta)=\sum_{i=1}^{n}r_i\{\log f(y_i\mid u_i,z_i;\theta)-\log\hat\gamma(y_i,u_i;\theta)\}$. Differentiating with respect to θ, we obtain the score function (13) $\hat s(\theta)=\sum_{i=1}^{n}r_i\Big\{\frac{f_\theta(y_i\mid u_i,z_i;\theta)}{f(y_i\mid u_i,z_i;\theta)}-\frac{\hat\gamma_\theta(y_i,u_i;\theta)}{\hat\gamma(y_i,u_i;\theta)}\Big\}$. Thus, $\hat\theta$ can also be obtained by solving $\hat s(\theta)=0$, as illustrated below.
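As a companion to the previous sketch, θ can be estimated by solving the score equation; here a numerical gradient of the pseudo log-likelihood (12) stands in for the analytic score (13), and neg_pseudo_loglik is the illustrative helper defined above.

```python
import numpy as np
from scipy.optimize import approx_fprime, root

def score_hat(theta, Y, U, Z, R, h):
    # Numerical version of (13): gradient of the pseudo log-likelihood at theta.
    return -approx_fprime(np.asarray(theta, float), neg_pseudo_loglik,
                          1e-6, Y, U, Z, R, h)

# Usage: theta_hat = root(score_hat, theta0, args=(Y, U, Z, R, h)).x
```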

We use the same notation γ as in (10) for the estimator with SDR, (14) $\gamma(y,u;\theta)=E\{f(y\mid u,Z;\theta)\mid B^{\top}U=B^{\top}u\}$. Its kernel estimator is (15) $\tilde\gamma(y,u;\theta)$, as defined above. The pseudo log-likelihood and score function of $\tilde\theta$ can be obtained by replacing $\hat\gamma$ in (12) and (13) by $\tilde\gamma$.

As the structures of the two estimators are alike, we only provide the sufficient conditions for $\hat\theta$ in Assumptions 2.1–2.2, where γ is defined as in (10). For $\tilde\theta$, however, Assumptions 2.1–2.2 need to be modified by replacing (10) by (14), (11) by (15) and p by d. In addition, in order for $\tilde\theta$ to converge with rate $\sqrt{n}$, $\hat B$ should also possess some asymptotic properties, as indicated in Assumption 2.3 with γ defined in (14). In fact, the SDR estimator $\hat B$ and the kernel estimators $\hat\gamma$ and $\tilde\gamma$ all have good asymptotic performance (Bierens, 1987; Li, 1991; Nadaraya, 1964; Watson, 1964).

Throughout, $\theta_0$ denotes the true but unknown value of θ.

Assumption 2.1

There exists a constant c>0 serving as a lower bound for the quantities involved (see the proof of Theorem 2.1). The function $\gamma(y,u;\theta)$ has a bounded mth derivative with respect to u. There exists a q>1 such that the associated qth moment is finite and the corresponding density is bounded.

Let $\|\cdot\|_\infty$ be the sup-norm. Under Assumption 2.1, $\|\hat\gamma-\gamma\|_\infty$ and $\|\hat\gamma_\theta-\gamma_\theta\|_\infty$ converge to zero in probability, which means that the estimator converges uniformly to γ at a certain rate (Hansen, 2008; Newey & McFadden, 1994).

Assumption 2.2

  (i) There exists a suitable envelope condition under which the pseudo log-likelihood $\hat\ell(\theta)$, defined in (12), satisfies a uniform law of large numbers.

  (ii) For all θ in a small enough neighbourhood of $\theta_0$, the relevant functions of f and γ are bounded as functions of y and u.

  (iii) The score $\hat s(\theta)$, defined in (13), is continuously differentiable in θ on a small enough neighbourhood of $\theta_0$, and the limiting matrix of its derivative exists and is nonsingular.

  (iv) The bandwidth rate conditions hold as $n\to\infty$.

Assumptions 2.1–2.2 together guarantee that $\hat\theta$ is consistent for $\theta_0$. As to $\tilde\theta$, there is an extra step of SDR estimation requiring Assumption 2.3.

Assumption 2.3

Let Ω be a suitably restricted set of arguments (for some fixed constants), where the dimension of z is taken into account, and γ is as defined in (14).

  (i) Uniformly in Ω, the mth derivatives of the relevant functions are locally Lipschitz-continuous. Moreover, each entry of the matrices involved is locally Lipschitz-continuous and bounded from above as a function of its arguments.

  (ii) $\hat B$ converges to B in probability as $n\to\infty$ and admits the stated asymptotic expansion.

  (iii) The bandwidth used in the SDR-based kernel estimation satisfies the corresponding rate conditions.

Note that B is a $p\times d$ matrix except when d=1, and when it comes to the calculation of derivatives or norms, we are treating B as $\mathrm{vec}(B)$, which denotes the vector formed by concatenating the columns of B. For simplicity, we are still using B to represent $\mathrm{vec}(B)$. Similarly, we denote $\mathrm{vec}(\hat B)$ as $\hat B$.

Theorem 2.1

  (i) If Assumptions 2.1–2.2 hold, then, as $n\to\infty$, $\hat\theta\to\theta_0$ in probability and $\sqrt{n}\,(\hat\theta-\theta_0)\xrightarrow{d}N(0,\Sigma)$ for an asymptotic covariance matrix Σ, where $\xrightarrow{d}$ means convergence in distribution.

  (ii) If Assumptions 2.1–2.2 hold with (10), (11) and p replaced by (14), (15) and d, respectively, and if Assumption 2.3 also holds, then, as $n\to\infty$, $\tilde\theta\to\theta_0$ in probability and $\sqrt{n}\,(\tilde\theta-\theta_0)\xrightarrow{d}N(0,\tilde\Sigma)$ for an asymptotic covariance matrix $\tilde\Sigma$.

It follows from Theorem 2.1 that the asymptotic variance of $\tilde\theta$ may or may not be smaller than that of $\hat\theta$, partly due to the variability of SDR estimation. Owing to SDR, the dimension of kernel estimation is reduced from p to d, which is the main advantage of $\tilde\theta$ over $\hat\theta$, although they have the same convergence rate. A problem with a not so small p is the selection of the bandwidth h for kernel estimation. Specifically, if $h=n^{-\tau}$ with a τ>0, then from Assumptions 2.1 and 2.2(iv), τ must lie between a lower bound of order $1/(4m)$ and an upper bound that decreases in p, where m is the order of the kernel. When p is not so small, the upper bound is pretty small, and we may need to increase the kernel order m in order to find a τ satisfying the constraints. If m>2, the kernel takes negative values, which may reduce the stability of kernel estimation. For example, if q=2 in Assumption 2.1, then we must have $1/(4m)<\tau<1/(2p)$, so we cannot use m=2 when p>3; if p=8, then we must use a kernel of order m=5. On the other hand, if we reduce U to $\hat B^{\top}U$ with dimension d, the upper bound increases to $1/(2d)$, allowing enough flexibility in choosing τ to be larger than the lower bound $1/(4m)$, which results in a low-order kernel. If q=2, then we may use m=2 when $d\le 3$.

Although both $\hat\theta$ and $\tilde\theta$ have convergence rate $\sqrt{n}$, $\tilde\theta$ may still have an edge over $\hat\theta$ in finite sample performance, because of the smaller dimension and kernel order used in the kernel estimation. This is supported by the simulation results in Section 3.

3. Simulation studies

We study the finite-sample performance of the proposed pseudo likelihood estimators in two simulation studies. Four estimators are compared: the estimator based on the full data assuming no missing data ($\hat\theta_{\mathrm F}$), the estimator based on complete-case analysis ($\hat\theta_{\mathrm{CC}}$), the maximum pseudo likelihood estimator without dimension reduction ($\hat\theta$ defined in (8)), and the maximum pseudo likelihood estimator with dimension reduction ($\tilde\theta$ defined in (9)). The NW estimators are computed using the Gaussian kernel $K(t)=(2\pi)^{-1/2}\exp(-t^2/2)$. The bandwidths are selected by the rule of thumb proposed by Silverman (1986), i.e., $h=1.06\,\hat\sigma_U\,n^{-1/5}$, where $\hat\sigma_U$ is the estimated standard deviation of U. For SDR, we apply SIR (Li, 1991) with a fixed number of slices. All results are based on 1000 simulation replications and sample size n=400.
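For concreteness, the smoothing choices just described can be coded as follows; applying Silverman's rule coordinate-wise to U is our illustrative simplification of "the estimated standard deviation of U".

```python
import numpy as np

def gaussian_kernel(t):
    # Standard Gaussian kernel K(t) = (2*pi)^(-1/2) * exp(-t^2 / 2)
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def silverman_bandwidth(U):
    # Rule of thumb: h_j = 1.06 * sd(U_j) * n^(-1/5) for each coordinate j of U
    return 1.06 * U.std(axis=0, ddof=1) * U.shape[0] ** (-0.2)
```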

3.1. Experiment 1

In the first experiment, the covariate U has dimension p=6. To evaluate the performance of SDR, two models of $p(z\mid u)$ are studied, with d equal to 1 and 2, respectively. The response Y is generated from the working model $f(y\mid u,z;\theta)$ at a fixed true value $\theta_0$. The propensity depends on Y but not on Z, as in (2), resulting in unconditional response rates 73% and 74% in the two cases, respectively. Table 1 reports the estimators and their standard deviations.
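Because the exact simulation distributions are not recoverable from this copy, the following sketch only mimics the general shape of experiment 1 (p=6, a d=1 instrument model, and a response indicator depending on Y but not Z, as required by (2)); every distributional and numeric choice in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2018)
n, p = 400, 6
U = rng.normal(size=(n, p))                      # assumed: U ~ N(0, I_6)
B = np.zeros(p)
B[:2] = 1.0 / np.sqrt(2.0)                       # illustrative d = 1 direction
Z = U @ B + 0.5 * rng.normal(size=n)             # Z related to U only via B'U
Y = 1.0 + 0.5 * U[:, 0] + Z + rng.normal(size=n)             # illustrative working model for Y
R = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * Y))))  # nonignorable: R depends on Y
# Y is treated as missing when R == 0; U and Z remain fully observed.
```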

Table 1. Simulation results of experiment 1.

The simulation results can be summarised as follows. First, $\tilde\theta$ is nearly unbiased for θ no matter whether d equals 1 or 2. Second, the bias of $\hat\theta$ may not be negligible and, in some cases, is comparable to that of $\hat\theta_{\mathrm{CC}}$, which is biased in theory. As we discussed in Section 2, compared with $\tilde\theta$, the assumptions are harder to satisfy for $\hat\theta$. Third, the standard deviations of $\hat\theta$ and $\tilde\theta$ are quite close and, in some cases, $\tilde\theta$ has a slightly smaller standard deviation than $\hat\theta$.

3.2. Experiment 2

In the second experiment, we consider a binary outcome Y. The covariate U is generated in the same way as in the first experiment, and the two models of $p(z\mid u)$, with d=1 or d=2, are also the same. The binary Y is generated according to a parametric binary-response model, and the propensity again depends on Y but not on Z, resulting in unconditional response rates 71% or 72%. The simulation results are reported in Table 2. The conclusions are quite similar to those for experiment 1, but the bias of $\hat\theta$ is more serious: in some cases it is even larger than the bias of $\hat\theta_{\mathrm{CC}}$. Meanwhile, $\tilde\theta$ is still nearly unbiased.

Table 2. Simulation results of experiment 2.

4. Proofs

Proof of Theorem 2.1

We first prove part (i). Here, γ is defined as in (10) and $\hat\gamma$ is defined as in (11). Note that, in Assumption 2.1, we assume that there exists a constant c>0 serving as a lower bound for the quantities involved. Moreover, the univariate kernel κ is bounded; hence, there exists a constant b>0 such that $\kappa(t)\le b$ for all t.

For the consistency of $\hat\theta$, we only need to verify the assumptions of Theorem 2(a) in Zhao and Shao (2015). Note that Assumptions 2.1 and 2.2(i), (iv) guarantee that $\hat\gamma$ is consistent, and the uniform convergence of $\hat\gamma$ then yields the required conditions in probability as $n\to\infty$. Therefore, following Theorem 2 in Zhao and Shao (2015), with Assumption 2.2(i), $\hat\theta\to\theta_0$ in probability as $n\to\infty$, where $\theta_0$ is the true value of θ.

For the asymptotic normality of $\hat\theta$, we only need to verify the assumptions of Theorem 8.11 in Newey and McFadden (1994).

Step 1. We would like to prove the first condition of that theorem; it reduces to the bound (16), which follows from Assumptions 2.1 and 2.2(ii).

Step 2. We would like to prove two negligibility results. For the first, the V-statistic method is applied: we use the result of Lemma 8.4 of Newey and McFadden (1994) directly, and, after decomposing the resulting V-statistic, Assumption 2.2(ii) and (iv) give the desired bound.

As to the second result, it follows from Chebyshev's inequality together with Assumptions 2.1 and 2.2(ii).

Step 3. We would like to prove the remaining convergence. By Chebyshev's inequality, it suffices to establish two limits. The first follows from Assumption 2.2(ii) and (iv). For the second, note that the relevant terms are bounded by a constant C when n is large enough; since the integrand is bounded, the conditional dominated convergence theorem, followed by the dominated convergence theorem, gives the result.

Then, combining Steps 1–3 and applying Theorems 8.11 and 8.12 in Newey and McFadden (1994) together with Assumption 2.2(iii), we obtain the asymptotic normality of $\hat\theta$ stated in part (i).

Next, we prove part (ii). Here, γ is defined as in (14) and $\tilde\gamma$ is defined as in (15). We consider the normality of $\tilde\theta$. When $\hat B$ is replaced by the true SDR direction B, the proof is exactly the same as in part (i). Now we only need to consider the effect of estimating B. Based on Lemma 3 in Ma and Zhu (2012), with Assumption 2.3(i) and (iii), this effect is asymptotically negligible, and, with Assumption 2.3(ii), we can easily prove the normality of $\tilde\theta$. This finishes the proof of part (ii).

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by Division of Mathematical Sciences [1612873] and the Chinese Ministry of Education 111 Project [B14019].

Notes on contributors

Ji Chen

Ji Chen is a PhD candidate at East China Normal University.

Bingying Xie

Bingying Xie is a statistician at Roche in Shanghai, China.

Jun Shao

Jun Shao is a professor at the University of Wisconsin-Madison.

References

  • Baker, S. G., & Laird, N. M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical Association, 83(401), 62–69. doi: 10.1080/01621459.1988.10478565
  • Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in econometrics: Fifth world congress (Vol. 1, pp. 99–144).
  • Cook, R. D., & Weisberg, S. (1991). Comment. Journal of the American Statistical Association, 86(414), 328–332.
  • Greenlees, J. S., Reece, W. S., & Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association, 77(378), 251–261. doi: 10.1080/01621459.1982.10477793
  • Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24(3), 726–748. doi: 10.1017/S0266466608080304
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. doi: 10.1080/01621459.1991.10475035
  • Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. doi: 10.1198/016214507000000536
  • Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). New York: John Wiley & Sons.
  • Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. doi: 10.1080/01621459.2011.646925
  • Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. doi: 10.1137/1109020
  • Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. doi: 10.1016/S1573-4412(05)80005-4
  • Qin, J., Leung, D., & Shao, J. (2002). Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457), 193–200. doi: 10.1198/016214502753479338
  • Robins, J., & Ritov, Y. (1997). Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models. Statistics in Medicine, 16(3), 285–319. doi: 10.1002/(SICI)1097-0258(19970215)16:3<285::AID-SIM535>3.0.CO;2-#
  • Silverman, B. W. (1986). Density estimation for statistics and data analysis (Vol. 26). New York: CRC Press.
  • Tang, G., Little, R. J., & Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4), 747–764. doi: 10.1093/biomet/90.4.747
  • Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3), 1097–1116.
  • Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4), 359–372.
  • Xia, Y., Tong, H., Li, W., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 363–410. doi: 10.1111/1467-9868.03411
  • Zhao, J., & Shao, J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512), 1577–1590. doi: 10.1080/01621459.2014.983234
