DISCUSSION PAPER

Statistical inference for nonignorable missing-data problems: a selective review

Pages 105-133 | Received 22 Aug 2018, Accepted 08 Sep 2018, Published online: 05 Oct 2018

ABSTRACT

Nonignorable missing data are frequently encountered in various settings, such as economics, sociology and biomedicine. We review statistical inference for nonignorable missing-data problems, including estimation, influence analysis and model selection. For estimation of mean functionals, we review a semiparametric method and an empirical likelihood (EL) approach. For estimation of parameters in exponential family nonlinear structural equation models, we introduce the expectation-maximisation (EM) algorithm, a Bayesian approach and a Bayesian EL method. For influence analysis, we investigate the case-deletion method and the local influence analysis method from the frequentist and Bayesian viewpoints. For model selection, we present a modified Akaike information criterion and a penalised method.

1. Introduction

In many applications, such as economics, sociology and biomedicine, some subjects may have missing responses or predictors for various reasons, such as study drop-out, subjects' unwillingness to answer some items on a questionnaire, or loss of information caused by uncontrollable factors, among others. Statistical inference for missing-data problems is quite challenging.

Over the past years, various statistical methods have been developed to handle missing data based on three different missing-data mechanisms (Little & Rubin, Citation2002): missing completely at random (MCAR), where missingness does not depend on any observed or unobserved data; missing at random (MAR), where missingness depends only on the observed data; and missing not at random (MNAR), where missingness is associated with the unobserved data, perhaps in addition to the observed data, and which is the most difficult to handle. Generally, MCAR and MAR are called ignorable missingness, while MNAR is termed nonignorable missingness.

The most commonly used method in missing data analysis is complete-case analysis (i.e., statistical inference is performed only on the completely observed data). However, this method may be biased or inefficient when the missingness mechanism is not MCAR. Another popular approach to handling missing data is imputation, including single imputation such as hot-deck imputation (a missing value is imputed from a randomly selected similar record), cold-deck imputation, mean imputation (a missing value is imputed with the mean of that variable over all other observed cases), which has some attractive properties for univariate analysis but becomes problematic for multivariate analysis in that it attenuates the correlations between the imputed variable and the other variables, and regression imputation (a fitted value from the considered regression model is used to impute the missing value), as well as multiple imputation (MI) and the expectation-maximisation (EM) method (Dempster, Laird, & Rubin, Citation1977). In particular, Horvitz and Thompson's (Citation1952) inverse probability weighting (IPW) method is also popular; although it is unbiased under MAR, it may be unstable when some of the inverse probabilities are quite small, and it may suffer from efficiency loss because incomplete cases are omitted. To incorporate additional information from incomplete subjects, Robins, Rotnitzky, and Zhao (Citation1994) proposed the augmented inverse probability weighted (AIPW) method, which is also referred to as the propensity score (PS) weighting approach. Most existing work on the PS method focuses on the MAR assumption. When the missing data are MNAR, standard PS methods developed for ignorable missing data may not work because missing variables in the PS model are not always observed, and PS analysis with nonignorable missing data is quite challenging. To this end, many modified PS methods have been proposed over the past years; for example, see Hansen (Citation1982), Chang and Kott (Citation2008), Kim and Shao (Citation2013), Riddles (Citation2013), Wang, Shao, and Kim (Citation2014), Riddles, Kim, and Im (Citation2016), and Jiang, Zhao, and Tang (Citation2016).
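
To fix ideas, the following minimal simulation sketch (with hypothetical parameter values, not taken from any of the cited papers) contrasts the complete-case mean with the Horvitz-Thompson IPW mean under MAR; for simplicity the true response probabilities are used, whereas in practice they would be estimated from a posited propensity model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate (x, y) and a MAR response indicator: P(delta=1 | x) is logistic in x.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))   # true response probabilities
delta = rng.binomial(1, pi)                   # 1 = observed, 0 = missing

# Complete-case mean: biased under MAR because missingness depends on x.
cc_mean = y[delta == 1].mean()

# Horvitz-Thompson IPW mean: weight each observed case by 1/pi(x).
ipw_mean = np.mean(delta * y / pi)

print("true E(y) = 1.000")
print(f"complete-case = {cc_mean:.3f}")   # noticeably biased
print(f"IPW           = {ipw_mean:.3f}")  # approximately unbiased
```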

Model-based approaches provide additional tools for handling missing data, including 'testing' missing-data types and estimating parameters in the considered models. The widely used methods include the maximum likelihood (ML) or quasi-likelihood method, which directly maximises the likelihood or quasi-likelihood of the observed data, and the EM algorithm for evaluating the ML estimates (MLEs) from the complete-data likelihood when the likelihood or quasi-likelihood function is available; the latter has been extended to various statistical models, for instance, see Laird and Ware (Citation1982), Ibrahim (Citation1990), Ibrahim, Chen, and Lipsitz (Citation2001), Lee and Zhu (Citation2002), Zhao and Tang (Citation2015), and Tang and Tang (Citation2018).

Also, the empirical likelihood (EL) method, proposed by Owen (Citation1988), is widely applied to missing data analysis due to its attractive properties; for example, it has sampling properties similar to the bootstrap, but without resampling. There is a considerable literature on missing data analysis using the EL method, including estimation of mean functionals and model parameters. For instance, see Cheng (Citation1994), Wang and Rao (Citation2002), Liang, Wang, and Carroll (Citation2007), Stute, Xue, and Zhu (Citation2007), Xue (Citation2009), Qin, Zhang, and Leung (Citation2009), Wang and Chen (Citation2009), Tang and Zhao (Citation2013), Wang and Qin (Citation2010), Chen and Kim (Citation2017), Zhao, Tang, and Tang (Citation2013), Zhao, Zhao, and Tang (Citation2013), and Tang, Zhao, and Zhu (Citation2014).

The Bayesian estimation method is also often used to handle the estimation problem of parametric models with missing data, for the following reasons. First, the posterior distributions of parameters, nonparametric functions and missing data can be estimated using the well-known Markov chain Monte Carlo (MCMC) methods, such as the Gibbs sampler (Geman & Geman, Citation1984) and the Metropolis-Hastings (MH) algorithm (Hastings, Citation1970; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, Citation1953). Second, the Bayesian method with missing data only requires an additional step in the Gibbs sampler compared with no-missing-data settings; thus, Bayesian methods can easily accommodate missing data without requiring new techniques for statistical inference (Ibrahim, Chen, Lipsitz, & Herring, Citation2005). Third, prior information can be directly incorporated into the analysis, which yields more precise estimates of parameters when good prior information is available. Fourth, sampling-based Bayesian methods do not depend on asymptotic theory and may give more reliable statistical inference even with small sample sizes (Lee & Tang, Citation2006a). There is a large literature on Bayesian analysis for missing data problems; for example, see Ibrahim et al. (Citation2005), Lee and Tang (Citation2006b), Lee and Tang (Citation2006c), Daniels, Wang, and Marcus (Citation2014), Linero and Daniels (Citation2015), Tang, Chow, Ibrahim, and Zhu (Citation2017), and Zhang and Tang (Citation2017).

The aforementioned model-based methods depend heavily on the missing-data mechanism and modelling assumptions, so the resulting estimators may be sensitive to these assumptions. Hence, sensitivity analysis for missing data problems, which is usually used to investigate the effect of the data and model assumptions on estimation outputs, has received considerable attention over the past years. To address this issue, there are two major approaches: the case-deletion method (Cook, Citation1977) and local influence analysis (Cook, Citation1986), which have been extended to various statistical models in the absence of missing data (e.g., see Cook & Weisberg, Citation1982; Wei, Citation1998). The two approaches have also been extended to missing data problems; for example, see Lavine (Citation1991), Berger (Citation1994), Dey, Ghosh, and Lou (Citation1996), Gustafson (Citation1996a, Citation1996b), Zhu and Lee (Citation2001), Verbeke, Molenberghs, Thijs, Lesaffre, and Kenward (Citation2001), Jansen, Molenberghs, Aerts, Thijs, and Van Steen (Citation2003), Troxel, Ma, and Heitjan (Citation2004), Millar and Stewart (Citation2007), Daniels and Hogan (Citation2008), Shi, Zhu, and Ibrahim (Citation2009), Zhu, Ibrahim, Cho, and Tang (Citation2012), and Zhu, Ibrahim, and Chen (Citation2015). In particular, Zhu, Ibrahim, and Tang (Citation2011) developed a unified approach to Bayesian influence analysis for simultaneously assessing the effect of various perturbations to the data, the prior and the sampling distribution for a class of statistical models. Several extensions of Zhu, Ibrahim, and Tang (Citation2011) have been considered in recent years; for example, see Kaciroti and Raghunathan (Citation2014), Zhu, Ibrahim, and Tang (Citation2014), and Zhang and Tang (Citation2017).

However, in practical applications, the assumed missing-data mechanism models are not testable from the data because of the missing values involved. To this end, various approaches have been developed to select, among the candidate models, the model that fits the data best. This is a growing, active and open research field. The commonly used approaches to selecting the best-fitting model include stepwise regression and best subset selection combined with Akaike's information criterion (AIC) (Akaike, Citation1973) and the Bayesian information criterion (BIC) (Schwarz, Citation1978) in the frequentist framework, which depend heavily on the observed-data likelihood. In the presence of missing data, it is rather difficult to obtain the observed-data likelihood because of the intractable multiple integration involved. In this case, many model selection methods have been developed for missing data based on the EM algorithm or imputation techniques; for example, see Hens, Aerts, and Molenberghs (Citation2006), Ibrahim, Zhu, and Tang (Citation2008), Shen and Chen (Citation2012, Citation2013), Long and Johnson (Citation2015), and Jiang, Nguyen, and Rao (Citation2015).

When the number of predictors is large, the aforementioned model selection strategies often suffer from instability and a heavy computational burden in the presence of missing data. To overcome these difficulties, various regularisation methods have been proposed over the past years to simultaneously estimate parameters and select important predictors; for example, see Fang and Shao (Citation2016), and Tang and Tang (Citation2018).

The rest of this paper is organised as follows. Section 2 introduces missing data mechanisms. Section 3 reviews two approaches to estimating mean functionals in the presence of nonignorably missing response data. Section 4 investigates the parameter estimation problem in exponential family nonlinear structural equation models (EFNSEMs) with nonignorable missing data, including the EM algorithm, the Bayesian method and the Bayesian EL approach. Section 5 introduces influence analysis methods in the presence of responses missing not at random. Section 6 discusses the variable selection problem in the presence of responses missing not at random. Some concluding remarks are given in Section 7.

2. Missing data mechanism

In incomplete data analysis, the first problem one faces is how to deal with missing data. Most early work on missing data suggested using the complete and available cases, i.e., omitting the incompletely observed subjects, to make statistical inference. This view essentially assumes that subjects with missing values are just a random subset of the complete sample of subjects. However, this is unreasonable when the data exhibit a dependence structure. The issue did not attract researchers' attention until Rubin's (Citation1976) pioneering work. Rubin (Citation1976) suggested that each incomplete data set has its own probability-generating mechanism, under which each observation is missing or observed with some particular probability given the data. This view naturally admits that the 'true' values of the complete data exist, but some values are observed while the others are missing for uncertain reasons. Thus, modelling such uncertainty is a key task in missing data analysis. To efficiently quantify the uncertainty of missing values, Rubin (Citation1976) first defined three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In accordance with Little and Rubin (Citation2002), missing data are 'MCAR' if the failure to observe a value does not depend on any observed or missing values; missing data are 'MAR' if the failure to observe a value depends only on the observed values rather than on any unobserved values; missing data are 'MNAR' if the failure to observe a value depends on the value that would have been observed. This modelling strategy does not attempt to explore the reasons leading to missingness (in fact, accurately describing all the potential causes of missingness is not realistic). Instead, it adopts a mathematical device to describe the rates and patterns of missing values and to roughly capture possible relationships between the missingness and the values of the missing variables themselves. In this sense, it is consistent with the statistical adage 'learning from what the data speak'. To understand the definition of missing data mechanisms comprehensively, a formal mathematical formulation is given as follows.

Let X be a vector of explanatory variables and Y be a response variable, and let $\{(x_i, y_i): i = 1, \ldots, n\}$ be n independent realisations of continuous random variables from a joint distribution $f(x, y)$. Here, we assume that $x_i$ is always observed but $y_i$ is subject to missingness. Let $\delta_i$ be the missing data indicator for $y_i$, that is, $\delta_i = 0$ if $y_i$ is missing, and $\delta_i = 1$ otherwise. Throughout this paper, we assume that $\delta_i$ is a Bernoulli random variable with the respondent probability $\pi(x_i, y_i) = \Pr(\delta_i = 1 \mid x_i, y_i)$, where $\pi(x, y)$ is referred to as the propensity score of the missing response, and $\delta_i$ is independent of $\delta_j$ for any $i \neq j$. Thus, we assume that the missing data mechanism for $y_i$ is specified by $\pi(x_i, y_i)$. When the $\delta_i$'s are unrelated to the data $(x_i, y_i)$, the missing data mechanism is called 'MCAR', which is the simplest missing data mechanism. Under the MCAR mechanism, subjects who have missing data are just a random subset of the complete sample of subjects, and the widely used technique for handling missing data is complete-case analysis, which yields unbiased conclusions. When the $\delta_i$'s depend only on $x_i$ rather than $y_i$ (i.e., $\pi(x_i, y_i) = \pi(x_i)$), the missing data mechanism is termed 'MAR', which is a commonly used assumption in the missing data literature. Under the MAR assumption, complete-case analysis generally yields biased results; thus, some new methods have been developed to handle missing data, for example, see Cheng (Citation1994) and Wang and Rao (Citation2002). When the $\delta_i$'s depend on $y_i$, perhaps in addition to $x_i$, the missing data mechanism is referred to as MNAR, which is also called nonignorable. MNAR data analysis is rather challenging because of the missing values involved, and using MAR analysis methods to make inference on MNAR data may lead to biased results. To address this issue, some new methods for handling MNAR data have been proposed in recent years; for example, see Qin, Leung, and Shao (Citation2002), Kim and Yu (Citation2011) and Tang et al. (Citation2014).
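
The distinction between the three mechanisms can be made concrete with a small simulation; the logistic propensities below are hypothetical choices for illustration only. Note that under MNAR the propensity depends on y itself, which is exactly what makes it inestimable from the observed data without further assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n)

def complete_case_mean(prob):
    """Mean of y over observed cases when delta ~ Bernoulli(prob)."""
    delta = rng.binomial(1, prob)
    return y[delta == 1].mean()

expit = lambda t: 1.0 / (1.0 + np.exp(-t))

print("true E(y) = 1.000")
print(f"MCAR: {complete_case_mean(np.full(n, 0.7)):.3f}")  # unbiased
print(f"MAR : {complete_case_mean(expit(0.5 + x)):.3f}")   # complete-case mean is biased
print(f"MNAR: {complete_case_mean(expit(0.5 - y)):.3f}")   # biased; propensity depends on y itself
```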

Under some mild conditions, Rubin (Citation1976) showed that statistical inferences based on the observed data together with the missing-data information specified by MCAR and/or MAR are the same as those based on the observed data alone; hence, the MCAR and MAR missing data mechanisms are called ignorable. Under an MNAR missing data mechanism, efficient inferences usually require correctly specifying the missing data mechanism model, the distribution of the data, or both; thus, the resulting conclusions may be sensitive to these assumptions. Therefore, correctly specifying the missing data mechanism is rather important in MNAR data analysis, but it is quite difficult in practical applications because the missing data mechanism is not testable.

3. Estimation of mean functionals with MNAR

In this section, our main interest is to estimate the mean functional $\theta = E(Y)$ in the presence of a response that is missing not at random.

3.1. Semiparametric estimation

In the nonignorable missing data literature, the missing data indicator is usually formulated by a fully parametric model, such as a logistic, probit or log-linear model, based on the assumption that the covariates have a linear effect on some function of the respondent probability for the unobserved $y_i$. In practice, this assumption may be inappropriate in some settings because the true relationship between the respondent probability for $y_i$ and the covariates is usually unknown. Moreover, the fully parametric model is sensitive to misspecification of the posited model, which is not testable under nonignorable missingness (Ibrahim et al., Citation2001). To solve the aforementioned problem, Kim and Yu (Citation2011) defined the following exponential tilting model for the propensity score: (1) $\pi(x, y) = \Pr(\delta = 1 \mid x, y) = [1 + \exp\{g(x) + \phi y\}]^{-1}$, where $g(\cdot)$ is some unknown function and φ is an unknown tilting parameter. Thus, (Equation1) defines a semiparametric model and is weaker than the fully parametric model given in Lee and Tang (Citation2006b). To estimate the propensity, Kim and Yu (Citation2011) considered two cases: φ is known, or φ can be estimated using additional data. When $\phi = 0$, model (Equation1) reduces to an MAR model. Let $f_0(y \mid x)$ be the conditional density of $y$ given $x$ and $\delta = 0$, and let $f_1(y \mid x)$ be the conditional density of $y$ given $x$ and $\delta = 1$. Then, following the argument of Kim and Yu (Citation2011), we have (2) $f_0(y \mid x) = f_1(y \mid x) \exp(\phi y) / E\{\exp(\phi y) \mid x, \delta = 1\}$, where the tilting parameter measures the amount of departure from the MAR assumption. When $\phi = 0$, (Equation2) reduces to $f_0(y \mid x) = f_1(y \mid x)$, which shows that the considered missingness mechanism is a natural extension of the MAR mechanism, and that the MAR mechanism is a special case of it.
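
A minimal numerical check of the tilting identity (Equation2), under an assumed standard normal $f_1(y \mid x)$ at a fixed x and an arbitrarily chosen tilting parameter: draws from $f_0$ are obtained by reweighting respondent draws with weights proportional to $\exp(\phi y)$; tilting a N(0, 1) density by $\exp(\phi y)$ yields N(φ, 1), so the reweighted mean should be close to φ.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = 0.8                       # assumed tilting parameter (departure from MAR)

# Draws from f1(y|x): the respondent density at a fixed x, assumed N(0, 1) here.
y1 = rng.normal(size=200_000)

# Exponential tilting: f0/f1 is proportional to exp(phi * y).
w = np.exp(phi * y1)
w /= w.sum()

print(np.sum(w * y1))           # ~0.8 = phi, the mean of the tilted density N(phi, 1)
```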

Under the assumption of model (Equation1), $g$ and the tilting parameter are unidentifiable when both are unknown (Shao & Wang, Citation2016). To address identifiability with an unknown tilting parameter (written as γ hereafter, following the notation of Shao and Wang), Shao and Wang (Citation2016) showed that all unknown parameters can be identified and consistently estimated if one can find an instrument (e.g., z) that does not appear in (Equation1), i.e., if $x = (u^\top, z^\top)^\top$ and (3) $\Pr(\delta = 1 \mid x, y) = [1 + \exp\{g(u) + \gamma y\}]^{-1}$, where $g(\cdot)$ is still an unknown function of u.

Following Shao and Wang (Citation2016), when γ is known, we can rewrite (Equation1) as $\{1 - \pi(x, y)\}/\pi(x, y) = \exp\{g(x) + \gamma y\}$. Thus, we have (4) $E(1 - \delta \mid x) = \exp\{g(x)\}\, E\{\delta \exp(\gamma y) \mid x\}$, which leads to $\exp\{g(x)\} = E(1 - \delta \mid x)/E\{\delta \exp(\gamma y) \mid x\}$. When γ is the true parameter, it follows from Kim and Yu (Citation2011) that the kernel regression estimator for $g(x)$ can be expressed as $\exp\{\hat{g}_\gamma(x)\} = \sum_{i=1}^n (1 - \delta_i) K_h(x - x_i) \big/ \sum_{i=1}^n \delta_i \exp(\gamma y_i) K_h(x - x_i)$, where $K_h(\cdot) = K(\cdot/h)/h$, $K(\cdot)$ is a symmetric kernel function on the real line, and $h$ is a smoothing bandwidth sequence such that $h \to 0$ and $nh \to \infty$ as $n \to \infty$.

When γ is unknown, under model (Equation3), Equation (Equation4) still holds with x replaced by u. It is assumed that the instrumental variable z takes finitely many discrete values, such as . For each l, we define where is the indicator function. Denote , . Then, we have (5) which are called the instrumental estimating equations (Shao & Wang, Citation2016). Note that when L=1 there is no instrumental variable, and Equation (Equation5) reduces to (Equation4), which is not sufficient for estimating and γ.

To estimate , the idea of profiling is adopted here. For every fixed γ, the sum of the L equations in (Equation5) is used to obtain the kernel estimator of , which is given by where the bandwidth depends on l. Note that is not an estimator of unless γ is the true parameter, but it is useful for profiling. More importantly, estimating only uses the sum of the L equations; the remaining L−1 estimating equations are used to estimate the unknown parameter γ (Shao & Wang, Citation2016), which shows that we require . After we obtain , substituting it for in (Equation5) leads to the following profiled L estimating equations: (6) A sample version of the profiled instrumental estimating Equation (Equation6) is (7) When L is greater than , where is the dimension of γ, (Equation7) has no solution because it is over-identified. In this case, the two-step generalised method of moments (Hansen, Citation1982) is adopted to estimate γ. To save space, we omit the details; see Shao and Wang (Citation2016).
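
A sketch of the kernel estimator of $\exp\{g(\cdot)\}$ for fixed γ described above, assuming a Gaussian kernel and a scalar covariate; the propensity used in the toy check follows model (Equation1) with the hypothetical choice $g(u) = u - 1.5$.

```python
import numpy as np

def exp_g_hat(u0, u, y, delta, gamma, h):
    """Kernel estimator of exp{g(u0)} for fixed gamma, based on the identity
    E(1 - delta | u) = exp{g(u)} * E{delta * exp(gamma * y) | u}.
    Only observed y's (delta = 1) are needed, so missing y's cause no trouble."""
    k = np.exp(-0.5 * ((u - u0) / h) ** 2)                 # Gaussian kernel weights
    num = np.sum((1 - delta) * k)
    den = np.sum(delta * np.exp(gamma * np.where(delta == 1, y, 0.0)) * k)
    return num / den

# Toy check: simulate from pi(u, y) = 1 / (1 + exp{g(u) + gamma*y}), g(u) = u - 1.5.
rng = np.random.default_rng(3)
n, gamma = 20_000, 0.5
u = rng.uniform(-1.0, 1.0, n)
y = u + rng.normal(scale=0.5, size=n)
pi = 1.0 / (1.0 + np.exp((u - 1.5) + gamma * y))
delta = rng.binomial(1, pi)
y = np.where(delta == 1, y, np.nan)                        # mask unobserved responses

print(exp_g_hat(0.0, u, y, delta, gamma, h=0.2))           # ~ exp(-1.5) = 0.223
```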

When the missing data mechanism for Y is not MAR but MNAR, the estimator of Cheng (Citation1994) is biased. In this case, θ can be estimated by , where is a consistent estimator of . Under the MNAR assumption, estimating is quite challenging in that y is not observed for the nonrespondents. When γ is known, Kim and Yu (Citation2011) proposed a nonparametric regression estimator for , which is given by where the weight represents the point mass assigned to when is approximated by . Thus, a nonparametric estimator of θ can be obtained by When , the estimator reduces to . Under some regularity conditions, Kim and Yu (Citation2011) showed where . The increase in variance due to missing data is , where . The above equation indicates that the variance increase depends on the respondent probability and the squared error . When , the variance increase is zero; when is quite small, the variance increase can be quite large. When does not depend on Y (i.e., ), reduces to the above defined . In practical applications, is usually unknown. To this end, Kim and Yu (Citation2011) presented a consistent estimator of , which is given by where , and with .
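
The following sketch implements this kind of imputation-based mean estimator: for each nonrespondent, $E(y \mid x, \delta = 0)$ is estimated by a kernel-weighted average over respondents with exponential tilting weights $\exp(\gamma y)$, which is the form implied by the tilting identity (Equation2); the kernel, bandwidth and simulated model below are illustrative choices.

```python
import numpy as np

def cond_mean_y0(x0, x, y, delta, gamma, h):
    """Estimate E(y | x = x0, delta = 0): a kernel average over respondents,
    tilted by exp(gamma * y) as implied by the exponential tilting model."""
    obs = delta == 1
    k = np.exp(-0.5 * ((x[obs] - x0) / h) ** 2)        # Gaussian kernel weights
    w = k * np.exp(gamma * y[obs])                     # tilting weights
    return np.sum(w * y[obs]) / np.sum(w)

def mean_estimator(x, y, delta, gamma, h):
    """Observed y's plus imputed conditional means for the nonrespondents."""
    imputed = [cond_mean_y0(xi, x, y, delta, gamma, h) for xi in x[delta == 0]]
    return (y[delta == 1].sum() + np.sum(imputed)) / len(x)

# Toy check under pi(x, y) = 1 / (1 + exp{g(x) + gamma*y}), g(x) = x - 1.5.
rng = np.random.default_rng(4)
n, gamma = 5_000, 0.5
x = rng.uniform(-1.0, 1.0, n)
y = x + rng.normal(scale=0.5, size=n)
pi = 1.0 / (1.0 + np.exp((x - 1.5) + gamma * y))
delta = rng.binomial(1, pi)
y = np.where(delta == 1, y, np.nan)                    # mask unobserved responses

print(mean_estimator(x, y, delta, gamma, h=0.2))       # close to E(y) = 0
print(y[delta == 1].mean())                            # complete-case mean, biased downward
```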

When γ is unknown, it can be estimated from an independent survey or a validation sample, which is a subsample of the nonrespondents. Let be the corresponding estimator of γ. Thus, a semiparametric estimator of is given by , which indicates that a semiparametric estimator of θ can be expressed as . When is obtained from an independent survey, under some assumptions such that and is independent of , Kim and Yu (Citation2011) showed where , and . The above equation shows that the increase in variance is due to estimating γ. A consistent estimator of can be expressed as , where is a consistent estimator of , and with . When is obtained from a validation sample, under some regularity conditions, Kim and Yu (Citation2011) proved where , in which r is an indicator taking the value 1 if the subject belongs to the follow-up sample and 0 otherwise, and , and is the probability limit of . It is easily seen from the definition of that when , and is minimised owing to the fact that when the assumed respondent model (Equation1) is true. Thus, the validity of the proposed estimator does not depend on the assumed respondent model, and the role of the respondent model is to improve efficiency. A consistent estimator of is given by where .

3.2. EL estimation

Due to the good properties of EL estimation, Qin et al. (Citation2002) developed a semiparametric EL approach to estimate mean functionals for data with nonignorable nonresponse. It is rather difficult to evaluate the semiparametric EL estimate of θ because multiple equations must be solved simultaneously. To address this issue, Zhao et al. (Citation2013) developed two EL methods to estimate under the MNAR assumption. In what follows, we only consider the case where γ is known; when γ is unknown, we can substitute a consistent estimator of γ to obtain a semiparametric estimator. Let . The estimated empirical log-likelihood ratio function for the parameter θ is given by When with probability tending to one, the optimal value of is , where is the Lagrange multiplier and satisfies . Thus, we have The maximum EL estimator (denoted by ) of θ can be obtained by maximising , which shows that, under some smoothness conditions, can be obtained by simultaneously solving the equations and . The nested optimisation routines (Owen, Citation2001, Ch. 12) can be adopted to evaluate . When , reduces to the EL estimator of Wang and Rao (Citation2002).
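
The computational core of such EL procedures is the inner Lagrange-multiplier problem: given estimating-function values $g_i(\theta)$, solve $\sum_i g_i/(1 + \lambda^\top g_i) = 0$ for λ and evaluate the log EL ratio. A minimal Newton-type sketch (with a crude step-halving safeguard rather than a production line search):

```python
import numpy as np

def el_log_ratio(g):
    """Empirical log-likelihood ratio for estimating equations E{g_i} = 0.
    g: (n, p) array of estimating-function values. Solves for the Lagrange
    multiplier lam in sum_i g_i / (1 + lam' g_i) = 0 by Newton's method,
    then returns -sum_i log(1 + lam' g_i), the log EL ratio."""
    n, p = g.shape
    lam = np.zeros(p)
    for _ in range(50):
        d = 1.0 + g @ lam
        grad = (g / d[:, None]).sum(axis=0)          # score for lam
        if np.linalg.norm(grad) < 1e-10:
            break
        hess = -(g / d[:, None] ** 2).T @ g          # Jacobian of the score
        step = np.linalg.solve(hess, -grad)
        # crude safeguard: halve the step until all EL weights stay positive
        while np.any(1.0 + g @ (lam + step) <= 1.0 / n):
            step *= 0.5
        lam = lam + step
    return -np.log(1.0 + g @ lam).sum()

# Toy usage: EL ratio for a mean; -2 * ratio is asymptotically chi^2_1 at the truth.
rng = np.random.default_rng(5)
y = rng.normal(loc=1.0, size=200)
print(-2 * el_log_ratio((y - 1.0)[:, None]))   # should be O(1)
```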

Under the MNAR assumption, it is easily shown that . Hence, the aforementioned maximum EL estimator for θ based on may be biased. To address this issue, Zhao et al. (Citation2013) proposed an augmented IPW imputation procedure. Let . Under the MNAR assumption (Equation1), we have , which shows that the random variable is an unbiased estimator of θ. In applications, is usually unknown because a nonparametric function is involved. Similar to Kim and Yu (Citation2011), we can substitute for to obtain a semiparametric estimator of as . Then, the estimated empirical log-likelihood ratio function for θ is defined as By the Lagrange multiplier method, can be expressed as , where satisfies . Thus, the maximum EL estimator (denoted by ) of θ can be obtained by maximising .

It is assumed that some auxiliary information on X may be available; for instance, the mean of X is equal to zero, or the distribution of X is symmetric. In particular, we assume that the auxiliary information on X has the form , where is a known vector function. In this case, the estimated empirical log-likelihood ratio function for θ can be defined as where . Using the Lagrange multiplier method, we obtain where is the solution to the equation . Similarly, the maximum EL estimator (denoted by ) of θ based on the auxiliary information can be obtained by maximising it.

Under some regularity conditions, if θ is the true parameter, Zhao et al. (Citation2013) showed

where , , , and with and .

The above equations characterise the asymptotic normality of , and , which can be employed to construct normal-approximation-based (NA-based) confidence intervals or regions for θ. For instance, an approximate 100(1−α)% NA-based confidence region for θ is given by . They also indicate that and have the same asymptotic variance, which can be explained by the fact that the nonparametric kernel estimator of the response probability is consistent when γ is known. Based on the asymptotic normality of the maximum EL estimators, we can approximate the covariance matrices of , and . Similarly, a consistent estimator of is given by . The consistent estimators of and are given by and , respectively. Thus, the consistent estimators of and are , and , respectively.

Also, it is easily seen from the above equations that (i) has a smaller asymptotic variance and hence is asymptotically more efficient than and ; (ii) when the auxiliary information on X is available and X is not subject to missingness, the reduction in the asymptotic variance of and relative to that of does not depend on the respondent probability function , which is consistent with the result under the MAR assumption for Y (Wang & Rao, Citation2002). When the missingness mechanism for Y is MAR, we have , which indicates that and reduce to and , respectively. Thus, Theorems 2.1 and 3.1 in Wang and Rao (Citation2002) and Theorems 1–3 in Xue (Citation2009) have been extended to our considered case.

We can also use the asymptotic distributions of , and to construct confidence regions for θ. For example, let be the quantile of for . Thus, an approximate 100(1−α)% EL-based confidence region for θ is defined by CI.

4. Estimation of parameters in EFNSEMs with MNAR

In educational, behavioural, social and psychological studies, structural equation models (SEMs) are widely used to identify relationships among observed and latent variables (Bentler & Wu, Citation2002; Jöreskog & Sörbom, Citation1996; Lee & Tang, Citation2004; Lee & Zhu, Citation2002). Many structural equation modelling packages have been developed to meet the strong demand in various fields; for example, see LISREL (Jöreskog & Sörbom, Citation1996), EQS6 (Bentler & Wu, Citation2002), Mplus (Muthén & Muthén, Citation2017) and WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, Citation2003). These packages mainly focus on normal data. In many applications, however, there are a large number of non-normal manifest variables. To this end, it is important to develop statistical inference tools to analyse nonlinear SEMs (NSEMs) with manifest variables following an exponential family distribution, which includes many commonly encountered distributions, such as the normal, exponential, binomial and Poisson distributions, as special cases.

4.1. MLE

For , let be a vector of manifest variables measured on each of n subjects, and let be a vector of latent variables. To measure the relationship between and , it is assumed that the components of given are conditionally independent, and that () given and is distributed according to the following one-parameter exponential family: (8) with and , where and are specific differentiable functions. The exponential family includes many distributions as special cases, such as the normal, exponential, binomial and Poisson. Similar to generalised linear models (McCullagh & Nelder, Citation1989), we measure the relationship between and via the following measurement model: (9) where u is a vector of unknown parameters, is a vector of explanatory variables, β is a matrix of regression coefficients, and Λ is a factor loading matrix. To identify the relationships among the latent variables, we partition as and assume that and satisfy the following nonlinear structural equation: (10) where and are and latent subvectors of , respectively, and ; ϒ is a matrix of regression coefficients, is a vector of covariates, is a nonzero vector-valued function with differentiable functions and ; Π () and Γ () are matrices of unknown regression coefficients of on and . Further, it is assumed that is a nonzero constant independent of the elements of Π, and that and are independently distributed as and , respectively, where . Let and . Thus, the nonlinear structural Equation (Equation10) can be rewritten as .

The above defined models (or their parameters) are not identifiable, and some identification conditions on these parameters are required. Following the widely adopted technique in SEM (Lee & Tang, Citation2006b), we fix some components of Λ and/or to some known values, for example, 0 or 1.

For simplicity, we assume that the manifest vector is incompletely observed and subject to nonignorable missingness, and that and are fully observed. We define a missing indicator , which takes the value 0 if is missing, and 1 if is observed. Similar to Lee and Tang (Citation2006b), it is assumed that the conditional distributions of and given and are independent for . Under this assumption, we consider the following model for the nonignorable missingness mechanism: (11) where , , and , and ϕ is an appropriate parameter vector. Since is binary, a sequence of logistic regressions can be adopted to model . Similar to Lee and Tang (Citation2006b), we consider the following logistic regression model for : (12) where , and and . The missingness mechanism defined in (Equation12) is missing not at random (MNAR). More discussion on the specification of missingness mechanism models can be found in Lee and Tang (Citation2006b).

Denote , , , , , , , and . Let θ be a parameter vector containing all unknown distinct parameters in , , Φ, and ϕ. The joint probability density function based on the complete data is given by where , is the probability density function of given , and is the probability density function of given . Denote . Thus, the log-likelihood function of θ based on the observed data is given by which is rather complicated and has no closed form. Hence, it is quite difficult to compute the ML estimate of θ based on . To address this issue, the well-known EM algorithm is adopted to evaluate the ML estimate of θ.

Based on the idea of data augmentation, the observed data are augmented with the missing data and the latent factors ω. Thus, we obtain the complete data set . Based on the missingness mechanism model (Equation12), the complete-data log-likelihood function for θ is given by (13) where (14) with , (15) (16) (17) where , and represent constants that do not depend on the parameters. Clearly, is simpler than in that it does not involve intractable multiple integrals, and it has four separate terms with separate parameters. Let be the conditional expectation of with respect to the conditional distribution , where represents the MLE of θ. The EM algorithm for evaluating the MLE of θ repeats the following two steps, the E-step (expectation step) and the M-step (maximisation step), until the algorithm converges. That is, at the κth iteration of the EM algorithm with a current value of θ (e.g., corresponds to the initial value of θ),

E-step: compute , where the expectation is taken with respect to the conditional distribution of given ;

M-step: find by maximising with respect to θ.

It is easily seen from the definition of that it is still quite difficult to compute because of the multiple integrals involved. To this end, the well-known Monte Carlo method is employed to approximate . Following Wei and Tanner (Citation1990), the Monte Carlo approximation to can be expressed as with , where is a sample drawn from the conditional distribution via the Metropolis-Hastings (MH) algorithm within the Gibbs sampler. Here, the Gibbs sampler is implemented as follows. Given the value of θ, at the tth iteration of the Gibbs sampler with the current values and of and ω, we first generate from the conditional distribution , and then draw from the conditional distribution .
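
The overall structure of this Monte Carlo EM (MCEM) scheme can be summarised in a generic skeleton; the callables below are problem-specific placeholders (here they would wrap the MH-within-Gibbs imputation of missing responses and latent factors, the complete-data log-likelihood, and the M-step maximiser), not functions defined in the paper.

```python
import numpy as np

def mcem(theta0, observed, draw_latent, complete_loglik, argmax_q,
         n_iter=50, m_draws=500):
    """Generic Monte Carlo EM skeleton (in the spirit of Wei & Tanner, 1990).
    draw_latent(theta, observed, m): m imputations of (missing data, latent factors);
    complete_loglik(theta, z): complete-data log-likelihood at one imputation z;
    argmax_q(q): maximiser of the Monte Carlo Q-function (the M-step)."""
    theta = theta0
    for _ in range(n_iter):
        draws = draw_latent(theta, observed, m_draws)         # E-step by simulation
        def q(t, draws=draws):                                # Monte Carlo Q-function
            return np.mean([complete_loglik(t, z) for z in draws])
        theta = argmax_q(q)                                   # M-step
    return theta
```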

In what follows, we derive the conditional distributions and . Since the 's are conditionally independent, and the 's are also conditionally independent and do not depend on Z, , Φ and , it follows from (Equation8) and (Equation9) that (18) which indicates that the conditional distribution is proportional to (19) with , where . The MH algorithm for drawing observations from the conditional distribution is as follows. At the tth iteration of the Gibbs sampler with the current values and of and , a candidate is generated from the normal distribution , where is selected so that the average acceptance rate is roughly 0.25, and with being the subvector of corresponding to the missing components of , , being the subvector of ϕ corresponding to , being the subvector of ϕ corresponding to , and .

It follows from the independence of subjects that . It follows from the assumptions given in Equations (Equation8), (Equation9) and (Equation12) that (20) with . Similarly, the MH algorithm for sampling observations from is implemented as follows. At the tth iteration of the Gibbs sampler with the current values and of and , a candidate is generated from the normal distribution , where is selected as , and with and .
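
Both conditionals above are sampled by random-walk MH steps with normal proposals tuned towards an average acceptance rate of about 0.25. A self-contained sketch of such a step, written for a generic unnormalised log-density (the toy target below is a standard normal, not one of the model's conditionals):

```python
import numpy as np

def rw_metropolis(log_target, x0, scale, n_steps, rng):
    """Random-walk Metropolis sampler for an unnormalised log-density.
    In the MH-within-Gibbs scheme above, log_target would be the log of the
    full conditional of the missing responses (or of the latent factors),
    known only up to a normalising constant."""
    x, lp = np.asarray(x0, float), log_target(x0)
    out = np.empty((n_steps, x.size))
    for t in range(n_steps):
        cand = x + scale * rng.normal(size=x.size)     # normal proposal
        lp_cand = log_target(cand)
        if np.log(rng.uniform()) < lp_cand - lp:       # accept/reject
            x, lp = cand, lp_cand
        out[t] = x
    return out

# Toy usage: sample a standard normal; tune `scale` for roughly 25% acceptance.
rng = np.random.default_rng(6)
draws = rw_metropolis(lambda v: -0.5 * np.sum(v**2), np.zeros(2), 2.4, 5000, rng)
print(draws.mean(axis=0), draws.std(axis=0))
```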

In the M-step, we need to maximise with respect to θ. However, there is no analytic solution of θ for . To this end, the widely used ECME algorithm (Liu & Rubin, Citation1994) is used to separately maximise the four terms of at each conditional maximisation (CM) step. Let be the kth row vector of the matrix , and let be the kth component of the vector . Thus, the closed forms for , and Φ at the th conditional maximum likelihood (CML) step of the EM algorithm with the current values , and are given by respectively.

However, there are no analytic solutions for u, , and ϕ at the κth CML step. To this end, the Newton-Raphson procedure is adopted to evaluate the solutions of the intractable CM steps for u, , ψ and ϕ because of its attractive local convergence properties. Let be the jth row subvector of . Define , , and . Denote , , , , where Given the current estimates , , and of u, , and ϕ at the κth iteration of the EM algorithm, the Newton-Raphson iterative formulae for evaluating , , and are given by respectively, where , , , and .

When the Newton-Raphson algorithm converges, the corresponding value of is denoted by , which is the desired ML estimate of θ. The above Newton-Raphson iterative procedure is repeated until the infinity norm is sufficiently small, for example, less than 0.001. Convergence of the above ECME algorithm is monitored using the log-likelihood ratio statistic via the bridge sampling method (Meng & Schilling, Citation1996).
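
A generic sketch of the Newton-Raphson update used in these intractable CM steps; `grad` and `hess` stand in for the (Monte Carlo approximations of the) first and second derivatives of the Q-function with respect to the relevant parameter block, and the stopping rule uses the infinity norm with the 0.001 tolerance mentioned above.

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-3, max_iter=100):
    """Newton-Raphson iteration x <- x - H(x)^{-1} g(x), stopped when the
    infinity norm of the update falls below `tol`."""
    x = np.asarray(x0, float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.max(np.abs(step)) < tol:
            break
    return x

# Toy usage: maximise -0.5 * (x-1)' A (x-1); the maximiser is (1, 1).
A = np.array([[2.0, 0.3], [0.3, 1.0]])
g = lambda x: -A @ (x - 1.0)       # gradient of the concave quadratic
H = lambda x: -A                   # Hessian
print(newton_raphson(g, H, np.zeros(2)))
```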

Similar to the missing data literature (e.g., see Lee & Zhu, Citation2002), we can use the inverse of the observed information matrix to approximate the asymptotic covariance matrix of . However, it is rather difficult to obtain an analytic expression for the observed information matrix because of the multiple integrals involved. Here, an identity of Louis (Citation1982) is adopted to evaluate the standard error estimates of the components of on the basis of the observations generated from the conditional distribution via the above MH algorithm within the Gibbs sampler. Following Louis (Citation1982), can be approximated by where , and for any vector a. Expressions for and are given in the Appendix.

4.2. Bayesian estimation

According to the above argument, it is rather difficult to make Bayesian inference on θ via the observed-data likelihood function . Inspired by the idea of data augmentation (Tanner & Wong, Citation1987), we regard as hypothetical missing data and augment them with the observed data set in the posterior analysis. The posterior distribution , which is proportional to , is simpler to evaluate than , the posterior distribution based on the observed data, where is the prior density of θ. However, it is difficult to make inference on θ from the posterior distribution because it is not easy to draw observations from directly. To address this issue, Bayesian analysis is conducted on the basis of the complete-data density function through MCMC methods, such as the Gibbs sampler (Geman & Geman, Citation1984) and the MH algorithm. Here, the Gibbs sampler is adopted to draw a sequence of random observations from the posterior distribution , and is conducted as follows. At the th iteration of the Gibbs sampler with current values , , and , we iteratively

  1. draw observation from the conditional distribution ,

  2. draw observation from the conditional distribution ,

  3. draw observation from the conditional distribution .

Following the argument of Geman and Geman (Citation1984), under some regularity conditions and for a sufficiently large , the observations can be regarded as an observation from the posterior distribution . Hence, the posterior distribution can be approximated by the empirical distribution of the observations , where is a sufficiently large number. The widely used 'estimated potential scale reduction' (EPSR; Gelman, Citation1996) values of the parameters in θ, computed sequentially as the iterations proceed from several different starting values of the parameters, are employed to monitor the convergence of the Gibbs sampler. When the EPSR values of all parameters in θ are less than 1.2, we claim that the algorithm has converged.
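
A minimal implementation of the EPSR statistic for a single scalar parameter, computed from several chains started at dispersed initial values; the 1.2 threshold is the one quoted above.

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction (Gelman, 1996) for one scalar
    parameter. `chains` is an (m, t) array: m parallel chains of length t;
    values below 1.2 suggest convergence."""
    m, t = chains.shape
    means = chains.mean(axis=1)
    b = t * means.var(ddof=1)                  # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (t - 1) / t * w + b / t
    return np.sqrt(var_plus / w)

# Toy usage: three well-mixed chains should give an EPSR close to 1.
rng = np.random.default_rng(7)
print(epsr(rng.normal(size=(3, 2000))))
```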

(1) Prior specifications

For the parameters u, β and Λ given in the measurement model (Equation9), we consider the following priors: (21) where and () are hyperparameters whose values are prespecified, and Gamma represents the Gamma distribution with parameters a and b. For , it is assumed that is independent of . When and are selected to be large and is taken to be small (e.g., 0.01), the above priors carry little information.

For the parameters , and Φ associated with the structural equation model (Equation10), we consider the following prior: (22) where , , , , and are hyperparameters whose values are assumed to be pre-specified, and denotes the inverted Wishart distribution. When approaches 0, the prior becomes non-informative.

For the parameter vector ϕ associated with the missingness mechanism model (Equation12), a normal prior is specified: (23) where and are hyperparameters. Similarly, when approaches 0, the prior becomes non-informative.

(2) Conditional distributions

Under the aforementioned assumptions, the joint distribution of has the following decomposition: It is reasonable to assume from the definition of θ that where , , , , , and are the prior distributions of and ϕ, respectively. Thus, the joint distribution of can be written as (24) which is adopted to derive the conditional distributions required in implementing the Gibbs sampler, as follows.

For the conditional distribution given in (Equation18), it follows from the independence of subjects that . The conditional distribution is given in (Equation19).

For the conditional distribution given in (Equation20), since is independent of for , it follows from (Equation24) that . The conditional distribution is given in (Equation20).

Consider the conditional distributions , and . It follows from (Equation8), (Equation9), (Equation24) and (Equation21) that (25) (26) (27) for , where and . In particular, if does not depend on , where is some appropriate function, we have .

Consider the conditional distributions , and . It follows from (Equation10), (Equation22) and (Equation24) that the joint conditional distribution is proportional to which leads to (28) where , and . Also, it is easily shown from (Equation10), (Equation22) and (Equation24) that the conditional distribution is proportional to which yields (29)

Consider the conditional distribution . It follows from (Equation12), (Equation23) and (Equation24) that (30) Note that the conditional distributions of and are given for the situation without fixed parameters. The method presented in Lee and Tang (Citation2006b) can be applied to handle the situation with fixed parameters.

(3) Implementation

It is easily seen from (Equation28) and (Equation29) that the conditional distributions , and are standard distributions. Thus, it is fast and easy to sample observations from them. However, the conditional distributions , , , , , and given in (Equation19), (Equation20), (Equation25)–(Equation27) and (Equation30), respectively, are non-standard and complicated, and it is rather difficult to draw observations from them directly. To address this issue, the MH algorithm is adopted to generate observations from these conditional distributions. The MH algorithms for sampling and from their corresponding conditional distributions have been given in Section 4.1.

To sample u from via the MH algorithm, we take as the target density and select as the proposal distribution, where is selected so that the average acceptance rate is approximately 0.25, and with , in which and for . The MH algorithm is implemented as follows. At the tth iteration of the Gibbs sampler with a current value , a new candidate u is sampled from and accepted with probability

Similarly, the MH algorithms for sampling observations from , and are as follows. At the tth iteration of the Gibbs sampler with the current values , and , new candidate values , and ϕ are generated from the proposal distributions , and , respectively, and accepted with probabilities respectively, where and , , and are chosen so that the average acceptance rate is approximately 0.25.

(4) Bayesian estimates of parameters and latent variables

Based on the observations sampled via the aforementioned MH algorithm together with the Gibbs sampler, we can obtain joint Bayesian estimates of the parameters in θ and the latent variables 's, together with their corresponding standard errors.

Let be the observations of sampled from their joint conditional distribution via the aforementioned MCMC algorithm. Bayesian estimates of and can then be evaluated by respectively. Similarly, a consistent estimate of the posterior covariance matrix var can be computed from the sample covariance matrix of the observations drawn using the aforementioned MCMC algorithm; for example, . Hence, the estimated standard errors of the components of θ can be obtained from the diagonal elements of the sample covariance matrix.
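
In code, these posterior summaries are just sample moments of the retained draws; the array below is a stand-in for real MCMC output after burn-in.

```python
import numpy as np

# Posterior summaries from MCMC output: `draws` is a (J, p) array of J retained
# draws of the p-dimensional parameter theta.
rng = np.random.default_rng(8)
draws = rng.normal(loc=[0.5, 2.0], scale=[0.1, 0.3], size=(5000, 2))  # stand-in draws

theta_hat = draws.mean(axis=0)              # Bayesian estimate: posterior mean
post_cov = np.cov(draws, rowvar=False)      # sample covariance of the draws
std_err = np.sqrt(np.diag(post_cov))        # estimated posterior standard errors

print(theta_hat, std_err)
```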

4.3. Bayesian EL estimation

Although the aforementioned assumptions are common and reasonable in many cases, they may lead to biased estimates or even misleading conclusions when the true distribution of the response variable is highly non-exponential. In such cases, we assume that the distribution of is unknown but satisfies , and . It is also assumed that the 's are subject to nonignorable missingness, and the 's are completely observed. The assumption for is the same as that given in Section 4.1.

Let . When 's are completely observed, under the above assumption, we can construct the following estimating equations: which satisfies , where . Under the assumption given in Equation (Equation10), we can construct the following estimating equations: which satisfies . Under the assumption , we can construct the following estimating equations: which satisfies . According to the assumption given in Equation (Equation12), we can construct the following estimating equations: which satisfies , where . Also, under the assumption given in Equation (Equation10), we can construct the following estimating equations: which satisfies .

Denote , , , . Following Owen (Citation2001), we define the profile EL ratio function for θ as It follows from Owen (Citation2001) that the optimal value of can be expressed as (31) where is the root of the equation Thus, we have . Hence, the EL function of θ is given by , which can be regarded as a working likelihood for θ. Based on this working likelihood and the prior distribution of θ, the posterior probability density of θ can be expressed as , where .

In our considered situation, where the 's are subject to missingness and the 's are unobserved latent variables, it is impossible to make Bayesian inference on θ based on directly. To address this issue, a hybrid algorithm is proposed to impute the latent variables and missing responses by combining the Gibbs sampler and the MH algorithm based on the posterior distributions , and . In the Gibbs sampler, the first difficulty is the selection of initial values for the unobserved latent variables and missing responses. A simple approach is to independently generate several MCMC observations based on the classical SEM applied to the complete cases. The second difficulty is estimating the conditional distributions and . To this end, the approach proposed by Wei, Ma, and Carroll (Citation2012) is employed to estimate and based on the EL , and the latent variables and missing responses are then drawn using the linear interpolation method. Following Zhang and Tang (Citation2017), the algorithm for implementing the Gibbs sampler can be summarised as follows.

Step 1. Let t=0 and take the initial values and of θ and , respectively.

Step 2. Sample observations of θ from its posterior distribution. The MH algorithm for sampling θ is conducted as follows. At the th iteration with a current value of θ, we first draw a candidate value of θ from the proposal distribution , where is chosen so that the average acceptance rate is roughly 0.3; we then accept as with probability where the 's are generated using the modified Newton-Raphson algorithm (Chen, Sitter, & Wu, Citation2002). Note that it may be impracticable to implement the above procedure in one block because the components of θ have different features; in this case, we can iteratively implement the above procedure for u, , , Φ, ϕ and .

Step 3. Sample latent variables from the corresponding conditional distributions. First, we estimate the conditional density . To this end, we take grid points (e.g., ) in some neighbourhood of , and calculate for based on Equation (Equation31). The conditional density function of latent variable is estimated by where , is the ιth component of . Thus, the cumulative distribution function of latent variable can be estimated by Following the idea of the linear interpolation method (Wei et al., Citation2012), we first sample a random variable u from the uniform distribution , and then take as the th iteration value of . Thus, .

Step 4. Sample the missing response from the corresponding conditional distribution. First, we estimate the conditional density . To this end, we take grid points (e.g., ) in some neighbourhood of , and calculate for based on Equation (Equation31). The conditional density function of is estimated by where is the ιth component of . The cumulative distribution function of the missing response is estimated by for . Similarly, we first sample a random variable u from , and define as the th iteration value of . Thus, . (A code sketch of this grid-based inverse-CDF step is given after Step 5.)

Step 5. Repeat Steps 2–4 until convergence of the algorithm. Bayesian estimates and of θ and can be taken to be their corresponding sample means of the generated observations.
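
Steps 3 and 4 both reduce to drawing from a distribution whose density is known only on a grid: accumulate the grid density into a CDF and invert it at a uniform draw by linear interpolation (in the spirit of Wei et al., Citation2012). A minimal sketch:

```python
import numpy as np

def sample_by_interpolation(grid, dens, n, rng):
    """Draw samples from a density known only on a grid (as in Steps 3-4):
    build the estimated CDF by cumulative sums, then invert it at uniform
    draws using linear interpolation."""
    p = dens / dens.sum()                     # normalise the grid density values
    cdf = np.cumsum(p)
    u = rng.uniform(size=n)
    return np.interp(u, cdf, grid)            # linear interpolation of the inverse CDF

# Toy usage: grid approximation of N(0, 1), then inverse-CDF draws from it.
rng = np.random.default_rng(9)
grid = np.linspace(-4, 4, 81)
dens = np.exp(-0.5 * grid**2)
draws = sample_by_interpolation(grid, dens, 10_000, rng)
print(draws.mean(), draws.std())              # ~0 and ~1
```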

5. Influence analysis of EFNSEMs with MNAR

5.1. Local influence analysis

Let be a vector of perturbations restricted to some open subspace Ξ, and let be the complete-data log-likelihood of the perturbed model. Let be the ML estimate of θ for the perturbed model; that is, is obtained by maximising . We assume that there is a such that for all θ; thus, we have , and represents no perturbation. Following Zhu and Lee (Citation2001), we consider the Q-displacement function . Similar to Cook (Citation1986), we define the influence graph of as . Thus, the normal curvature of in the unit direction () at can be defined as where , , and . It follows from Poon and Poon (Citation1999) that the conformal normal curvature in the unit direction at can be expressed as (32) Let , and let be the nonzero eigenvalues of , with the corresponding orthogonal eigenvectors. Similar to Cook (Citation1986), the eigenvector corresponding to the largest eigenvalue is important for assessing local influence. However, as many authors (e.g., Lesaffre & Verbeke, Citation1998; Poon & Poon, Citation1999) have pointed out, it is not enough to assess local influence by examining alone. To address this issue, Poon and Poon (Citation1999) proposed using the statistic to assess local influence, where for . It is easily seen that for , where is the jth component of , is an vector of perturbations with its jth entry 1 and zeros elsewhere, and is the jth diagonal element of the matrix . Thus, evaluating is rather simple because it does not require computing the eigenvectors or eigenvalues of the matrix . Let and SM be the mean and standard error of , respectively. Similar to Lee and Tang (Citation2004), is used as a benchmark, where c is a constant selected on a problem-by-problem basis. To wit, the jth observation is identified as influential if .
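
A sketch of these aggregated conformal normal curvature diagnostics: given the matrix of mixed second derivatives of the Q-function with respect to (θ, ω) and the Q-function Hessian in θ (in practice both obtained by Monte Carlo approximation), the statistics are the diagonal of the normalised curvature matrix, flagged against the mean-plus-c-standard-deviations benchmark (c = 2 here is an arbitrary choice).

```python
import numpy as np

def local_influence_diagnostics(delta, q_hess, c=2.0):
    """Aggregated conformal normal curvature diagnostics (Poon & Poon, 1999 style).
    delta:  (p, n) matrix of d^2 Q / d theta d omega_j at (theta_hat, omega_0);
    q_hess: (p, p) Hessian d^2 Q / d theta d theta' at theta_hat (negative definite).
    Returns the per-component curvatures and the benchmark mean + c * std."""
    curv = -2.0 * delta.T @ np.linalg.solve(q_hess, delta)   # normal curvature matrix
    b = curv / np.linalg.norm(curv)                          # conformal normalisation
    m = np.diag(b).copy()
    return m, m.mean() + c * m.std(ddof=1)

# Toy usage with random inputs standing in for the Monte Carlo approximations.
rng = np.random.default_rng(10)
p, n = 3, 50
delta = rng.normal(size=(p, n)); delta[:, 7] *= 6.0          # make case 7 influential
q_hess = -5.0 * np.eye(p)
m, bench = local_influence_diagnostics(delta, q_hess)
print(np.where(m > bench)[0])                                # flags case 7 (possibly among others)
```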

As Lee and Tang (Citation2004) pointed out, it is quite difficult to evaluate and because of multiple integrals involved. To this end, Lee and Tang (Citation2004) employed the well-known Monte Carlo method to approximate and . Let be a set of observations generated from the conditional distributions and given in Equations (Equation19) and (Equation20) via the MH algorithm given in Section 4.2.

Thus, and can be approximated by According to the above EM algorithm, at the E-step of the last MCECM iteration, a sample has been generated at the ML estimate . If , no additional observations are required; otherwise, the MH algorithm is used again to sample additional observations to make up a sufficiently large sample for local influence analysis. Generally, is selected on a problem-by-problem basis; in some applications, a few thousand (e.g., 5000) observations are sufficient for models with moderate sample sizes.

The above presented local influence measures are mainly defined on the basis of and , which are closely associated with , and , . It follows from Equation (Equation13) that the complete-data log-likelihood function , is a relatively simple function of three separate terms with separable parameters. Hence, is a diagonal block matrix, and many components of are zero for many perturbed models. Thus, the computation of the above defined local influence diagnostic measures is quite simple. As an illustration, six perturbation schemes are considered as follows.

(1) Case weights perturbation

We first consider simultaneous changes in the weights of all subjects. Let be an vector of case-weights. In this case, represents no perturbation, and the complete-data log-likelihood function for the perturbed model is written as (33) where , , . In particular, when and for , we obtain , where is the ML estimate of θ after deleting the ith data point. Thus, this perturbation scheme is an extension of the case-deletion method.

(2) Additive perturbation on components of the manifest vectors

Here we consider an additive perturbation on the manifest vector, i.e., , where is the kth component of . In this case, , , and corresponds to the case of no perturbation. The complete-data log-likelihood function for the perturbed model is given by (Equation33), but with , where . In particular, we consider a perturbation on the manifest vector , i.e., , where is a scalar and .

For the above two perturbation schemes, only the terms and in are nonzero, while the other two terms are zero. Moreover, only involves , , and . Thus, the computational burden for evaluating the diagnostic measures is not too heavy. Also, the two perturbations give more precise and direct insights with clear interpretations.

(3) Perturbation on

Consider perturbation on using a vector such that . In this case, represents no perturbation. The perturbed complete-data log-likelihood function is given by (34) where . It is easily seen from Equation (Equation34) that is the unique nonzero derivative in , and computation of does not involve , and . Hence, local influence diagnostics for this perturbation scheme only depend on and , and the nonlinear SEM.

(4) Multiplicative perturbation on the latent vector

This perturbation scheme was considered by Lee and Tang (Citation2004). Following Lee and Tang (Citation2004), we consider a perturbation on : for , where . This perturbation scheme includes the multiplicative perturbation on the latent subvector or , obtained by using or , respectively. In this case, represents no perturbation. The complete-data log-likelihood function for the perturbed model is given by where

(5) Additive perturbation on the latent vector

This perturbation scheme has been considered by Lee and Tang (Citation2004). Following Lee and Tang (Citation2004), we consider an additive perturbation on the latent vector by taking . In this case, represents no perturbation. The perturbed complete-data log-likelihood is given by where The above three perturbation schemes are closely associated with the nonlinear SEM (Equation10) that measures the relationships among the latent variables. These perturbation schemes can be used to detect the effect of the latent variables on the manifest observations.

(6) Perturbation on all unknown parameters

Similar to Lee and Tang (Citation2004), we consider perturbation on all unknown parameters. In this case, the perturbed parameter vectors are given by , , , , where , and are diagonal matrices of appropriate orders, and is a matrix that can be determined by users; and is the perturbed vector corresponding to for i=1,2, and and are the perturbed vectors corresponding to Φ and ϕ, respectively. In this case, represent no perturbation for i=1,2,3,4. The complete-data log-likelihood function of the perturbed model has the form of Therefore, we have where is a diagonal block matrix with blocks , , and .

5.2. Case-deletion analysis

In what follows, a quantity with a subscript '' denotes the quantity evaluated from a reduced data set, which is obtained by deleting the ith observation from the full sample . For example, denotes the ML estimate of θ with the ith data point deleted; that is, is obtained by maximising the objective function , where represents the sub-matrix of with the ith data point deleted.

To assess the effect of the ith individual on the ML estimate of θ, we consider the following generalised Cook's distance: where . Similar to the definition of likelihood displacement (Zhu & Lee, Citation2001), we define the Q-function-based displacement as which quantifies the change in the Q-function after deleting the ith observation . Generally, when or is large, the ith observation has a relatively large effect on the ML estimate of θ.

To compute and , we first need to evaluate , which means re-running the above ECME algorithm for each of the n observations. Clearly, this is computationally intensive when the sample size n is large. To address this issue, we consider the following one-step approximation of : (35) where is the one-step approximation of , and . For a proof of Equation (Equation35), see Zhu and Lee (Citation2001). Using to replace in and leads to the following approximations and of and : (36) (37) respectively.

To evaluate and , it is necessary to calculate , , and . However, none of these four quantities has a closed form for our considered model. To this end, Monte Carlo approximation is employed. Let be observations of and ω drawn from the joint conditional distribution via the above MH algorithm within the Gibbs sampler. Thus, , , and can be approximated by (38) (39) (40) (41) respectively.
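
A sketch of the resulting case-deletion diagnostics: the one-step update replaces the full re-run of ECME for each deleted case, and the generalised Cook's distance is a quadratic form in the resulting parameter shift; the Hessian and per-case score contributions below are random stand-ins for the Monte Carlo approximations (Equation38)–(Equation41).

```python
import numpy as np

def one_step_case_deletion(score_i, hess, theta_hat):
    """One-step approximation to the case-deletion estimate (Zhu & Lee, 2001 style):
    theta_[i] ~ theta_hat + H^{-1} s_i, where H stands in for the negative
    Q-function Hessian and s_i for the contribution of case i to the Q-score,
    both evaluated at theta_hat."""
    return theta_hat + np.linalg.solve(hess, score_i)

def gen_cook_distance(theta_del, theta_hat, hess):
    """Generalised Cook's distance: quadratic form in (theta_[i] - theta_hat)."""
    d = theta_del - theta_hat
    return float(d @ hess @ d)

# Toy usage with stand-in quantities.
rng = np.random.default_rng(11)
p = 4
hess = 10.0 * np.eye(p)                       # stand-in for -Q''(theta_hat)
theta_hat = np.zeros(p)
for i, s in enumerate(rng.normal(size=(5, p))):
    th_i = one_step_case_deletion(s, hess, theta_hat)
    print(i, round(gen_cook_distance(th_i, theta_hat, hess), 4))
```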

The main steps in implementing the above-proposed case-deletion influence analysis are summarised as follows.

Step (A). Based on observations generated from the joint conditional distribution via the above-presented MH algorithm within the Gibbs sampler, calculate , , , and using the approximations given in Equations (35) and (38)–(41) for .

Step (B). Calculate and via the formulae (36) and (37) for .

Step (C). Identify influential observations by inspecting index plots of and based on the corresponding benchmarks, respectively.

5.3. Bayesian local influence analysis

Section 5.1 discusses local influence analysis of EFNSEMs with MNAR based on the Q-function of the EM algorithm, but it does not consider perturbations to the prior of θ. To this end, we develop a Bayesian local influence approach to assess the effect of simultaneous minor perturbations to the data, the prior and the sampling distribution on the posterior quantities of interest in EFNSEMs with MNAR.

(1) Bayesian perturbation model and manifold

Similar to Zhu et al. (2011), we define a class of perturbation models for simultaneously perturbing the data, the prior and the sampling distribution as and , where in which , and represent perturbations to the prior, the data and the sampling distribution, respectively. Denote . It is assumed that represents no perturbation.

Following Zhu et al. (2011), the perturbed model can be regarded as a -dimensional manifold under some regularity conditions. The tangent space of at each is composed of functions , where , , and is the kth component of υ. Thus, under some regularity conditions, the quantities (42) form the metric tensor of (denoted by ), where denotes the expectation taken with respect to the joint probability density function . The th component of measures the amount of perturbation characterised by , while reflects the amount of association between and . If is a diagonal matrix, the components of υ are orthogonal to each other and the corresponding perturbation is called an appropriate perturbation. If is not a diagonal matrix, we take a new perturbation vector such that computed at equals , where c is a positive scalar. As an illustration, we consider the following perturbations.
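The sketch below illustrates one common way to carry this out numerically: estimate the metric tensor from Monte Carlo samples of the perturbation score, then rescale the perturbation so its metric becomes a multiple of the identity. The transform direction we use, omega0 + c**-0.5 * G**0.5 (omega - omega0), is one of the forms appearing in the literature and should be checked against Zhu et al. (2011); all names here are ours.

```python
import numpy as np

def metric_tensor(scores):
    """Estimate the metric tensor G(omega_0) in (42) from Monte Carlo samples
    of the perturbation score (the derivative of the perturbed log-density
    with respect to omega, evaluated at omega_0).

    scores : (J, q) array with one score vector per draw.
    """
    return scores.T @ scores / scores.shape[0]

def to_appropriate_perturbation(omega, omega0, G, c=1.0):
    """Map omega to a new perturbation vector whose metric tensor at omega0
    equals c times the identity (an 'appropriate' perturbation)."""
    w, V = np.linalg.eigh(G)                      # G is symmetric PSD
    G_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return omega0 + (G_half @ (omega - omega0)) / np.sqrt(c)
```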

Experiment 1. Consider perturbation to the priors of u, , , , , Φ and ϕ by assuming , , , , , , , and , where , , , , , and are positive scalars, and . Thus, , and indicates no perturbation. The perturbation model constitutes a Riemannian manifold. The tangent space of consists of . Moreover, it is easily shown that

Experiment 2. Consider the perturbation to by assuming . In this case, , and represents no perturbation. The perturbation model forms a manifold. Denote , in which for . The tangent space of is spanned by the n quantities . It is easily shown that , where with , and represents the expectation taken with respect to the joint distribution of .

Experiment 3. In this experiment, we consider the perturbation to the sampling distribution by assuming , where , is a normalising constant, and is a fixed scalar function having zero mean under . This perturbation scheme can be intuitively interpreted as a perturbation to the mean and variance of the normal distribution for , i.e., . In this case, represents no perturbation. The tangent space of is spanned by the n quantities for . It is easily shown that , where , in which represents the expectation taken with respect to the joint distribution .

Experiment 4. Consider the perturbation to the missing-data mechanism model (12) by assuming , where is defined in Experiment 2, and . Thus, represents no perturbation. The tangent space of the manifold is spanned by . Again, we have , where the expectation is taken with respect to the joint distribution .

(2) Bayesian local influence measures

Consider an objective function , such as the Bayes factor, the φ-divergence or the posterior mean distance. Let be a geodesic on with . Taking the first-order Taylor expansion of at t=0 leads to , where , , and .

Following Zhu et al. (2011), when , we define the first-order influence (FI) measure in the direction as , where and is a positive semi-definite matrix. For the appropriate perturbation , in the unit direction at can be rewritten as . Similar to Poon and Poon (1999), we define the first-order adjusted influence measure in the unit direction h at as , where with . From the above definition, it is easily seen that to evaluate the first-order adjusted local influence measure , we only require calculating and and selecting an appropriate matrix .

Following the argument given in Section 5.1, for , where is the jth diagonal element of matrix . Again, is used as a benchmark, where and are the mean and standard error of , respectively.
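As a small illustration of this benchmark rule, the following sketch flags the diagonal elements of an influence matrix that exceed a mean-plus-k-standard-deviations cut-off; the multiplier k (set to 2 here) is our assumption, since the text leaves it unspecified.

```python
import numpy as np

def flag_influential(M0, k=2.0):
    """Flag diagonal elements of an influence matrix that exceed the
    benchmark mean + k * standard deviation of all diagonal elements.
    Returns the flagged indices and the benchmark value."""
    m = np.diag(M0).copy()
    benchmark = m.mean() + k * m.std(ddof=1)
    return np.where(m > benchmark)[0], benchmark
```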

Experiment 5. Consider the logarithm of the Bayes factor for comparing υ with on the basis of the observed data : , where . If we take , we have and when . Again, it is rather difficult to calculate . To address the issue, the Monte Carlo approximation method is adopted here. To wit, can be approximated by , where observations are generated from the joint posterior distribution via the MH algorithm within the Gibbs sampler.

If , it follows from a Taylor series expansion that , where and . Following Zhu et al. (2011), the second-order influence measure (SI) in the unit direction is defined as . In particular, for an appropriate perturbation , reduces to . Thus, the second-order adjusted influence measure at in the unit direction h is defined as , where , in which . Similarly, the diagonal elements of the matrix can be used to identify potential influential observations, misspecified or inappropriate sampling distributions, misspecified missingness data mechanisms and misspecified priors. Moreover, is used as a benchmark, where and are the mean and standard error of , respectively.

Experiment 6. We take the objective function to be the φ-divergence between the two posterior probability density functions before and after introducing the perturbation υ, which is defined by , where , , and is a convex function with , such as that of the Kullback–Leibler divergence. It is easily shown that and , where and is the expectation taken with respect to the joint posterior distribution . In practice, can be approximated by , where observations are generated from the joint posterior distribution via the MH algorithm within the Gibbs sampler.

Experiment 7. Define and as the posterior means of before and after introducing υ, respectively, where is some known function of θ. We define Cook's posterior mean distance for characterising the effect of υ on the posterior mean of as where . If we take , we have and , where , which can be approximated by where observations are generated from the joint posterior distribution via the MH algorithm within the Gibbs sampler.
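A Monte Carlo sketch of such a posterior mean comparison is given below. It reweights a single set of posterior draws by the log density ratio of the perturbed to the unperturbed model, which is one standard way to approximate the perturbed posterior mean without re-running the sampler; the function name and the default choice of the weight matrix as the inverse posterior covariance are our assumptions, not the authors' exact formula.

```python
import numpy as np

def cook_posterior_mean_distance(theta_draws, log_ratio, W=None):
    """Monte Carlo sketch of Cook's posterior mean distance for h(theta)=theta.

    theta_draws : (J, d) draws from the unperturbed posterior.
    log_ratio   : (J,) log-ratio of the perturbed to the unperturbed joint
                  density at each draw; self-normalised importance weighting
                  with these ratios approximates the perturbed posterior mean.
    W           : weight matrix; defaults to the inverse posterior covariance.
    """
    w = np.exp(log_ratio - log_ratio.max())   # stabilised importance weights
    w /= w.sum()
    mean_unpert = theta_draws.mean(axis=0)    # unperturbed posterior mean
    mean_pert = w @ theta_draws               # reweighted (perturbed) mean
    d = mean_pert - mean_unpert
    if W is None:
        W = np.linalg.inv(np.atleast_2d(np.cov(theta_draws.T)))
    return d @ W @ d
```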

For the perturbation schemes and objective functions considered above, the following four key steps are employed to conduct Bayesian influence analysis.

Step 1. Construct a Bayesian perturbation model .

Step 2. Take an objective function , and calculate , and .

Step 3. If , we calculate the first-order adjusted influence (FIC) measure . However, when , we calculate the second-order adjusted influence (SIC) measure.

Step 4. If (or ), the jth observation is detected as influential, where for k=F and S.

5.4. Bayesian case-deletion measures

In this subsection, we consider the effect of deleting a set of observations (denoted by S) on posterior inferences about θ from a Bayesian point of view. In fact, this problem has been investigated in the frequentist framework; for example, see Section 5.2. A subscript ‘[S]’ represents the relevant quantity with all observations in the set S deleted. Let and be the subsamples of and consisting of all the observations in the set S, respectively. The posterior distribution of θ for the observed data set is given by .

Following Zhu et al. (2012), to quantify the effect of deleting all the observations in the set S on the posterior distribution of θ, we define the following Bayesian case influence measure: , where , and is a convex function with . is a measure of the distance between the two posterior distributions and , and the observations in the data set S are identified as a set of influential observations when is large. Generally, can take various forms, for example, for , for and for , corresponding to the Kullback–Leibler divergence (K-L divergence), the symmetric K-L divergence, the -distance and the -divergence, respectively.

To measure the effect of deleting all the observations in data set S on the posterior mean of θ, we define the second type of Bayesian case influence measure as (43) where is a user-specified positive definite matrix, and are the posterior means of θ based on the observed data sets and , respectively. Generally, we take as the inverse of the posterior covariance matrix of θ. A large value of indicates that deleting all the observations in data set S leads to a large effect on the posterior mean of θ. Thus, the observations in the data set S are identified as influential.

Although and both measure the effect of the data set S on posterior quantities, their statistical interpretations are not the same. assesses the effect of the data set S on the overall posterior distribution, including its shape, mode and mean, and is sensitive to any change in the posterior distribution. In contrast, only measures the effect of the data set S on the posterior mean of θ, and is therefore sensitive only to changes in the posterior mean.

To compute and , we require calculating the posterior densities and . It is rather difficult to compute them because of the latent variables and missing manifest variables involved. To this end, it is desirable to develop an easily computed algorithm or formula. We define the ratio of the observed-data likelihoods of θ with and without the data set S as . It is easily shown that (44) which indicates that . Thus, can be written as (45) where is the expectation taken with respect to the posterior distribution . In particular, for the K-L divergence (i.e., ), is expressed as .

It follows from Equation (44) and the definitions of and that and , which indicates that can be evaluated by using MCMC samples from the full posterior distribution . In particular, can be computed by averaging the MCMC samples of θ, and can be approximated by the inverse of the posterior covariance matrix of the MCMC samples of θ. From the above discussion, it is easily seen that computing Equations (43) and (45) depends heavily on the computation of .

According to the definition of , we have , which shows that it is very cumbersome to approximate . Thus, it is rather infeasible to compute and directly because of the intractable multiple integral involved. Alternatively, some numerical methods, such as the trapezoidal rule, can be adopted to approximate .

On the other hand, similar to Zhu et al. (2012), one can develop a computationally feasible first-order approximation to and . In particular, under some regularity conditions, can be approximated by , and the one-step approximation for is given by , where , at . It follows from that , which shows that and can be approximated by , where observations are drawn from the conditional distribution .
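To illustrate how both measures can be read off a single MCMC run, here is a Python sketch based on the reweighting identity above: draws from the full posterior are importance-reweighted by R_S to target the case-deleted posterior. The K-L formula log E[R_S] − E[log R_S] follows from that identity for the choice ; the function names and the default weight matrix are ours.

```python
import numpy as np

def bayes_case_deletion_measures(theta_draws, log_R_S, W=None):
    """Sketch of the two Bayesian case-influence measures from one MCMC run.

    log_R_S[j] is log R_S(theta_j): the log-ratio of the observed-data
    likelihoods without and with the cases in S, at draw theta_j from the
    full posterior. Since p(theta | D_[S]) is proportional to
    R_S(theta) * p(theta | D), self-normalised weights exp(log_R_S) turn
    full-posterior draws into draws targeting the case-deleted posterior.

    Returns (K-L type divergence, Cook-type posterior mean distance).
    """
    lw = log_R_S - log_R_S.max()           # stabilised log-weights
    w = np.exp(lw)
    w_norm = w / w.sum()
    # KL(p(.|D) || p(.|D_[S])) = log E[R_S] - E[log R_S] under p(.|D);
    # the max-shift cancels between the two terms.
    kl = np.log(w.mean()) - lw.mean()
    mean_full = theta_draws.mean(axis=0)
    mean_deleted = w_norm @ theta_draws
    d = mean_deleted - mean_full
    if W is None:                          # default: inverse posterior covariance
        W = np.linalg.inv(np.atleast_2d(np.cov(theta_draws.T)))
    return kl, d @ W @ d
```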

6. Model selection of EFNSEMs with MNAR

6.1. EM-based model selection criterion

By the discussion in Section 4.1, the EM algorithm consists of an E-step and an M-step. At the κth iteration of the EM algorithm with the current value of θ, the E-step requires evaluating the Q-function, which is defined as (46) where represents the expectation taken with respect to the conditional distribution , and (47) is called the H-function. The M-step maximises to evaluate . When the EM algorithm converges, we can obtain the following three quantities: , , and observations drawn from the joint conditional distribution . According to the above discussion, has no closed-form expression. Hence, it is impossible to evaluate directly. However, it follows from (46) that , which shows that can be computed from the Q-function and the H-function when the EM algorithm converges. Thus, inspired by the AIC/BIC criteria, we consider the following model selection criterion: (48) where is a penalty function of the data and the fitted model. Different expressions of correspond to different criteria; for instance, when , where d is the dimension of θ, we obtain the AIC, and when , the criterion reduces to the BIC. Generally, is selected by users.

It is easily seen from the definition of that is not a direct byproduct of the EM algorithm and has no closed form. To address the issue, a truncated Hermite approximation is adopted to approximate . For details, see Ibrahim et al. (2008).

Since it is rather cumbersome to evaluate the approximation to the integrand of the H-function, it is desirable to have a model selection criterion that does not involve the H-function and depends only on the byproducts of the EM algorithm. To this end, we consider the following model selection criterion obtained by dropping the H-function from (48): which can be regarded as a reduced form of . When , reduces to the criterion . When , reduces to the criterion .
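A minimal sketch of both criteria, assuming the AIC and BIC penalty forms stated above (2d and d log n); at convergence Q − H recovers the observed-data log-likelihood, and setting the H-term to zero gives the reduced, Q-only criterion.

```python
import numpy as np

def em_ic(q_value, h_value, d, n, penalty="bic"):
    """EM-based model selection criterion of the form (48):
    -2 * (Q - H) + penalty, since Q - H equals the observed-data
    log-likelihood when the EM algorithm has converged. Passing
    h_value=0.0 yields the reduced criterion that drops the H-function.

    d : dimension of theta;  n : sample size.
    """
    pen = 2.0 * d if penalty == "aic" else d * np.log(n)
    return -2.0 * (q_value - h_value) + pen
```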

As Ibrahim et al. (2008) pointed out, the advantage of is that it is more computationally feasible than , while its disadvantage is that it can have poor model selection properties in some settings where the missing-data proportion is high. Generally, we recommend using over .

6.2. Penalised-EM-based model selection criterion

The EM-based model selection criteria presented above often suffer from instability and computational infeasibility when the number of latent variables and covariates is large. To address the issue, various penalisation-based approaches have been proposed over the past years to simultaneously estimate parameters and select important latent variables and covariates. Recently, Tang and Tang (2018) extended penalisation-based approaches to generalised partially nonlinear models with nonignorable missing responses. Here, we extend these approaches to EFNSEMs with nonignorable missing data.

Similar to Tang and Tang (2018), at the κth iteration of the EM algorithm with the current value of θ, the E-step requires evaluating the following penalised Q-function: where and are the numbers of different parameters in and , respectively, , and , and are the th, th and th components of Φ, , is the penalty parameter corresponding to the jth element in for , is the penalty parameter corresponding to the jth element in for , is the penalty parameter corresponding to the jth component in Φ for j=1,2,3, while is the penalty parameter corresponding to the jth element in ϕ for , and , , , and , where , , , and .

The M-step maximises , which is a quite difficult task because is a non-differentiable and non-concave function of θ. To address the issue, following Fan and Li (2001), we maximise the second-order Taylor expansions of , , and at , , and , respectively. Thus, the problem of maximising , , and becomes a penalised weighted least squares regression problem, which can be solved via appropriate optimisation algorithms such as the local quadratic approximation algorithm (Fan & Li, 2001) or the local linear approximation algorithm (Zou & Li, 2008); a sketch of such a penalised least squares step is given below. Note that the estimators obtained by maximising with respect to , with respect to , with respect to Φ and with respect to ϕ, respectively, are not the maximiser of with respect to θ. To address the issue, we present a hybrid algorithm combining the local linear approximation algorithm and the expectation conditional maximisation algorithm (Meng & Rubin, 1993) to find such that . Iterating the above E-step and the adjusted M-step until the convergence of the EM algorithm yields the maximum penalised likelihood estimate of θ.
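The following is a generic sketch of one such penalised step, using the SCAD penalty of Fan and Li (2001) and the local quadratic approximation on a plain least squares surrogate; it is not the EFNSEM Q-function itself, and the default a=3.7 is the value usually recommended for SCAD.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan & Li, 2001) at |beta| = t."""
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def lqa_penalised_ls(X, y, lam, n_iter=50, eps=1e-8):
    """Local quadratic approximation (LQA) for a SCAD-penalised least squares
    step: each iteration replaces the penalty by a quadratic upper bound and
    solves a ridge-type linear system.
    Minimises (1/2n) * ||y - X beta||^2 + sum_j p_lam(|beta_j|)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]         # unpenalised start
    for _ in range(n_iter):
        b = np.maximum(np.abs(beta), eps)
        Sigma = np.diag(scad_derivative(b, lam) / b)    # quadratic bound of penalty
        beta = np.linalg.solve(X.T @ X / n + Sigma, X.T @ y / n)
    beta[np.abs(beta) < 1e-6] = 0.0                     # hard-threshold tiny values
    return beta
```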

To guarantee that the resulting estimator of θ possesses the well-known oracle property, we need to select an appropriate penalty parameter λ. Generally, generalised cross-validation (GCV) and the BIC can be employed to select λ, but it is quite difficult to evaluate these criteria in the presence of missing data because of the intractable multiple integrals involved. Here, the above-presented criterion is used to choose the penalty parameter λ. To wit, the optimal penalty parameter λ can be evaluated by minimising , where is defined in (48).

Without loss of generality, we assume , where and correspond to the nonzero and zero components of θ, respectively. Let be the true value of θ. Thus, . We write the corresponding decomposition of as , where and correspond to the nonzero and zero components of θ. Similar to Tang and Tang (2018), under some regularity conditions, we can show the following:

  1. (Consistency) as ;

  2. (Sparsity) Pr( ) tends to one as .

6.3. Resampling-based variable selection procedure

Following Long and Johnson (2015), we present a resampling approach combining bootstrap imputation and stability selection to select variables in EFNSEMs with nonignorable missing data.

First, we conduct bootstrap imputation on before variable selection, which proceeds as follows. We generate bootstrap data sets with the corresponding missing indicator sets based on the observed data set ; we then conduct imputation for each bootstrap data set and using a selected imputation method, which leads to the imputed data set .

Second, given the bootstrap imputed data sets , we consider the following stability selection procedure. We evaluate by maximising for the kth bootstrap imputed data set , and repeat this for all bootstrap imputed data sets to obtain . Let be the support of (the set of nonzero parameter estimates), which is also called the estimated active set. The final estimated active set is defined as , where , and is a threshold typically taken between 0.6 and 0.9. Based on the selected active set , we calculate the final parameter estimates via , where is computed by maximising for the kth bootstrap imputed data set , which is the sub-dataset of corresponding to .
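A compact sketch of the aggregation step, given the K penalised fits from the bootstrap-imputed data sets; averaging the penalised estimates on the active set is our simplification of the refit step described above, and the names are ours.

```python
import numpy as np

def stability_selection(estimates, pi_threshold=0.8):
    """Combine penalised estimates from K bootstrap-imputed data sets into a
    final active set by selection frequency (cf. Long & Johnson, 2015).

    estimates    : (K, p) array of penalised parameter estimates.
    pi_threshold : selection-frequency cut-off, typically in [0.6, 0.9].
    """
    freq = (estimates != 0.0).mean(axis=0)      # per-variable selection frequency
    active = np.where(freq >= pi_threshold)[0]  # final estimated active set
    theta_final = estimates[:, active].mean(axis=0)
    return active, theta_final
```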

7. Conclusions and discussions

This paper reviews statistical inference for mean functionals and EFNSEMs in the presence of responses missing not at random. Two popular approaches, the EL method and the semiparametric method, have been introduced to estimate mean functionals in the presence of responses missing not at random. The EM algorithm, the Bayesian approach and the Bayesian EL method are presented to estimate parameters in EFNSEMs with nonignorable missing data. Local influence analysis and the case-deletion method are developed to assess the effect of minor perturbations on the quantities of interest and to identify potential influential observations from the frequentist and Bayesian viewpoints, respectively. Also, three model selection approaches, including the EM-based model selection criterion, the penalised-EM-based model selection criterion and the resampling-based variable selection procedure, are proposed to select the best model among candidate EFNSEMs with nonignorable missing data. Although we mainly review developments in EFNSEMs with MNAR, it is straightforward to extend these approaches to other models with MNAR.

Recently, Fang and Shao (2016) presented a new model selection criterion, referred to as the penalised validation criterion, for nonignorable nonresponse with an unspecified propensity score, using the information contained in an instrument. A natural idea is to extend the penalised validation criterion of Fang and Shao (2016) to EFNSEMs in the presence of nonignorable missing data with an unspecified propensity score. To the best of our knowledge, there is little work on this topic. Also, the aforementioned four model selection methods mainly focus on the setting where the number of explanatory variables is fixed and less than or equal to the sample size. Recently, growing-dimensional data analysis has received considerable attention in the absence of missing data. For example, see Fan and Peng (2004) for the nonconcave penalised likelihood in linear regression models; Lam and Fan (2008) for the profile-kernel likelihood in linear regression models; Zou and Zhang (2009) for an adaptive elastic-net procedure in linear regression models; Li, Peng, and Zhu (2011) for a nonconcave penalised M-estimator in linear regression models; Caner and Zhang (2014) for the least-squares-based adaptive elastic net estimator in generalised method of moments (GMM); and Leng and Tang (2012) for the penalised EL method in estimating equations when the likelihood function is unavailable, which, however, is sensitive to model misspecification. To this end, Tang, Yan, and Zhao (2018) presented a penalised exponentially tilted (PET) likelihood for growing-dimensional unconditional moment models in the presence of correlation among variables and model misspecification. Hence, extending the aforementioned model selection techniques to growing-dimensional models with missing data in the presence of model misspecification and outliers is an interesting topic.

Recently, statistical analysis of high/ultrahigh-dimensional data, in which the number of candidate explanatory variables increases at an exponential rate of the sample size but only a small number of explanatory variables have a relatively large effect on the response, has been widely studied. Usually, a two-step procedure is adopted to analyse such datasets. To wit, the first step compresses the candidate explanatory variables to a small number of important ones, and the second step uses existing penalised variable selection techniques, such as the Lasso, SCAD and the adaptive Lasso, to identify the subset of important explanatory variables. To implement the first step efficiently, many feature screening procedures have been proposed in the absence of missing data. For example, see the sure independence screening (SIS) and iterated sure independence screening (ISIS) methods (Fan & Lv, 2008) for linear regression models; the sure independent ranking and screening procedure (Zhu, Li, Li, & Zhu, 2011) for multiple-index models; the nonparametric independence screening method (Fan, Feng, & Song, 2011); the robust rank correlation screening procedure (Li, Peng, Zhang, & Zhu, 2012) for semiparametric models; the conditional-correlation-based feature screening procedure for varying coefficient models; the GEE-based screening procedure (Xu & Chen, 2014); the score-test-based screening framework (Zhao & Li, 2014); and the Pearson correlation sure independence screening procedure based on the partial residual for PLMs (Liu, Lee, & Zhao, 2016). Also, several model-free screening procedures have been developed in recent years in the absence of missing data; for example, see Li et al. (2012), Cui, Li, and Zhong (2015), among others. When responses are subject to missingness, Ding and Wang (2011) presented a fusion-refinement procedure for dimension reduction, and Lai, Liu, Liu, and Wan (2017) presented a model-free screening procedure combining the IPW method and the Kolmogorov filter method. However, Lai et al.'s (2017) method requires estimating the unknown propensity score function, which means that it may be subject to potential misspecification of the respondent probability. Also, Wang and Li (2018) proposed a missing-indicator imputation screening procedure for ultrahigh-dimensional data in the presence of missing data, which may suffer from the well-known ‘curse of dimensionality’. It remains a quite challenging and interesting topic to develop a general model-free approach to implement the first step in the presence of nonignorable missing data.

Although we present several approaches to identify potential influential observations from the frequentist and Bayesian viewpoints, these approaches have been developed for regression models with a fixed dimension of explanatory variables. Hence, it is quite interesting to develop new approaches to detect potential influential observations in growing-dimensional or high/ultrahigh-dimensional models with nonignorable missing data.

Acknowledgements

The authors are grateful to the Editor and two referees for their valuable suggestions that greatly improved the manuscript.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the grants from the National Natural Science Foundation of China (Grant No.: 11671349) and the Key Projects of the National Natural Science Foundation of China (Grant No.: 11731101).

Notes on contributors

Niansheng Tang

Niansheng Tang is a Professor in Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University.

Yuanyuan Ju

Yuanyuan Ju is a Ph.D. student in Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University.

References

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Breakthroughs in statistics. New York: Springer.
  • Bentler, P. M., & Wu, E. J. C. (2002). EQS6.0 for windows user guide. Encino, CA: Multivariate Software.
  • Berger, J. O. (1994). An overview of robust Bayesian analysis. Test, 3, 5–124. doi: 10.1007/BF02562676
  • Caner, M., & Zhang, H. H. (2014). Adaptive elastic net for generalized methods of moments. Journal of Business & Economic Statistics, 32, 30–47. doi: 10.1080/07350015.2013.836104
  • Chang, T., & Kott, P. S. (2008). Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika, 95, 555–571. doi: 10.1093/biomet/asn022
  • Chen, S. X., & Kim, J. K. (2017). Semiparametric fractional imputation using empirical likelihood in survey sampling. Statistical Theory and Related Fields, 1, 69–81. doi: 10.1080/24754269.2017.1328244
  • Chen, J., Sitter, R. R., & Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89, 230–237. doi: 10.1093/biomet/89.1.230
  • Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89, 81–87. doi: 10.1080/01621459.1994.10476448
  • Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 15–18.
  • Cook, R. D. (1986). Assessment of local influence. Journal of the Royal Statistical Society Series B, 48, 133–169.
  • Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.
  • Cui, H. J., Li, R. Z., & Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110, 630–641. doi: 10.1080/01621459.2014.920256
  • Daniels, M. J., & Hogan, J. W. (2008). Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. London: Chapman and Hall.
  • Daniels, M. J., Wang, C., & Marcus, B. H. (2014). Fully Bayesian inference under ignorable missingness in the presence of auxiliary covariates. Biometrics, 70, 62–72. doi: 10.1111/biom.12121
  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39, 1–38.
  • Dey, D. K., Ghosh, S. K., & Lou, K. R. (1996). On local sensitivity measures in Bayesian analysis. Bayesian Robustness, 29, 21–40. doi: 10.1214/lnms/1215453059
  • Ding, X. B., & Wang, Q. H (2011). Fusion-refinement procedure for dimension reduction with missing response at random. Journal of the American Statistical Association, 106, 1193–1207. doi: 10.1198/jasa.2011.tm10573
  • Fan, J., Feng, Y., & Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106, 544–557. doi: 10.1198/jasa.2011.tm09779
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360. doi: 10.1198/016214501753382273
  • Fan, J., & Lv, J. C. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B, 70, 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
  • Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32, 928–961. doi: 10.1214/009053604000000256
  • Fang, F., & Shao, J. (2016). Model selection with nonignorable nonresponse. Biometrika, 103, 861–874. doi: 10.1093/biomet/asw039
  • Gelman, A. (1996). Inference and monitoring convergence. In W. R. Gilks, S. Richardson, & D. J. Speigelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 131–144). London: Chapman and Hall.
  • Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. doi: 10.1109/TPAMI.1984.4767596
  • Gustafson, P. (1996a). Local sensitivity of posterior expectations. Annals of Statistics, 24, 174–195. doi: 10.1214/aos/1033066205
  • Gustafson, P. (1996b). Local sensitivity of inferences to prior marginals. Journal of the American Statistical Association, 91, 774–781. doi: 10.1080/01621459.1996.10476945
  • Hansen, L. P. (1982). Large sample properties of generalised method of moments estimators. Econometrica, 50, 1029–1054. doi: 10.2307/1912775
  • Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. doi: 10.1093/biomet/57.1.97
  • Hens, N., Aerts, M., & Molenberghs, G. (2006). Model selection for incomplete and design-based samples. Statistics in Medicine, 25, 2502–2520. doi: 10.1002/sim.2559
  • Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47, 663–685. doi: 10.1080/01621459.1952.10483446
  • Ibrahim, J. G. (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association, 85, 765–769. doi: 10.1080/01621459.1990.10474938
  • Ibrahim, J. G., Chen, M. H., & Lipsitz, S. R. (2001). Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika, 88, 551–564. doi: 10.1093/biomet/88.2.551
  • Ibrahim, J. G., Chen, M. H., Lipsitz, S. R., & Herring, A. H. (2005). Missing-data methods for generalized linear models. Journal of the American Statistical Association, 100, 332–346. doi: 10.1198/016214504000001844
  • Ibrahim, J. G., Zhu, H. T., & Tang, N. S. (2008). Model selection criterion for missing data problems using the EM algorithm. Journal of the American Statistical Association, 103, 1648–1658. doi: 10.1198/016214508000001057
  • Jansen, I., Molenberghs, G., Aerts, M., Thijs, H., & Van Steen, K. (2003). A local influence approach applied to binary data from a psychiatric study. Biometrics, 59, 410–419. doi: 10.1111/1541-0420.00048
  • Jiang, J. M., Nguyen, T., & Rao, J. S. (2015). The E-MS algorithm: Model selection with incomplete data. Journal of the American Statistical Association, 110, 1136–1147. doi: 10.1080/01621459.2014.948545
  • Jiang, D. P., Zhao, P. Y., & Tang, N. S. (2016). A propensity score adjustment method for regression models with nonignorable missing covariates. Computational Statistics and Data Analysis, 94, 98–119. doi: 10.1016/j.csda.2015.07.017
  • Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: Structural equation modeling with the SIMPLIS command language. Hover: Erlbaum.
  • Kaciroti, N. A., & Raghunathan, T. (2014). Bayesian sensitivity analysis of incomplete data: Bridging pattern-mixture and selection models. Statistics in Medicine, 33, 4841–4857. doi: 10.1002/sim.6302
  • Kim, J. K., & Shao, J. (2013). Statistical methods for handling incomplete data. Boca Raton, FL: Chapman & Hall/CRC.
  • Kim, J. K., & Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106, 157–165. doi: 10.1198/jasa.2011.tm10104
  • Lai, P., Liu, Y. M., Liu, Z., & Wan, Y. (2017). Model free feature screening for ultrahigh dimensional data with responses missing at random. Computational Statistics and Data Analysis, 105, 201–216. doi: 10.1016/j.csda.2016.08.008
  • Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974. doi: 10.2307/2529876
  • Lam, C., & Fan, J. (2008). Profile-kernel likelihood inference with diverging number of parameters. Annals of Statistics, 36, 2232–2260. doi: 10.1214/07-AOS544
  • Lavine, M. (1991). Sensitivity in Bayesian statistics: The prior and the likelihood. Journal of the American Statistical Association, 86, 396–399. doi: 10.1080/01621459.1991.10475055
  • Lee, S. Y., & Tang, N. S. (2004). Local influence analysis of nonlinear structural equation models. Psychometrika, 69, 573–592. doi: 10.1007/BF02289856
  • Lee, S. Y., & Tang, N. S. (2006a). Bayesian analysis of structural equation models with mixed exponential family and ordered categorical data. British Journal of Mathematical and Statistical Psychology, 59, 151–172. doi: 10.1348/000711005X81403
  • Lee, S. Y., & Tang, N. S. (2006b). Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika, 71, 541–564. doi: 10.1007/s11336-006-1177-1
  • Lee, S. Y., & Tang, N. S. (2006c). Analysis of nonlinear structural equation models with nonignorable missing covariates and ordered categorical data. Statistica Sinica, 16, 1117–1141.
  • Lee, S. Y., & Zhu, H. T. (2002). Maximum likelihood estimation of nonlinear structural equation models. Psychometrika, 67, 189–210. doi: 10.1007/BF02294842
  • Leng, C., & Tang, C. Y. (2012). Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika, 99, 703–716. doi: 10.1093/biomet/ass014
  • Lesaffre, E., & Verbeke, G. (1998). Local influence in linear mixed models. Biometrics, 54, 570–582. doi: 10.2307/3109764
  • Li, G., Peng, H., Zhang, J., & Zhu, L. X. (2012). Robust rank correlation based screening. The Annals of Statistics, 40, 1846–1877. doi: 10.1214/12-AOS1024
  • Li, G., Peng, H., & Zhu, L. (2011). Nonconcave penalized M-estimation with a diverging number of parameters. Statistica Sinica, 21, 391–419.
  • Li, R., Zhong, W., & Zhu, L. P. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139. doi: 10.1080/01621459.2012.695654
  • Liang, H., Wang, S., & Carroll, R. J. (2007). Partially linear models with missing response variables and error-prone covariates. Biometrika, 94, 185. doi: 10.1093/biomet/asm010
  • Linero, A. R., & Daniels, M. J. (2015). A flexible Bayesian approach to monotone missing data in longitudinal studies with nonignorable missingness with application to an acute Schizophrenia clinical trial. Journal of the American Statistical Association, 110, 45–55. doi: 10.1080/01621459.2014.969424
  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: Wiley.
  • Liu, T. Q., Lee, K. Y., & Zhao, H. Y. (2016). Ultrahigh dimensional feature selection via Kernel canonical correlation analysis. arXiv preprint arXiv:1604.07354.
  • Liu, C., & Rubin, D. B. (1994). The ECME algorithm: A simple extension of em and ecm with faster monotone convergence. Biometrika, 81, 633–648. doi: 10.1093/biomet/81.4.633
  • Long, Q., & Johnson, B. A. (2015). Variable selection in the presence of missing data: Resampling and imputation. Biostatistics (Oxford, England), 16, 596–610. doi: 10.1093/biostatistics/kxv003
  • Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B, 44, 226–233.
  • McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman & Hall.
  • Meng, X. L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–278. doi: 10.1093/biomet/80.2.267
  • Meng, X. L., & Schilling, S. (1996). Fitting full-information item factor models and an empirical investigation of bridge sampling. Journal of the American Statistical Association, 91, 1254–1267. doi: 10.1080/01621459.1996.10476995
  • Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092. doi: 10.1063/1.1699114
  • Millar, R. B., & Stewart, W. S. (2007). Assessment of locally influential observations in Bayesian models. Bayesian Analysis, 2, 365–383. doi: 10.1214/07-BA216
  • Muthén, L. K., & Muthén, B. O. (2017). Mplus user's guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
  • Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237–249. doi: 10.1093/biomet/75.2.237
  • Owen, A. B. (2001). Empirical likelihood. New York: Chapman and Hall/CRC.
  • Poon, W. Y., & Poon, Y. S. (1999). Conformal normal curvature and assessment of local influence. Journal of the Royal Statistical Society Series B, 61, 51–61. doi: 10.1111/1467-9868.00162
  • Qin, J., Leung, D., & Shao, J. (2002). Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97, 193–200. doi: 10.1198/016214502753479338
  • Qin, J., Zhang, B., & Leung, D. H. Y. (2009). Empirical likelihood in missing data problems. Journal of the American Statistical Association, 104, 1492–1503. doi: 10.1198/jasa.2009.tm08163
  • Riddles, M. K. (2013). Propensity score adjusted method for missing data (Doctoral dissertation). Iowa State University.
  • Riddles, M. K., Kim, J. K., & Im, J. (2016). A propensity-score-adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology, 4, 215–245. doi: 10.1093/jssam/smv047
  • Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89, 846–866. doi: 10.1080/01621459.1994.10476818
  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. doi: 10.1093/biomet/63.3.581
  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. doi: 10.1214/aos/1176344136
  • Shao, J., & Wang, L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika, 103, 175–187. doi: 10.1093/biomet/asv071
  • Shen, C. W., & Chen, Y. H. (2012). Model selection for generalized estimating equations accommodating dropout missingness. Biometrics, 68, 1046–1054. doi: 10.1111/j.1541-0420.2012.01758.x
  • Shen, C. W., & Chen, Y. H. (2013). Model selection of generalized estimating equations with multiply imputed longitudinal data. Biometrical Journal, 55, 899–911. doi: 10.1002/bimj.201200236
  • Shi, X. Y., Zhu, H. T., & Ibrahim, J. G. (2009). Local influence for generalized linear models with missing covariates. Biometrics, 65, 1164–1174. doi: 10.1111/j.1541-0420.2008.01179.x
  • Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2003). WinBUGS user manual (Version 1.4). Cambridge: MRC Biostatistics Unit.
  • Stute, W., Xue, L., & Zhu, L. (2007). Empirical likelihood inference in nonlinear errors-in-covariables models with validation data. Journal of the American Statistical Association, 102, 332–346. doi: 10.1198/016214506000000816
  • Tang, N. S., Chow, S. M., Ibrahim, J. G., & Zhu, H. T. (2017). Bayesian sensitivity analysis of a nonlinear dynamic factor analysis model with nonparametric prior and possible nonignorable missingness. Psychometrika, 82, 1–29. doi: 10.1007/s11336-017-9587-4
  • Tang, N. S., & Tang, L. (2018). Estimation and variable selection in generalized partially nonlinear models with nonignorable missing responses. Statistics and its Interface, 11, 1–18. doi: 10.4310/SII.2018.v11.n1.a1
  • Tang, N., Yan, X., & Zhao, P. (2018). Exponentially tilted likelihood inference on growing dimensional unconditional moment models. Journal of Econometrics, 202, 57–74. doi: 10.1016/j.jeconom.2017.08.018
  • Tang, N. S., & Zhao, P. Y. (2013). Empirical likelihood-based inference in nonlinear regression models with missing responses at random. Statistics, 47, 1141–1159. doi: 10.1080/02331888.2012.658807
  • Tang, N. S., Zhao, P. Y., & Zhu, H. T. (2014). Empirical likelihood for estimating equations with nonignorably missing data. Statistica Sinica, 24, 723.
  • Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American statistical Association, 82, 528–540. doi: 10.1080/01621459.1987.10478458
  • Troxel, A. B., Ma, G., & Heitjan, D. F. (2004). An index of local sensitivity to nonignorability. Statistica Sinica, 14, 1221–1237.
  • Verbeke, G., Molenberghs, G., Thijs, H., Lesaffre, E., & Kenward, M. G. (2001). Sensitivity analysis for nonrandom dropout: A local influence approach. Biometrics, 57, 7–14. doi: 10.1111/j.0006-341X.2001.00007.x
  • Wang, D., & Chen, S. X (2009). Empirical likelihood for estimating equations with missing values. Annals of Statistics, 37, 490–517. doi: 10.1214/07-AOS585
  • Wang, Q. H., & Li, Y. J. (2018). How to make model-free feature screening approaches for full data applicable to the case of missing response? Scandinavian Journal of Statistics, 45, 324–346. doi: 10.1111/sjos.12290
  • Wang, Q. H., & Qin, Y. S. (2010). Empirical likelihood confidence bands for distribution functions with missing responses. Journal of Statistical Planning and Inference, 140, 2778–2789. doi: 10.1016/j.jspi.2010.03.044
  • Wang, Q. H., & Rao, J. N. K. (2002). Empirical likelihood-based inference under imputation for missing response data. Annals of Statistics, 30, 896–924. doi: 10.1214/aos/1028674845
  • Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24, 1097–1116.
  • Wei, B. C. (1998). Exponential family nonlinear models. Singapore: Springer-Verlag.
  • Wei, Y., Ma, Y. Y., & Carroll, R. J. (2012). Multiple imputation in quantile regression. Biometrika, 99, 423–438. doi: 10.1093/biomet/ass007
  • Wei, G. G., & Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85, 699–704. doi: 10.1080/01621459.1990.10474930
  • Xu, C., & Chen, J. H. (2014). The sparse MLE for ultrahigh-dimensional feature screening. Journal of the American Statistical Association, 109, 1257–1269. doi: 10.1080/01621459.2013.879531
  • Xue, L. (2009). Empirical likelihood for linear models with missing responses. Journal of Multivariate Analysis, 100, 1353–1366. doi: 10.1016/j.jmva.2008.12.009
  • Zhang, Y. Q., & Tang, N. S. (2017). Bayesian local influence analysis of general estimating equations with nonignorable missing data. Computational Statistics and Data Analysis, 105, 184–200. doi: 10.1016/j.csda.2016.08.010
  • Zhao, S. D. H., & Li, Y. (2014). Score test variable screening. Biometrics, 70, 862–871. doi: 10.1111/biom.12209
  • Zhao, Y. Y., & Tang, N. S. (2015). Maximum-likelihood estimation and influence analysis in multivariate skew-normal reproductive dispersion mixed models for longitudinal data. Statistics, 49, 1348–1365.
  • Zhao, P. Y., Tang, M. L., & Tang, N. S. (2013). Robust estimation of distribution functions and quantiles with non-ignorable missing data. Canadian Journal of Statistics, 41, 575–595. doi: 10.1002/cjs.11195
  • Zhao, H., Zhao, P. Y., & Tang, N. S. (2013). Empirical likelihood inference for mean functionals with nonignorably missing response data. Computational Statistics and Data Analysis, 66, 101–116. doi: 10.1016/j.csda.2013.03.023
  • Zhu, H. T., Ibrahim, J. G., & Chen, M. H. (2015). Diagnostic measures for the cox regression model with missing covariates. Biometrika, 102, 907–923. doi: 10.1093/biomet/asv047
  • Zhu, H. T., Ibrahim, J. G., Cho, H., & Tang, N. S. (2012). Bayesian case influence measures for statistical models with missing data. Journal of Computational and Graphical Statistics, 21, 253–271. doi: 10.1198/jcgs.2011.10139
  • Zhu, H. T., Ibrahim, J. G., & Tang, N. S. (2011). Bayesian influence analysis: A geometric approach. Biometrika, 98, 307–323. doi: 10.1093/biomet/asr009
  • Zhu, H. T., Ibrahim, J. G., & Tang, N. S. (2014). Bayesian sensitivity analysis of statistical models with missing data. Statistica Sinica, 24, 871–896.
  • Zhu, H. T., & Lee, S. Y. (2001). Local influence for incomplete-data models. Journal of the Royal Statistical Society, 63, 111–126. doi: 10.1111/1467-9868.00279
  • Zhu, L. P., Li, L. X., Li, R. Z., & Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464–1475. doi: 10.1198/jasa.2011.tm10563
  • Zou, H., & Li, R. Z. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533. doi: 10.1214/009053607000000802
  • Zou, H., & Zhang, H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733–1751. doi: 10.1214/08-AOS625

Appendix. Expressions of and

where , is the kth component of , is the kth row vector of matrix , is the kth row vector of Φ.
