SHORT COMMUNICATIONS

Statistical methods without estimating the missingness mechanism: a discussion of ‘statistical inference for nonignorable missing data problems: a selective review’ by Niansheng Tang and Yuanyuan Ju

Pages 143-145 | Received 04 Sep 2018, Accepted 09 Sep 2018, Published online: 20 Sep 2018

First of all, I wholeheartedly congratulate Tang and Ju (referred to as TJ hereafter) on a well-written comprehensive review paper that surveys cutting-edge statistical theory and methodology relevant to estimation, influence analysis and model selection in regression models with missing data.

TJ begin their presentation with the missing data mechanism, a fundamental concept in the missing data literature (Kim and Shao, 2013; Little and Rubin, 2002; Molenberghs et al., 2014; Tsiatis, 2006). In their Section 2, TJ present a detailed explanation of this definition and underline its importance for developing downstream statistical methodology. To facilitate the discussion, I adopt the same notation, as follows. Consider a regression model where $Y$ is a response variable and $\mathbf{x}$ is a $p$-dimensional explanatory variable, and $\{(y_i, \mathbf{x}_i), i = 1, \ldots, n\}$ are $n$ independent and identically distributed realisations of $(Y, \mathbf{x})$. Assume $\mathbf{x}$ is always fully observed but $Y$ is subject to missingness. Let $\delta$ be the missing data indicator for $Y$, that is, $\delta = 0$ if $Y$ is missing, and $\delta = 1$ otherwise. Then the missing data mechanism is the conditional distribution of $\delta$ given $\mathbf{x}$ and $Y$, i.e.

$$\mathrm{pr}(\delta = 1 \mid \mathbf{x}, y). \tag{1}$$

One intrinsic complication of the missing data mechanism is that, except in a few scenarios (d'Haultfoeuille, 2010; Little, 1988), its underlying truth is difficult to verify. The reason is its plausible dependence on $Y$, an incompletely observed variable. The issue becomes more pronounced in real applications, where investigators would prefer a statistical method that imposes less stringent assumptions on the mechanism, so that it can be applied flexibly to various scenarios.

My discussion, motivated by the need to develop versatile statistical procedures that provide robust protection against certain mechanism misspecification, showcases up-to-date statistical treatments in which the mechanism model assumption is imposed only at a minimum level. The discussion gives a brief introduction to two types of these assumptions and spans diverse statistical topics including model identification, point estimation, hypothesis testing and high-dimensional variable selection.

One distinct feature of the methods in this discussion is that the mechanism model is treated as a nuisance; hence all the methods can be carried out without estimating the mechanism.

1. Mechanism based on conditional independence

The instrumental variable is a well-studied tool in econometrics, epidemiology and related disciplines. The key to applying it is a conditional independence requirement among the variables. Zhao and Shao (2015) proposed to take advantage of a nonresponse instrument $\mathbf{z}$, a component of $\mathbf{x}$, to analyse missing data, especially nonignorable missing data. The concept of a nonresponse instrument is similar in spirit to that of the instrumental variable. To be more specific, Zhao and Shao (2015) assumed that

$$\mathrm{pr}(\delta = 1 \mid \mathbf{x}, y) = \mathrm{pr}(\delta = 1 \mid \mathbf{u}, y), \tag{2}$$

where $\mathbf{x} = (\mathbf{u}^{\mathrm{T}}, \mathbf{z}^{\mathrm{T}})^{\mathrm{T}}$. Some further requirement, e.g. that $Y$ depends on $\mathbf{z}$ given $\mathbf{u}$, is also needed for model identification.

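When $\mathbf{x}$ by itself serves as the nonresponse instrument, so that the mechanism depends on $y$ only, Tang, Little, and Raghunathan (2003) studied this special situation and proposed to estimate the unknown parameter $\theta$ in $p(y \mid \mathbf{x}; \theta)$ through the conditional likelihood of $\mathbf{x}$ given $y$:

$$p(\mathbf{x} \mid y, \delta = 1) = p(\mathbf{x} \mid y) = \frac{p(y \mid \mathbf{x}; \theta)\, g(\mathbf{x})}{\int p(y \mid \mathbf{t}; \theta)\, g(\mathbf{t})\, d\mathbf{t}},$$

where $g(\cdot)$ represents the unspecified probability density function of $\mathbf{x}$. The objective function is then semiparametric:

$$\prod_{i:\, \delta_i = 1} \frac{p(y_i \mid \mathbf{x}_i; \theta)}{\int p(y_i \mid \mathbf{t}; \theta)\, g(\mathbf{t})\, d\mathbf{t}}.$$

To solve for $\theta$, an estimator of $g$ is needed. Three straightforward choices could be considered: the true $g$; a parametric $g(\mathbf{x}; \hat{\alpha})$, with $\alpha$ estimated by $\hat{\alpha}$ through the full data likelihood method; and a nonparametric version, with the cumulative distribution function of $\mathbf{x}$ estimated by its empirical version. These three alternatives lead to three different pseudolikelihood estimators of $\theta$, denoted here by $\hat{\theta}_1$, $\hat{\theta}_2$ and $\hat{\theta}_3$ respectively. At first sight, one would believe that $\hat{\theta}_1$ is superior to the other two in terms of estimation efficiency. However, Tang et al. (2003) showed that $\hat{\theta}_1$ is always less efficient than $\hat{\theta}_2$. In a recent paper, Zhao and Ma (2018) further proved that $\hat{\theta}_2$ is always less efficient than $\hat{\theta}_3$ and that no other method could lead to a more efficient estimator than $\hat{\theta}_3$; hence $\hat{\theta}_3$ is optimal.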
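As a rough illustration of the nonparametric variant, the sketch below evaluates a log pseudolikelihood in which the unknown covariate distribution is replaced by the empirical distribution of all $n$ observed covariates. It is not the implementation of Tang et al. (2003): the normal linear working model, the scalar covariate, the function names and the simulated mechanism are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_pseudolikelihood(theta, xs, ys, deltas):
    """Log pseudolikelihood of x given y over the respondents.

    theta = (b0, b1, sd) indexes the assumed working model y | x ~ N(b0 + b1*x, sd^2).
    The integral over the covariate distribution is replaced by an average over
    all n covariates (respondents and nonrespondents alike).
    ys[i] is never read when deltas[i] == 0.
    """
    b0, b1, sd = theta
    n = len(xs)
    total = 0.0
    for x_i, y_i, d_i in zip(xs, ys, deltas):
        if d_i == 0:
            continue  # response missing: this subject contributes only through xs
        numer = normal_pdf(y_i, b0 + b1 * x_i, sd)
        denom = sum(normal_pdf(y_i, b0 + b1 * x_k, sd) for x_k in xs) / n
        total += math.log(numer) - math.log(denom)
    return total


if __name__ == "__main__":
    # Hypothetical simulation: the response probability depends on y only.
    random.seed(0)
    n = 200
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + random.gauss(0.0, 1.0) for x in xs]
    deltas = [1 if random.random() < 1.0 / (1.0 + math.exp(y - 1.0)) else 0
              for y in ys]
    # Crude grid search over the slope, holding b0 and sd at their true values.
    grid = [0.5 + 0.1 * k for k in range(31)]
    best_b1 = max(grid, key=lambda b1: log_pseudolikelihood((1.0, b1, 1.0), xs, ys, deltas))
    print("slope maximising the pseudolikelihood on the grid:", best_b1)
```

When the slope is zero the numerator and denominator coincide for every respondent, so the pseudolikelihood is flat there; all information about the regression comes from how $y$ varies with $\mathbf{x}$ among respondents.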

Other work along this line includes Miao and Tchetgen Tchetgen (2016), who explored different types of doubly robust estimators, and Fang, Zhao, and Shao (2018), who extended the idea to missing covariates and proposed an imputation approach based on estimating equations.

2. Mechanism based on statistical chromatography

The other unspecified missing data mechanism investigated in the literature assumes a decomposable model

$$\mathrm{pr}(\delta = 1 \mid \mathbf{x}, y) = s(\mathbf{x})\, t(y), \tag{3}$$

where $s(\cdot)$ and $t(\cdot)$ are two unspecified functions. Clearly, MCAR (both $s(\mathbf{x})$ and $t(y)$ constant) and MAR ($t(y)$ constant) are special cases of this assumption. When $s(\mathbf{x})$ is constant, it becomes the case discussed in Section 1 where $\mathbf{x}$ on its own serves as the nonresponse instrument.

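A pivotal observation following (3) is that $p(y \mid \mathbf{x}, \delta = 1)$ and $p(y \mid \mathbf{x})$ can be bridged as

$$p(y \mid \mathbf{x}, \delta = 1) = \frac{s(\mathbf{x})}{\mathrm{pr}(\delta = 1 \mid \mathbf{x})}\, t(y)\, p(y \mid \mathbf{x}),$$

where $s(\mathbf{x})$ and $t(y)$ denote the two factors in (3). Note that the ratio $p(y \mid \mathbf{x}, \delta = 1) / p(y \mid \mathbf{x})$ remains a function of $\mathbf{x}$ only multiplied by a function of $y$ only. Using the idea of the conditional likelihood (Kalbfleisch, 1978), that is, decomposing the observed $y_i$'s into their rank statistic and order statistic and considering the likelihood conditional on the order statistic, Liang and Qin (2000) proposed the following objective function for estimating the parameter $\theta$ of $p(y \mid \mathbf{x}; \theta)$:

$$\prod_{1 \le i < j \le m} \frac{p(y_i \mid \mathbf{x}_i; \theta)\, p(y_j \mid \mathbf{x}_j; \theta)}{p(y_i \mid \mathbf{x}_i; \theta)\, p(y_j \mid \mathbf{x}_j; \theta) + p(y_i \mid \mathbf{x}_j; \theta)\, p(y_j \mid \mathbf{x}_i; \theta)}, \tag{4}$$

where, without loss of generality, the first $m$ subjects are fully observed.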
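The cancellation behind this conditioning argument can be written out for a single pair of respondents. The following is a sketch in assumed notation: the two unspecified factors of the decomposable mechanism are labelled $s(\mathbf{x})$ and $t(y)$, and $c(\mathbf{x})$ is shorthand for $\mathrm{pr}(\delta = 1 \mid \mathbf{x})$; these labels are introduced here for illustration only.

```latex
% Sketch under the assumed labels pr(delta = 1 | x, y) = s(x) t(y) and c(x) = pr(delta = 1 | x).
% Step 1: density of y among respondents.
\[
p(y \mid \mathbf{x}, \delta = 1)
  = \frac{s(\mathbf{x})\, t(y)\, p(y \mid \mathbf{x}; \theta)}{c(\mathbf{x})},
\qquad
c(\mathbf{x}) = s(\mathbf{x}) \int t(u)\, p(u \mid \mathbf{x}; \theta)\, du .
\]
% Step 2: for respondents i and j, condition on the unordered pair {y_i, y_j}.
% Both orderings carry the common factor
%   s(x_i) s(x_j) t(y_i) t(y_j) / { c(x_i) c(x_j) },
% which cancels between numerator and denominator.
\[
\frac{p(y_i \mid \mathbf{x}_i, \delta = 1)\, p(y_j \mid \mathbf{x}_j, \delta = 1)}
     {p(y_i \mid \mathbf{x}_i, \delta = 1)\, p(y_j \mid \mathbf{x}_j, \delta = 1)
      + p(y_j \mid \mathbf{x}_i, \delta = 1)\, p(y_i \mid \mathbf{x}_j, \delta = 1)}
  = \frac{p(y_i \mid \mathbf{x}_i; \theta)\, p(y_j \mid \mathbf{x}_j; \theta)}
         {p(y_i \mid \mathbf{x}_i; \theta)\, p(y_j \mid \mathbf{x}_j; \theta)
          + p(y_j \mid \mathbf{x}_i; \theta)\, p(y_i \mid \mathbf{x}_j; \theta)} .
\]
```

The right-hand side involves only the regression model, which is why neither factor of the mechanism needs to be specified or estimated.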

The key here is that we model the data at the more refined granularity of rank and order statistics, so that sophisticated conditioning arguments can be applied to separate the parameter of interest from the nuisance components. Hence we call this procedure statistical chromatography.

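We elaborate under the generalised linear model framework where $p(y \mid \mathbf{x}; \theta) = \exp[\{y\eta - b(\eta)\}/\phi + c(y, \phi)]$ with link function structure $\eta = \eta(\alpha + \beta^{\mathrm{T}} \mathbf{x})$. With the canonical link, maximising (4) is equivalent to minimising

$$\sum_{1 \le i < j \le m} \log\{1 + \exp(-y_{ij}\, \mathbf{x}_{ij}^{\mathrm{T}} \gamma)\},$$

where $y_{ij} = y_i - y_j$, $\mathbf{x}_{ij} = \mathbf{x}_i - \mathbf{x}_j$ and $\gamma = \beta/\phi$. Hence, to compensate for missing data, we can estimate only $\gamma$ as opposed to the whole unknown parameter $\theta = (\alpha, \beta^{\mathrm{T}}, \phi)^{\mathrm{T}}$. Although only $\gamma$ is estimable, testing $H_0\colon \beta = 0$ versus $H_1\colon \beta \neq 0$ can still be carried out since the null hypothesis is equivalent to $\gamma = 0$. Constructing the Wald-type test statistic requires the asymptotic distribution of the estimator of $\gamma$ under this scheme (Zhao and Shao, 2017). With a noncanonical link, Zhao and Shao (2017) showed that, interestingly, the whole unknown parameter $\theta$ is estimable under some situations.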
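To make the canonical-link objective concrete, here is a minimal sketch for a scalar covariate. It assumes the pairwise loss takes the logistic-type form $\sum_{i<j} \log[1 + \exp\{-(y_i - y_j)(x_i - x_j)\gamma\}]$ over respondents; the function names, the bisection solver and the simulated mechanism are illustrative choices, not the procedure of Zhao and Shao (2017).

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pair_weights(xs, ys):
    """w_ij = (y_i - y_j) * (x_i - x_j) for all pairs i < j (scalar covariate)."""
    m = len(xs)
    return [(ys[i] - ys[j]) * (xs[i] - xs[j])
            for i in range(m) for j in range(i + 1, m)]


def pairwise_loss(gamma, xs, ys):
    """Sum over respondent pairs of log{1 + exp(-w_ij * gamma)}."""
    return sum(softplus(-w * gamma) for w in pair_weights(xs, ys))


def pairwise_score(gamma, xs, ys):
    """Derivative of pairwise_loss with respect to gamma."""
    return sum(-w * sigmoid(-w * gamma) for w in pair_weights(xs, ys))


def minimise_pairwise_loss(xs, ys, lo=-10.0, hi=10.0, iters=60):
    """Bisection on the score; the loss is convex, so the score is nondecreasing."""
    if not (pairwise_score(lo, xs, ys) < 0.0 < pairwise_score(hi, xs, ys)):
        raise ValueError("minimiser not bracketed by [lo, hi]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if pairwise_score(mid, xs, ys) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical simulation: normal linear model with beta = 2 and phi = 1,
    # and a response probability of the product form s(x) * t(y).
    random.seed(1)
    data = []
    for _ in range(200):
        x = random.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + random.gauss(0.0, 1.0)
        prob_observed = sigmoid(1.0 + x) * sigmoid(2.0 - 0.5 * y)
        if random.random() < prob_observed:
            data.append((x, y))  # keep respondents only
    xs_obs = [x for x, _ in data]
    ys_obs = [y for _, y in data]
    print("estimate of gamma = beta/phi:", minimise_pairwise_loss(xs_obs, ys_obs))
```

Only differences $y_i - y_j$ and $x_i - x_j$ enter the loss, which is one way to see that the intercept and the dispersion cannot be separated from the slope under the canonical link.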

Finally, I would like to point out a regularisation approach for high-dimensional variable selection with missing data that builds on this idea. The essential idea is to identify 'important' variables according to whether the corresponding estimated coefficient equals zero or not. The penalised likelihood function is

$$\ell(\gamma) + \sum_{k=1}^{p} p_{\lambda}(|\gamma_k|),$$

where $\ell(\gamma)$ denotes the pairwise objective minimised above, viewed as a function of the estimable regression parameter $\gamma$, $p_{\lambda}(\cdot)$ could be any penalty function, and $\lambda$ is the tuning parameter. Zhao et al. (2018) proved that selection consistency remains valid when $p$ grows exponentially fast with $n$ (the precise rate condition is given in their paper). In the penalised likelihood approach to variable selection, the determination of the tuning parameter is also critical. Zhao and Yang (2017) further studied some stability-enhanced tuning parameter selection methods following this approach.
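A penalised version can be sketched along the same lines. The example below assumes the LASSO penalty $p_\lambda(|\gamma_k|) = \lambda |\gamma_k|$, an averaged pairwise logistic-type loss, and a plain proximal gradient solver; none of these choices is taken from Zhao et al. (2018), and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v| (the LASSO penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def pair_features(X, y):
    """w_ij = (y_i - y_j) * (x_i - x_j), a p-vector for each pair i < j."""
    m = len(y)
    return [[(y[i] - y[j]) * (a - b) for a, b in zip(X[i], X[j])]
            for i in range(m) for j in range(i + 1, m)]


def lasso_pairwise(X, y, lam, iters=200):
    """Proximal gradient for mean_ij log{1 + exp(-w_ij' gamma)} + lam * ||gamma||_1."""
    W = pair_features(X, y)
    n_pairs, p = len(W), len(W[0])
    # Upper bound on the Lipschitz constant of the gradient of the smooth part.
    lip = sum(sum(v * v for v in w) for w in W) / (4.0 * n_pairs)
    step = 1.0 / lip if lip > 0 else 1.0
    gamma = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for w in W:
            s = sigmoid(-sum(a * g for a, g in zip(w, gamma)))
            for k in range(p):
                grad[k] -= w[k] * s / n_pairs
        gamma = [soft_threshold(g - step * d, step * lam)
                 for g, d in zip(gamma, grad)]
    return gamma


if __name__ == "__main__":
    # Hypothetical simulation: four covariates, only the first one is active,
    # and the response probability has the product form s(x) * t(y).
    random.seed(2)
    X_obs, y_obs = [], []
    for _ in range(150):
        x = [random.gauss(0.0, 1.0) for _ in range(4)]
        y = 1.5 * x[0] + random.gauss(0.0, 1.0)
        if random.random() < sigmoid(1.0 + x[1]) * sigmoid(2.0 - y):
            X_obs.append(x)
            y_obs.append(y)
    print("penalised estimate:", lasso_pairwise(X_obs, y_obs, lam=0.05))
```

Coordinates that end at exactly zero are the ones the procedure treats as unimportant; how many survive depends on the tuning parameter, which is why its selection matters.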

Disclosure statement

No potential conflict of interest was reported by the author.

Additional information

Funding

This work was supported by the National Center for Advancing Translational Sciences [UL1TR001412].

Notes on contributors

Jiwei Zhao

Jiwei Zhao is an Assistant Professor in the Department of Biostatistics at the State University of New York at Buffalo. He mainly works on statistical problems motivated by various disciplines such as mental health, orthopaedics and sports medicine, women's health, aging research and the use of electronic medical records. He is generally interested in nonignorable missing data, semiparametric theory, nonregular likelihoods and semisupervised learning.

References

  • d'Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154, 1–15. doi: 10.1016/j.jeconom.2009.06.005
  • Fang, F., Zhao, J., & Shao, J. (2018). Imputation-based adjusted score equations in generalized linear models with nonignorable missing covariate values. Statistica Sinica, 28.
  • Kalbfleisch, J. D. (1978). Likelihood methods and nonparametric tests. Journal of the American Statistical Association, 73, 167–170. doi: 10.1080/01621459.1978.10480021
  • Kim, J. K., & Shao, J. (2013). Statistical Methods for Handling Incomplete Data. Boca Raton, FL: Chapman & Hall/CRC.
  • Liang, K.-Y., & Qin, J. (2000). Regression analysis under non-standard situations: a pairwise pseudolikelihood approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 773–786. doi: 10.1111/1467-9868.00263
  • Little, R. J. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198–1202. doi: 10.1080/01621459.1988.10478722
  • Little, R. J., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken, NJ: Wiley.
  • Miao, W., & Tchetgen Tchetgen, E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103, 475–482. doi: 10.1093/biomet/asw016
  • Molenberghs, G., Fitzmaurice, G., Kenward, M. G., Tsiatis, A. A., & Verbeke, G. (2014). Handbook of Missing Data Methodology. Boca Raton, FL: Chapman & Hall/CRC Press.
  • Tang, G., Little, R. J., & Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90, 747–764. doi: 10.1093/biomet/90.4.747
  • Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. New York, NY: Springer.
  • Zhao, J., & Ma, Y. (2018). Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 105, 479–486. doi: 10.1093/biomet/asy007
  • Zhao, J., & Shao, J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110, 1577–1590. doi: 10.1080/01621459.2014.983234
  • Zhao, J., & Shao, J. (2017). Approximate conditional likelihood for generalized linear models with general missing data mechanism. Journal of Systems Science and Complexity, 30, 139–153. doi: 10.1007/s11424-017-6188-3
  • Zhao, J., & Yang, Y. (2017). Tuning parameter selection in the LASSO with unspecified propensity. In D.-G. Chen, Z. Jin, G. Li, Y. Li, A. Liu, & Y. Zhao (Eds.), New Advances in Statistics and Data Science (pp. 109–125). New York, NY: Springer.
  • Zhao, J., Yang, Y., & Ning, Y. (2018). Penalized pairwise pseudo likelihood for variable selection with nonignorable missing data. Statistica Sinica, 28.
