SHORT COMMUNICATIONS

Variable screening with missing covariates: a discussion of ‘statistical inference for nonignorable missing data problems: a selective review’ by Niansheng Tang and Yuanyuan Ju

Fang Fang & Lyu Ni
Pages 134-136 | Received 20 Aug 2018, Accepted 09 Sep 2018, Published online: 22 Sep 2018

ABSTRACT

Feature screening with missing data is a critical problem but has not been well addressed in the literature. In this discussion we propose a new screening index based on “information value” and apply it to feature screening with missing covariates.

We thank Tang and Ju for their extensive review of methods for dealing with a challenging statistical problem: missing data. The methods discussed in the paper mainly focus on low-dimensional data. In its discussion section, the paper mentions feature screening with missing data, which is a critical research topic that has not been well addressed in the literature.

Several works have addressed feature screening with responses missing at random. For example, Lai, Liu, Liu, and Wan (2017) used an inverse probability weighting method to recover the screening indexes when missing data exist. Wang and Li (2018) proposed a missing indicator imputation screening procedure, noting that the set of active covariates for the response is a subset of the active covariates for the product of the response and the missingness indicator. There are two possible directions for further work on feature screening with missing data. The first is to consider screening with a nonignorable missing response, which could be quite challenging. The second is to consider screening with missing covariates.

Missing covariate data are common in health and biomedical studies such as clinical trials, observational studies, environmental studies, and health surveys. How to conduct feature screening when some covariates are missing is an interesting problem. While the problem is difficult to solve in general, there are special cases in which screening with missing covariates is possible. Here we discuss one special case: the response Y is binary and all the covariates are categorical. If there are no missing data, the PC-SIS of Huang, Li, and Wang (2014), the IG-SIS of Ni and Fang (2016), and the APC-SIS of Ni, Fang, and Wan (2017) all have the sure screening property (Fan & Lv, 2008). In addition to these methods, we propose a new screening index, the “information value”, which is defined as

$$IV(X) = \sum_{k=1}^{K} \{P(X = k \mid Y = 1) - P(X = k \mid Y = 2)\} \log \frac{P(X = k \mid Y = 1)}{P(X = k \mid Y = 2)}, \quad (1)$$

where Y is the binary response with values 1 or 2 and X is a categorical covariate with values $\{1, \dots, K\}$. It is easy to see that $IV(X) = 0$ if and only if X and Y are statistically independent. If we select the covariates with the largest d estimated IV values as the active covariates, it is not hard to show that this screening procedure has the sure screening property.
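As an illustration of how such an index could be estimated from complete data, the following is a minimal sketch rather than the authors' implementation. It assumes the standard form of the information value, the sum over categories k of {P(X = k | Y = 1) − P(X = k | Y = 2)} log{P(X = k | Y = 1) / P(X = k | Y = 2)}, and plugs in empirical frequencies. The function names and the additive smoothing constant `smooth` are hypothetical and are introduced only to keep the logarithm finite when a cell is empty.

```python
import math
from collections import Counter

def information_value(x, y, smooth=0.5):
    """Plug-in IV estimate for a categorical x and a binary y coded 1 or 2.

    `smooth` is an assumed additive-smoothing constant (not from the source)
    that avoids log(0) when a category is unobserved in one class.
    """
    levels = sorted(set(x))
    counts = {1: Counter(), 2: Counter()}
    for xi, yi in zip(x, y):
        counts[yi][xi] += 1
    n1 = sum(counts[1].values()) + smooth * len(levels)
    n2 = sum(counts[2].values()) + smooth * len(levels)
    iv = 0.0
    for k in levels:
        p1 = (counts[1][k] + smooth) / n1  # estimate of P(X = k | Y = 1)
        p2 = (counts[2][k] + smooth) / n2  # estimate of P(X = k | Y = 2)
        iv += (p1 - p2) * math.log(p1 / p2)
    return iv

def screen_top_d(columns, y, d):
    """Return the names of the d covariates with the largest estimated IV."""
    scores = {name: information_value(col, y) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:d]

# Toy usage with made-up data: "b" is associated with y, "a" is not.
y = [1, 1, 1, 1, 2, 2, 2, 2]
columns = {"a": [1, 2, 1, 2, 1, 2, 1, 2], "b": [1, 1, 1, 2, 2, 2, 2, 1]}
selected = screen_top_d(columns, y, d=1)
```

Each term of the sum is non-negative because the difference and the log ratio always share the same sign, so the estimated index is zero only when the two estimated conditional distributions coincide.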

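If X has missing data, then $IV(X)$ cannot be estimated directly. Let $\delta$ be the missingness indicator: $\delta = 1$ if X is observed and $\delta = 0$ if X is missing. Define a new categorical covariate $X^*$ that equals X when $\delta = 1$ and takes an additional value $K + 1$ when $\delta = 0$. What is the relationship between $IV(X^*)$ and $IV(X)$? We have the following two conclusions:

  1. If $P(\delta = 1 \mid X, Y) = P(\delta = 1)$, then $IV(X^*) = P(\delta = 1)\, IV(X)$.

  2. If $P(\delta = 1 \mid X, Y) = P(\delta = 1 \mid X)$, then $IV(X^*) \le IV(X)$.

The first conclusion tells us that if X is missing completely at random, we can use $IV(X^*)$ to recover $IV(X)$. The second conclusion tells us that when the missing probability of X depends only on X itself, $IV(X^*)$ will always underestimate $IV(X)$. So screening based on $IV(X^*)$ is not likely to mistakenly select inactive covariates. However, it may miss some active covariates.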
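The two conclusions can be checked numerically at the population level. The sketch below is an illustration under stated assumptions: the information value has the standard form used in credit scoring, missing values of X are coded as one extra category, and all probabilities are hypothetical numbers chosen for the example. Under missingness completely at random the index of the augmented covariate is the original index scaled by the observation probability; when missingness depends on X only, it is no larger than the original.

```python
import math

def iv(p1, p2):
    """IV from two conditional pmfs: p1[k] = P(X=k|Y=1), p2[k] = P(X=k|Y=2)."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p1, p2))

def with_missing_category(p, obs_prob):
    """Conditional pmf of the augmented covariate given Y.

    obs_prob[k] = P(X observed | X = k, Y), assumed not to depend on Y.
    The last entry is the probability of the extra 'missing' category.
    """
    kept = [pk * ok for pk, ok in zip(p, obs_prob)]
    return kept + [1.0 - sum(kept)]

p1 = [0.5, 0.3, 0.2]  # hypothetical P(X = k | Y = 1)
p2 = [0.2, 0.3, 0.5]  # hypothetical P(X = k | Y = 2)
iv_full = iv(p1, p2)

# Case 1: missing completely at random, observation probability 0.7.
mcar = [0.7, 0.7, 0.7]
iv_mcar = iv(with_missing_category(p1, mcar), with_missing_category(p2, mcar))

# Case 2: the observation probability depends on X only.
by_x = [0.9, 0.6, 0.3]
iv_by_x = iv(with_missing_category(p1, by_x), with_missing_category(p2, by_x))

print(iv_full, iv_mcar, iv_by_x)
```

In case 1 the extra category has the same probability in both classes, so it contributes nothing and every other term is scaled by 0.7. Case 2 reflects the general fact that passing X through a channel that ignores Y cannot increase a divergence between the two class-conditional distributions.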

Other than treating missingness as an additional category of X, we may also consider the commonly used available case (AC) method; that is, we use only the non-missing data of X to estimate $IV(X)$. Denote by $IV_{AC}(X)$ the AC analog of $IV(X)$, obtained by replacing each $P(X = k \mid Y = r)$ with $P(X = k \mid Y = r, \delta = 1)$, where $\delta = 1$ indicates that X is observed. In what situations can we use $IV_{AC}(X)$ to recover $IV(X)$? Consider two covariates $X_1$ and $X_2$, where $X_1$ has missing data, with missingness indicator $\delta$, and $X_2$ is always observed. Under the following two conditions:

(C1) $P(\delta = 1 \mid X_1, X_2, Y) = P(\delta = 1 \mid X_2, Y)$,

(C2) $P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$,

we have $IV_{AC}(X_1) = IV(X_1)$. Condition (C1) means that $X_1$ is missing at random. Condition (C2) means that $X_1$ and $X_2$ are conditionally (on Y) independent, which is similar to the condition required by naive Bayes. Missing at random may be a reasonable assumption in many situations, but conditional independence usually does not hold. However, the AC method still works well in several simulations we conducted even when (C2) does not hold, just as naive Bayes works well in many situations even when the conditional independence condition is violated. Here we only discuss two covariates $X_1$ and $X_2$, but all the conditions and conclusions can be extended to two groups of covariates, in which one group has missing data and the other group is always observed.
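To see why the available case index can coincide with the full-data index, the following sketch computes the conditional distribution of the partially observed covariate among observed cases. It assumes that (C1) says the observation probability depends only on the always-observed covariate and Y, and that (C2) says the two covariates are independent given Y. All numbers and function names are hypothetical.

```python
import math

def iv(p1, p2):
    """IV from two conditional pmfs of the covariate, one per class of Y."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p1, p2))

def available_case_pmf(px1, px2, obs_prob):
    """P(X1 = k | Y = r, X1 observed) within one class r.

    px1[k]      = P(X1 = k | Y = r)
    px2[m]      = P(X2 = m | Y = r), with X1 and X2 independent given Y (C2)
    obs_prob[m] = P(X1 observed | X2 = m, Y = r), free of X1 (C1)
    """
    joint_obs = [sum(pk * pm * om for pm, om in zip(px2, obs_prob)) for pk in px1]
    total = sum(joint_obs)  # P(X1 observed | Y = r)
    return [v / total for v in joint_obs]

# Hypothetical class-conditional distributions and observation probabilities.
p1_x1, p1_x2, obs1 = [0.5, 0.3, 0.2], [0.6, 0.4], [0.9, 0.5]  # class Y = 1
p2_x1, p2_x2, obs2 = [0.2, 0.3, 0.5], [0.3, 0.7], [0.4, 0.8]  # class Y = 2

iv_full = iv(p1_x1, p2_x1)
iv_ac = iv(available_case_pmf(p1_x1, p1_x2, obs1),
           available_case_pmf(p2_x1, p2_x2, obs2))
print(iv_full, iv_ac)
```

Under these two assumptions the factor involving the second covariate is the same for every category of the first and cancels after normalisation, so the available case distribution equals the full-data one even though the observation rate differs between the two classes.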

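Finally, we propose a method that is more widely applicable than the two methods discussed above, which are based on the missing-category index $IV(X^*)$ or the available case index $IV_{AC}(X)$. Denote by $U = (U_1, \dots, U_p)$ the covariates with missing data and by $V = (V_1, \dots, V_q)$ the covariates without missing data. For each covariate $U_j$ with missing data, the missingness indicator is denoted by $\delta_j$, $j = 1, \dots, p$. We assume that

$$P(\delta_j = 1 \mid U, V, Y) = P(\delta_j = 1 \mid V_{M_j}, Y),$$

where $V_{M_j} = \{V_l : l \in M_j\}$ is a small subset of $V$ and $M_j \subset \{1, \dots, q\}$, i.e. $U_j$ is missing at random and the missing probability depends only on Y and a small subset of covariates that are always observed. Then

$$P(U_j = k \mid Y = r) = \sum_{v} P(U_j = k \mid Y = r, V_{M_j} = v, \delta_j = 1)\, P(V_{M_j} = v \mid Y = r) \quad (2)$$

can be easily estimated if $M_j$ is known, where r = 1 or 2, and the summation is over all possible values of $V_{M_j}$. We can then estimate $IV(U_j)$. We propose a two-step screening procedure as follows:

Step 1: Apply APC-SIS or IG-SIS to the fully observed data on $(\delta_j, V, Y)$ to obtain an estimate $\hat{M}_j$ of $M_j$.

Step 2: Estimate $P(U_j = k \mid Y = r)$ based on $\hat{M}_j$ and Equation (2). Further estimate $IV(U_j)$ based on Equation (1). $IV(V_l)$ can be estimated in the usual way since $V_l$ is fully observed. Then we can select the covariates with the largest d estimated IV values.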

Under some regularity conditions, this screening procedure has the sure screening property.
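The second step of the procedure can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes that the adjustment formula averages the conditional distribution of the partially observed covariate among observed cases, within each cell of the selected always-observed covariates and class of Y, weighted by the cell frequencies within that class. The function name, the use of `None` to mark missing entries, and the handling of empty cells are assumptions made for the example. The selected covariates are taken as given, as if they came from the first step.

```python
from collections import Counter, defaultdict

def mar_adjusted_pmf(u, v_sub, y, r, levels):
    """Estimate P(U = k | Y = r) for each k in `levels` under missing at random.

    u      : values of the covariate with missing data (None marks missing)
    v_sub  : tuples of the always-observed covariates selected in Step 1
    y      : binary response coded 1 or 2
    """
    n_r = 0
    v_counts = Counter()             # counts of each cell v within class r
    u_counts = defaultdict(Counter)  # counts of U = k among observed cases, by v
    for ui, vi, yi in zip(u, v_sub, y):
        if yi != r:
            continue
        n_r += 1
        v_counts[vi] += 1
        if ui is not None:
            u_counts[vi][ui] += 1
    pmf = {k: 0.0 for k in levels}
    for v, n_v in v_counts.items():
        n_obs = sum(u_counts[v].values())
        if n_obs == 0:
            continue  # simplification: a cell with no observed U is dropped
        for k in levels:
            pmf[k] += (u_counts[v][k] / n_obs) * (n_v / n_r)
    return pmf

# Toy data: U is missing more often when the selected covariate equals 2.
u = [1, 1, None, 2, None, None]
v_sub = [(1,), (1,), (1,), (2,), (2,), (2,)]
y = [1, 1, 1, 1, 1, 1]
pmf = mar_adjusted_pmf(u, v_sub, y, r=1, levels=[1, 2])
```

In this toy data the available cases alone would put two thirds of the mass on category 1, whereas the adjusted estimate reweights the two cells by their frequencies. The adjusted estimates for r = 1 and r = 2 can then be plugged into the information value formula.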

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Fang Fang

Fang Fang is an associate professor at the Key Laboratory of Advanced Theory and Application in Statistics and Data Science (East China Normal University), Ministry of Education, and the School of Statistics, East China Normal University. He is also an Associate Editor of the Journal of Nonparametric Statistics. His research interests mainly focus on missing data, model averaging and statistical learning.

Lyu Ni

Lyu Ni is a Ph.D. candidate in statistics at the School of Statistics, East China Normal University. Her research areas are feature screening in high-dimensional data and missing data analysis.

References

  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
  • Ni, L., & Fang, F. (2016). Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification. Journal of Nonparametric Statistics, 28, 515–530. doi: 10.1080/10485252.2016.1167206
  • Huang, D. Y., Li, R. Z., & Wang, H. S. (2014). Feature screening for ultrahigh dimensional categorical data with applications. Journal of Business & Economic Statistics, 32, 237–244. doi: 10.1080/07350015.2013.863158
  • Lai, P., Liu, Y. M., Liu, Z., & Wan, Y. (2017). Model free feature screening for ultrahigh dimensional data with responses missing at random. Computational Statistics & Data Analysis, 105, 201–216. doi: 10.1016/j.csda.2016.08.008
  • Ni, L., Fang, F., & Wan, F. J. (2017). Adjusted Pearson chi-square feature screening for multi-classification with ultrahigh dimensional data. Metrika, 80, 805–828. doi: 10.1007/s00184-017-0629-9
  • Wang, Q. H., & Li, Y. J. (2018). How to make model-free feature screening approaches for full data applicable to the case of missing response? Scandinavian Journal of Statistics, 45, 324–346. doi: 10.1111/sjos.12290
