ABSTRACT
Feature screening with missing data is a critical problem but has not been well addressed in the literature. In this discussion we propose a new screening index based on “information value” and apply it to feature screening with missing covariates.
We thank Tang and Ju for their extensive review of methods for dealing with a challenging statistical problem: missing data. The methods discussed in the paper mainly focus on low-dimensional data. In the discussion part, the paper mentions feature screening with missing data, which is a critical research topic but has not been well addressed in the literature.
Several works have addressed feature screening when the response is missing at random. For example, Lai, Liu, Liu, and Wan (2017) used an inverse probability weighting method to recover the screening indices when missing data exist. Wang and Li (2018) proposed a missing indicator imputation screening procedure by noting the fact that the set of active covariates for the response is a subset of the active covariates for the product of the response and the missingness indicator. There are two possible directions in which to further develop feature screening methods with missing data. The first is to consider screening with nonignorable missing response, which could be quite challenging. The second is to consider screening with missing covariates.
Missing covariate data commonly arise in health and biomedical studies such as clinical trials, observational studies, environmental studies, and health surveys. How to conduct feature screening when some covariates are missing is an interesting problem. While it could be difficult to solve this problem in general, there are special cases in which screening with missing covariates is possible. Here we discuss one special case: the response Y is binary and all the covariates are categorical. If there are no missing data, the PC-SIS of Huang, Li, and Wang (2014), the IG-SIS of Ni and Fang (2016) and the APC-SIS of Ni, Fang, and Wan (2017) all have the sure screening property (Fan & Lv, 2008). Other than these methods, we propose a new screening index, the "information value" (IV), defined as

IV(X) = Σ_{k=1}^{K} [P(X = k | Y = 1) − P(X = k | Y = 2)] log{ P(X = k | Y = 1) / P(X = k | Y = 2) },   (1)

where Y is the binary response with values 1 or 2 and X is a categorical covariate with values 1, …, K. It is easy to see that IV(X) = 0 if and only if X and Y are statistically independent. If we select the covariates with the largest d estimated IV values as the active covariates, it is not hard to show that this screening procedure has the sure screening property.
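As a concrete illustration, a minimal empirical estimator of the IV index in (1) and the screening rule based on it might look as follows; the function names and the small smoothing constant eps (which avoids log 0 in finite samples) are our own choices, not from the paper:

```python
import numpy as np

def information_value(x, y, eps=1e-12):
    """Empirical IV of a categorical covariate x for a binary response y
    coded 1 or 2; eps avoids log(0) in finite samples (our own choice)."""
    x, y = np.asarray(x), np.asarray(y)
    iv = 0.0
    for k in np.unique(x):
        p1 = np.mean(x[y == 1] == k)   # estimate of P(X = k | Y = 1)
        p2 = np.mean(x[y == 2] == k)   # estimate of P(X = k | Y = 2)
        iv += (p1 - p2) * np.log((p1 + eps) / (p2 + eps))
    return iv

def iv_screen(X, y, d):
    """Keep the d columns of X with the largest estimated IV values."""
    scores = [information_value(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:d]   # indices of the d largest IVs
```

With independent x and y the estimated IV is (numerically) zero, and it grows as the conditional distributions of X given Y = 1 and Y = 2 separate.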
If X has missing data, then IV(X) cannot be estimated directly. Let δ be the missingness indicator: δ = 1 if X is observed and δ = 0 if X is missing. Define a new categorical covariate X* by X* = X if δ = 1 and X* = K + 1 if δ = 0, so that the missing values form an extra category. What is the relationship between IV(X*) and IV(X)? Actually we have the following two conclusions:
If δ is independent of (X, Y), then IV(X*) = P(δ = 1) · IV(X).
If P(δ = 1 | X, Y) = P(δ = 1 | X), then IV(X*) ≤ IV(X).
The first conclusion tells us that if X is missing completely at random, we can use IV(X*) to recover IV(X). The second conclusion tells us that when the missing probability of X depends only on X itself, IV(X*) will always underestimate IV(X). So screening by IV(X*) is not likely to mistakenly select inactive covariates, although it may miss some active covariates.
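The two conclusions can be checked numerically at the population level; all probabilities below are our own illustrative choices:

```python
import numpy as np

def iv_from_probs(p1, p2, eps=1e-12):
    """IV computed from the conditional pmfs P(X = . | Y = 1), P(X = . | Y = 2)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sum((p1 - p2) * np.log((p1 + eps) / (p2 + eps))))

def star_pmf(p, miss):
    """pmf of X* given Y: each observed level k keeps mass p_k * P(delta = 1 | X = k),
    and a new level K + 1 collects the missing mass."""
    return np.append(p * (1 - miss), np.sum(p * miss))

# X takes values {1, 2}; conditional pmfs chosen for illustration:
p1 = np.array([0.7, 0.3])          # P(X = k | Y = 1)
p2 = np.array([0.2, 0.8])          # P(X = k | Y = 2)
iv_full = iv_from_probs(p1, p2)

# Conclusion 1 (MCAR): constant missing probability pi = 0.2, so that
# IV(X*) = P(delta = 1) * IV(X) and IV(X) is recoverable from IV(X*).
pi = 0.2
iv_mcar = iv_from_probs(star_pmf(p1, pi), star_pmf(p2, pi))

# Conclusion 2: missingness depends on X only, P(delta = 0 | X = k) below,
# and IV(X*) <= IV(X): the extra category never inflates the index.
miss = np.array([0.3, 0.1])
iv_star = iv_from_probs(star_pmf(p1, miss), star_pmf(p2, miss))
```

In this example iv_mcar equals (1 − 0.2) · iv_full exactly, while iv_star falls strictly below iv_full.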
Other than considering IV(X*), we may also consider the commonly used AC (available case) method: only the non-missing data of X are used to estimate IV(X). Denote IV_AC(X_1) as the AC analogue of IV(X_1). In what situation can we use IV_AC(X_1) to recover IV(X_1)? Consider two covariates X_1 and X_2, where X_1 has missing data (with missingness indicator δ_1) and X_2 is always observed. Under the following two conditions:
(C1) P(δ_1 = 1 | X_1, X_2, Y) = P(δ_1 = 1 | X_2, Y),
(C2) P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y),
we have IV_AC(X_1) = IV(X_1). Condition (C1) means X_1 is missing at random. Condition (C2) means X_1 and X_2 are conditionally (on Y) independent, which is similar to the condition required by naive Bayes. Missing at random may be a reasonable assumption in many situations, but conditional independence usually does not hold. However, in several simulations we conducted, this AC method still works well even when (C2) does not hold, just as naive Bayes works well in many situations even when the conditional independence condition is violated. Here we only discuss two covariates X_1 and X_2, but all the conditions and conclusions can be extended to two groups of covariates, in which one group has missing data and the other group is always observed.
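Why (C1) and (C2) suffice can be seen at the population level: under both conditions the AC conditionals P(X_1 = k | Y = r, δ_1 = 1) coincide with the full-data conditionals P(X_1 = k | Y = r), hence the two IV values agree. A small numerical sketch (all probabilities our own illustrative choices):

```python
import numpy as np

# X1, X2 and Y all take values in {1, 2}.
pX1 = {1: np.array([0.6, 0.4]), 2: np.array([0.3, 0.7])}   # P(X1 = . | Y = r)
pX2 = {1: np.array([0.5, 0.5]), 2: np.array([0.8, 0.2])}   # P(X2 = . | Y = r)
# (C1): observation probability of X1 depends only on (X2, Y); indexed by Y,
# with one entry per value of X2.
pobs = {1: np.array([0.9, 0.6]), 2: np.array([0.7, 0.5])}

ac_probs = {}
for r in (1, 2):
    # Under (C2), X1 and X2 are independent given Y, so
    # P(X1 = k, delta1 = 1 | Y = r)
    #   = P(X1 = k | Y = r) * sum_s P(X2 = s | Y = r) P(delta1 = 1 | X2 = s, Y = r).
    num = pX1[r] * np.sum(pX2[r] * pobs[r])
    ac_probs[r] = num / num.sum()   # P(X1 = k | Y = r, delta1 = 1)
```

The factor involving X_2 cancels in the normalisation, so ac_probs[r] equals pX1[r] for both r, and IV_AC(X_1) = IV(X_1) follows.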
Finally, we propose a method based on IV that is more applicable than the two methods discussed above. Denote X_mis = (X_1, …, X_{p_1}) as the covariates with missing data and X_obs as the covariates without missing data. For each missing covariate X_j, the missingness indicator is denoted as δ_j, j = 1, …, p_1. We assume that

P(δ_j = 1 | X_mis, X_obs, Y) = P(δ_j = 1 | X_{S_j}, Y),

where X_{S_j} is a small subset of X_obs, i.e. X_j is missing at random and the missing probability only depends on Y and a small subset of covariates that are always observed. Then

P(X_j = k | Y = r) = Σ_x P(X_j = k | X_{S_j} = x, Y = r, δ_j = 1) P(X_{S_j} = x | Y = r)   (2)

can be easily estimated if S_j is known, where r = 1 or 2 and the summation is over all possible values x of X_{S_j}. Then further we can estimate IV(X_j). We propose a two-step screening procedure as follows:
Step 1: Apply APC-SIS or IG-SIS with δ_j as the response and (X_obs, Y) as the covariates to obtain an estimate Ŝ_j of S_j, j = 1, …, p_1.
Step 2: Estimate P(X_j = k | Y = r) based on Ŝ_j and Equation (2), and further estimate IV(X_j) based on Equation (1), j = 1, …, p_1. The IV values of the covariates in X_obs can be estimated regularly since X_obs is fully observed. Then we can select the covariates with the largest d estimated IV values.
Under some regularity conditions, this screening procedure has the sure screening property.
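A minimal sketch of Step 2 for a single missing covariate X_j, assuming Ŝ_j contains a single always-observed covariate; the function name, the nan coding of missing values and the eps smoothing constant are ours, not from the paper:

```python
import numpy as np

def iv_two_step(xj, xs, y, eps=1e-12):
    """Estimate IV(Xj) when Xj has missing values (coded np.nan), assuming
    Xj is missing at random given (Y, Xs) with Xs fully observed, by
    combining equations (1) and (2)."""
    xj = np.asarray(xj, dtype=float)   # nan marks a missing value
    xs, y = np.asarray(xs), np.asarray(y)
    obs = ~np.isnan(xj)
    iv = 0.0
    for k in np.unique(xj[obs]):
        p = {}
        for r in (1, 2):
            total = 0.0
            for s in np.unique(xs):
                cell = obs & (y == r) & (xs == s)        # complete cases in (r, s)
                p_k = np.mean(xj[cell] == k) if cell.any() else 0.0
                p_s = np.mean(xs[y == r] == s)           # P(Xs = s | Y = r): estimable, Xs fully observed
                total += p_k * p_s                       # one summand of equation (2)
            p[r] = total
        iv += (p[1] - p[2]) * np.log((p[1] + eps) / (p[2] + eps))
    return iv
```

With no missing values, the sum in equation (2) reduces to the law of total probability, so the estimate coincides with the ordinary empirical IV of X_j.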
Disclosure statement
No potential conflict of interest was reported by the authors.
Additional information
Notes on contributors
Fang Fang
Fang Fang is an associate professor at Key Laboratory of Advanced Theory and Application in Statistics and Data Science (East China Normal University), Ministry of Education, and School of Statistics, East China Normal University. He is also an Associate Editor of Journal of Nonparametric Statistics. His research interests mainly focus on missing data, model averaging and statistical learning.
Lyu Ni
Lyu Ni is a Ph.D. candidate of statistics at the School of Statistics, East China Normal University. Her research areas are feature screening in high-dimensional data and missing data analysis.
References
- Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
- Ni, L., & Fang, F. (2016). Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification. Journal of Nonparametric Statistics, 28, 515–530. doi: 10.1080/10485252.2016.1167206
- Huang, D. Y., Li, R. Z., & Wang, H. S. (2014). Feature screening for ultrahigh dimensional categorical data with applications. Journal of Business & Economic Statistics, 32, 237–244. doi: 10.1080/07350015.2013.863158
- Lai, P., Liu, Y. M., Liu, Z., & Wan, Y. (2017). Model free feature screening for ultrahigh dimensional data with responses missing at random. Computational Statistics & Data Analysis, 105, 201–216. doi: 10.1016/j.csda.2016.08.008
- Ni, L., Fang, F., & Wan, F. J. (2017). Adjusted Pearson chi-square feature screening for multi-classification with ultrahigh dimensional data. Metrika, 80, 805–828. doi: 10.1007/s00184-017-0629-9
- Wang, Q. H., & Li, Y. J. (2018). How to make model-free feature screening approaches for full data applicable to the case of missing response? Scandinavian Journal of Statistics, 45, 324–346. doi: 10.1111/sjos.12290