
Group screening for ultra-high-dimensional feature under linear model

Pages 43-54 | Received 18 Jul 2018, Accepted 17 Jun 2019, Published online: 04 Jul 2019

Abstract

Ultra-high-dimensional data with grouping structures arise naturally in many contemporary statistical problems, such as genome-wide association studies and multi-factor analysis of variance (ANOVA). To address this issue, we propose a group screening method to perform variable selection on groups of variables in linear models. The method is based on working independence, and a sure screening property is established for our approach. To enhance the finite sample performance, a data-driven thresholding rule and a two-stage iterative procedure are developed. To the best of our knowledge, screening for grouped variables has rarely appeared in the literature, and the proposed method can be regarded as an important and non-trivial extension of screening for individual variables. An extensive simulation study and a real data analysis demonstrate its finite sample performance.

AMS 2000 Subject Classifications:

1. Introduction

Nowadays, grouped predictors arise naturally in many regression problems. That is, we are interested in finding the predictors relevant to modelling the response variable, where each predictor may be represented by a group of indicator variables or a set of basis functions. Grouping structures can be introduced into a regression model in the hope that prior knowledge about the predictors is used to the full. Thus, grouping structure problems have become increasingly important in various research fields. One common example is the representation of a multi-level analysis-of-variance (ANOVA) model as a regression with groups of derived input variables; the aim of ANOVA is often to select relevant main factors and interactions, that is, to select groups of derived input variables. Another example is the additive model with nonparametric components, where each component can be expressed as a linear combination of a set of basis functions of the original predictors. In both cases, variable selection amounts to the selection of groups of variables rather than individual derived variables.

Using penalised methods, many researchers have considered group selection problems in various parametric and nonparametric regression models. These articles include, but are not limited to, the following. Bakin (Citation1999) proposed the group LASSO in his doctoral dissertation. Yuan and Lin (Citation2006) further studied the group LASSO and related group selection methods, such as the group LARS and the group Garrote, and proposed the corresponding algorithms, but they did not give any asymptotic properties of the group LASSO. Wei and Huang (Citation2010) showed that, under a generalised sparsity condition and the sparse Riesz condition proposed by Zhang and Huang (Citation2008), together with some regularity conditions, the group LASSO selects a model of the same order as the underlying model. They also established the asymptotic properties of the adaptive group LASSO, which can correctly select groups with probability tending to one. Under generalised linear models, Breheny and Huang (Citation2009) established a general framework for simultaneous group and individual variable selection, or bi-level selection, together with the corresponding local coordinate descent algorithm. In addition to the group LASSO, many authors have proposed other methods for various parametric models. For example, Huang, Ma, Xie, and Zhang (Citation2009) showed that simultaneous group and individual variable selection can be conducted by a group bridge method, which correctly selects relevant groups with probability tending to one. Moreover, Zhao, Rocha, and Yu (Citation2009) introduced a quite general composite penalty for group selection by combining different norms to form an intelligent penalty. All these methods are very useful when the number of predictors is moderate, that is, smaller than or comparable with the sample size. However, with the rapid progress of computing power and modern technology for data collection, massive amounts of ultra-high-dimensional data are frequently seen in diverse fields of scientific research. Due to the 'curse of dimensionality', in terms of simultaneous challenges to computational expediency, statistical accuracy and algorithmic stability, the above methods are limited in handling ultra-high-dimensional problems.

In the seminal work of Fan and Lv (Citation2008), a new framework for sure independence screening (SIS) was established. They showed that the method based on Pearson correlation learning possesses a sure screening property for linear regressions. That is, all relevant predictors can be selected with probability tending to one even if the number of predictors $p$ grows much faster than the number of observations $n$, with $\log p = O(n^{\alpha})$ for some $\alpha \in (0, \frac{1}{2})$. Following Fan and Lv (Citation2008), we call this non-polynomial dimensionality or ultra-high dimensionality. Since then, various screening methods, model-based or model-free, have been developed (Fan & Lv, Citation2008; Fan, Samworth, & Wu, Citation2009; Fan & Song, Citation2010; He, Wang, & Hong, Citation2013; Li, Zhong, & Zhu, Citation2012; Shao & Zhang, Citation2014; Wang, Citation2009; Zhao & Li, Citation2012). However, all these screening methods deal with individual variables rather than grouped predictors. To the best of our knowledge, screening methods for grouped predictors are quite limited in the existing literature. Thus it is very important to propose a new screening method based on grouped predictors.

Motivated by the theory of SIS, we consider how to deal with ultra-high-dimensional grouped predictors under the assumption of a linear regression model. With grouping structures, the linear regression model is
(1) $Y=\sum_{j=1}^{J}X_j^{T}\beta_j+\varepsilon$,
where $Y$ is the response variable, $X_j=(X_{j1},\dots,X_{jp_j})^{T}$ is a $p_j\times 1$ random vector representing the $j$th group, $\beta_j=(\beta_{j1},\dots,\beta_{jp_j})^{T}$ is the $p_j\times 1$ parameter vector corresponding to the $j$th group of predictors, and $\varepsilon$ is the random error with mean 0.

Our method is a two-stage approach. First, an efficient screening procedure is employed to reduce the number of group predictors to a moderate order below the sample size, and then existing group selection methods can be used to recover the final sparse model. For the sake of speed and efficiency in screening group predictors, we consider an independence screening method that ranks the magnitudes of marginal estimators based on each grouped predictor. That is, we fit $J$ marginal linear regressions of the response $Y$ against the variables of the $j$th group respectively, and then select the relevant group predictors by a measure of the goodness of fit of each marginal linear regression. Under some mild conditions, we show that there is a significant difference between relevant and irrelevant group predictors in the strength of these marginal utilities, so we can distinguish the active group predictors from the much larger set of inactive ones. Next, existing group selection methods, such as the group LASSO, the group SCAD (Breheny & Huang, Citation2015) and the group MCP (Breheny & Huang, Citation2015), can be used to obtain the final sparse model. We refer to our screening procedure as the Group-SIS, and theoretically establish the sure screening property of our approach. In order to further reduce the false-positive rate, we propose an iterative version of the algorithm, named ISIS-Group-Lasso. To enhance performance and speed up the computation of ISIS-Group-Lasso, a greedy modification of the iterative algorithm, named g-ISIS-Group-Lasso, is also developed. Our simulation studies indicate that ISIS-Group-Lasso and g-ISIS-Group-Lasso significantly outperform competing group selection methods, such as distance correlation-based screening and the group LASSO, especially when the dimensionality is ultra-high.

The rest of the article is organised as follows. In Section 2, we introduce a marginal group SIS in linear regression models. Under some mild conditions, the sure screening property and model selection consistency of the Group-SIS will be established in Section 3. In Section 4, simulation studies and a real data analysis are carried out to assess the performance of our method. Concluding remarks are given in Section 5. All technical proofs for the main theoretical results are given in the Appendix.

2. Group SIS and iterative algorithm

2.1. Marginal linear regression based on grouped predictors

Suppose that we have $n$ random samples from model (1) of the form
(2) $y_i=\sum_{j=1}^{J}x_{ij}^{T}\beta_j+\varepsilon_i,\quad i=1,2,\dots,n$,
in which $x_{ij}=(x_{ij1},\dots,x_{ijp_j})^{T}$, and $y_i$ and $\varepsilon_i$ are scalars. Let $y=(y_1,\dots,y_n)^{T}$, $x_j=(x_{1j},\dots,x_{nj})^{T}$ and $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^{T}$, where $x_j$ is the $n\times p_j$ design matrix corresponding to the $j$th group for each $j=1,2,\dots,J$. Then model (2) can be rewritten as
(3) $y=\sum_{j=1}^{J}x_j\beta_j+\varepsilon$.
For simplicity, the number of variables in each group is uniformly bounded; that is, there exists a positive constant $K$ such that $p_j\le K$ for $j=1,2,\dots,J$. To rapidly select the relevant grouped predictors, we consider the following $J$ marginal linear regressions against the grouped predictors:
(4) $\min_{\gamma_j\in\mathbb{R}^{p_j}}\sum_{i=1}^{n}(y_i-x_{ij}^{T}\gamma_j)^2$,
where $\gamma_j=(\gamma_{j1},\dots,\gamma_{jp_j})^{T}$ is a $p_j$-dimensional vector for each $j=1,2,\dots,J$.

It is easy to see that the minimiser of (4) is $\hat{\gamma}_j=(x_j^{T}x_j)^{-1}x_j^{T}y$. We then define the marginal utility of the $j$th grouped predictor as
(5) $\|\hat{\upsilon}_{nj}\|_n^2\equiv\frac{1}{n}\sum_{i=1}^{n}(\hat{\gamma}_j^{T}x_{ij})^2=\frac{1}{n}y^{T}x_j(x_j^{T}x_j)^{-1}x_j^{T}y.$
We now select a set of relevant grouped predictors as
$\hat{\mathcal{M}}_{\kappa}=\Big\{1\le j\le J:\ \frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge\pi_n\Big\},$
where $\kappa$ is a positive constant and $\pi_n$ is a pre-specified threshold value which will be given later.
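To make the marginal utility concrete, here is a minimal computational sketch in NumPy (the function names group_utility and group_sis, and the use of a least-squares solver, are our own illustrative choices, not part of the paper):

```python
import numpy as np

def group_utility(y, x_j):
    """Marginal utility of one group: (1/n) * y' x_j (x_j' x_j)^{-1} x_j' y."""
    n = len(y)
    # gamma_hat = (x_j' x_j)^{-1} x_j' y, computed via a numerically stable least-squares solve
    gamma_hat, *_ = np.linalg.lstsq(x_j, y, rcond=None)
    fitted = x_j @ gamma_hat
    return fitted @ fitted / n

def group_sis(y, x_groups, pi_n):
    """Group-SIS: keep the groups whose size-adjusted utility exceeds the threshold pi_n."""
    utils = np.array([group_utility(y, x_j) / x_j.shape[1] for x_j in x_groups])
    return np.where(utils >= pi_n)[0], utils
```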

Equivalently, one can define the screening criterion by ranking the residual sum of squares of the corresponding marginal linear regressions. Either way reduces the group dimensionality from $J$ to a moderate size $|\hat{\mathcal{M}}_{\kappa}|$. Here, the pre-specified threshold value is crucial to the screening procedure. If we choose it too small, we may retain many irrelevant grouped predictors in the final model; if we choose it too large, we risk losing some important variables. In short, we should select all of the relevant grouped predictors and control the selected model size simultaneously. In Section 3, we show theoretically that this group screening approach possesses a sure screening property and that the final model size is only of polynomial order. Note that our method reduces to traditional feature screening when each group contains a single variable. In this sense, it can be regarded as a non-trivial extension of individual feature screening.

2.2. Iterative group-SIS algorithm

For ultra-high-dimensional group variable selection problems, we propose a two-stage procedure. First, we apply a sure screening method such as Group-SIS to reduce the number of groups from $J$ to a relatively large scale $d$, where the dimensionality of the selected $d$ groups is below the sample size $n$. Then we can use a lower-dimensional group-wise variable selection procedure, such as the group Lasso, group SCAD or group MCP. In this article, we use the group Lasso penalty as our group selection strategy; other group variable selection methods would also work.

However, as Fan and Lv (Citation2008) point out, such a marginal independence screening method can still suffer from false negatives (i.e., missing some important group predictors that are marginally uncorrelated, but jointly correlated, with the response) and false positives (i.e., selecting some unimportant group predictors that have higher marginal correlation with the response than some important group predictors). Therefore, we propose an iterative framework to enhance the finite sample performance of this screening method. That is, we iteratively alternate between large-scale group screening and moderate-scale group variable selection.

To obtain a data-driven threshold for independence group screening, we extend the random permutation idea of Zhao and Li (Citation2012), which allows a small proportion of inactive variables to enter the model at each screening step. Let $x=(x_1,x_2,\dots,x_J)$ and randomly permute the rows of $x$ to obtain the decoupled data $(\tilde{x},y)$. Based on the randomly decoupled data $(\tilde{x},y)$, in which there is no relationship between the group variables and the response, we compute the values $\|\hat{\upsilon}^{*}_{nj}\|_n^2$, defined analogously to $\|\hat{\upsilon}_{nj}\|_n^2$, for $j=1,2,\dots,J$. These values serve as a baseline for the marginal group screening utilities under the null model (no relationship between group variables and response). To obtain the screening threshold, we choose $\omega_q$ as the $q$th largest of $\{\|\hat{\upsilon}^{*}_{nj}\|_n^2,\ j=1,2,\dots,J\}$. In our simulations, we use $q=1$, namely the largest marginal group screening utility under the null model. For the sake of completeness, our ISIS-Group-Lasso algorithm proceeds as follows.
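A sketch of this permutation-based threshold is given below; it reuses the group_utility helper from the earlier sketch, and the function name permutation_threshold is ours:

```python
import numpy as np

def permutation_threshold(y, x_groups, q=1, rng=None):
    """Data-driven threshold omega_q: the q-th largest size-adjusted marginal group
    utility computed on row-permuted (decoupled) data."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(y))  # one permutation applied to every group decouples X from y
    null_utils = np.array([
        group_utility(y, x_j[perm]) / x_j.shape[1]  # group_utility() as in the sketch above
        for x_j in x_groups
    ])
    return np.sort(null_utils)[-q]  # q-th largest null utility
```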

Step 1. Compute the $J$ marginal utilities $\|\hat{\upsilon}_{nj}\|_n^2$, and choose the initial index subset as $\mathcal{A}_1=\{1\le j\le J:\ \frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge\omega_q\}$.

Step 2. Apply the group Lasso (Breheny & Huang, Citation2009) on the index subset $\mathcal{A}_1$ to obtain a subset $\mathcal{M}_1$. In this step, we choose the regularisation parameter by the Bayesian Information Criterion (BIC).

Step 3. Conditioning on $\mathcal{M}_1$, compute the marginal regression
$\hat{\upsilon}_{nj}=\min_{\gamma_j\in\mathbb{R}^{p_j}}\sum_{i=1}^{n}\Big(y_i-\sum_{\tilde{k}\in\mathcal{M}_1}x_{i\tilde{k}}^{T}\gamma_{\tilde{k}}-x_{ij}^{T}\gamma_j\Big)^2$
for each $j\in\mathcal{M}_1^{c}$. By randomly permuting only the groups not in $\mathcal{M}_1$, we obtain a new index subset $\mathcal{A}_2$ as in Step 1. Apply the group LASSO on the index subset $\mathcal{A}_2\cup\mathcal{M}_1$ to obtain a new subset $\mathcal{M}_2$.

Step 4. Repeat the process until we obtain a final index set $\mathcal{A}_k$ such that $|\mathcal{A}_k|\ge k_0$ or $\mathcal{A}_k=\mathcal{A}_l$ for some $l<k$. A schematic implementation of Steps 1–4 is sketched below.
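The following skeleton, which reuses group_utility and permutation_threshold from the sketches above, shows one possible arrangement of Steps 1–4. The group-wise selector is passed in as a callable because any BIC-tuned group Lasso/SCAD/MCP implementation could be plugged in, and the conditional utility of Step 3 is approximated here by screening the least-squares residuals given the current model; both are our simplifications, not the authors' exact implementation:

```python
import numpy as np

def residual_given(y, x_groups, idx):
    """Least-squares residual of y on the concatenated groups indexed by idx."""
    if not idx:
        return y
    X = np.hstack([x_groups[j] for j in idx])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def isis_group_screen(y, x_groups, group_select, q=1, k0=50, max_iter=10):
    """Skeleton of Steps 1-4; group_select(y, x_groups, idx) -> subset of idx
    stands in for a group Lasso tuned by BIC."""
    J = len(x_groups)

    def screen(r, idx):
        # marginal screening of the groups in idx against the working response r
        thr = permutation_threshold(r, [x_groups[j] for j in idx], q=q)
        u = np.array([group_utility(r, x_groups[j]) / x_groups[j].shape[1] for j in idx])
        return {idx[i] for i in np.where(u >= thr)[0]}

    A = screen(y, list(range(J)))                      # Step 1: initial index subset A_1
    M = set(group_select(y, x_groups, sorted(A)))      # Step 2: group selection gives M_1
    history = [A]
    for _ in range(max_iter):                          # Steps 3-4
        rest = [j for j in range(J) if j not in M]
        if not rest:
            break
        A = screen(residual_given(y, x_groups, sorted(M)), rest) | M
        M = set(group_select(y, x_groups, sorted(A)))
        if len(A) >= k0 or any(A == prev for prev in history):
            break                                      # Step 4: stop if too large or repeating
        history.append(A)
    return sorted(M)
```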

In order to further reduce false positives and speed up computation, we propose a greedy modification to enhance the finite sample performance of the above algorithm. Specifically, we restrict the number of groups recruited in each iterative screening step to be at most $J_0$, a small positive integer, and the procedure stops when no group predictor is recruited. In our simulations, we set $J_0=1$. This greedy version of the ISIS-Group-Lasso algorithm is called g-ISIS-Group-Lasso. When $J_0=1$, the method is connected with forward regression screening (Wang, Citation2009), which selects at most one new group predictor into the model at a time. However, there is a key difference between the two methods: our method includes a deletion step via group selection that can remove multiple group predictors. This makes our procedure more effective, because it is more flexible in recruiting and deleting group predictors. Based on the simulation results in Section 4, g-ISIS-Group-Lasso outperforms the other methods in terms of a lower false-positive rate, a higher percentage of correctly selected models and a smaller model error.

3. Theoretical properties

Before establishing the sure screening property of our method for linear models, we introduce some notation. Denote the Euclidean norm and the sup norm of a vector $\alpha$ by $\|\alpha\|$ and $\|\alpha\|_\infty$, respectively. For any symmetric matrix $A$, let $\|A\|_\infty=\max_{i,j}|A_{ij}|$ be the infinity norm and $\|A\|$ the operator norm, and let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ be the minimum and maximum eigenvalues of $A$. Let $X=(X_1^{T},X_2^{T},\dots,X_J^{T})^{T}$ and $E(XX^{T})=\Sigma$. For each $j=1,2,\dots,J$ and $k=1,2,\dots,p_j$, let $[a,b]$ be the support of $X_{jk}$. Define the index set of the true model $\mathcal{M}_{*}$ by $\mathcal{M}_{*}=\{1\le j\le J:\ \|\beta_j\|\neq 0\}$. To gain theoretical insight into the Group-SIS, we define $\upsilon_j$, the population version of (4), through the minimisation problem $\min E(Y-\upsilon_j)^2\equiv\min_{\gamma_j\in\mathbb{R}^{p_j}}E(Y-X_j^{T}\gamma_j)^2$ with respect to $\gamma_j\in\mathbb{R}^{p_j}$. Then $\upsilon_j=X_j^{T}(EX_jX_j^{T})^{-1}EX_jY$. Similar to (5), we define $\|\upsilon_j\|^2\equiv E\big(X_j^{T}(EX_jX_j^{T})^{-1}EX_jY\big)^2=(EX_jY)^{T}(EX_jX_j^{T})^{-1}(EX_jY)$. Next we collect the technical assumptions needed to establish the sure screening property of our group screening method.

  (i) $\frac{1}{p_j}\min_{j\in\mathcal{M}_{*}}\|\upsilon_j\|^2\ge c\,n^{-\kappa}$ for some $0<\kappa<\frac{1}{2}$ and $c>0$.

  (ii) $\big|\sum_{j=1}^{J}X_j^{T}\beta_j\big|<M_1$ for some constant $M_1>0$.

  (iii) For any $B>0$ and $i=1,2,\dots,n$, there is a positive constant $M_2$ such that $E[\exp\{B|\varepsilon_i|\}]<M_2$.

  (iv) For $j=1,2,\dots,J$, the eigenvalues of $\Sigma_j=EX_jX_j^{T}$ are bounded away from zero and infinity; that is, there are positive constants $\tau_1$ and $\tau_2$ such that $0<\tau_1\le\lambda_{\min}(\Sigma_j)\le\lambda_{\max}(\Sigma_j)\le\tau_2<\infty$.

Condition (i) specifies the minimum signal of the relevant grouped predictors; that is, the magnitudes of the marginal utilities of the grouped predictors preserve the non-sparsity signal of the true model, with the rate $n^{-\kappa}$ controlling how fast the minimum signal may vanish. Such a condition is common in the screening literature and is important because it guarantees that the marginal utilities carry information about the relevant covariates in the active set. Conditions (ii) and (iii) are two mild conditions needed for applying Bernstein's inequality (Van der Vaart & Wellner, Citation1996). Because we allow $|\mathcal{M}_{*}|$ to increase with $n$, condition (ii) ensures the boundedness of $\sum_{j=1}^{J}X_j^{T}\beta_j$. Condition (iv) is also easy to satisfy, since the number of variables in each grouped predictor is small.

Remark 3.1

The above assumptions serve only to help us understand the new group screening methodology. These conditions are imposed to facilitate the technical proofs, and relaxing them is an interesting topic for future research.

The following Theorem 3.1 provides the sure screening property of our group screening method.

Theorem 3.1

Suppose that conditions (i)–(iv) hold. Then there exists a constant $c_1$ such that
$P\{\mathcal{M}_{*}\subset\hat{\mathcal{M}}_{\kappa}\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$

Theorem 3.1 indicates that we can select all the relevant grouped predictors with probability tending to 1; the key to its proof is the uniform consistency of $\|\hat{\upsilon}_{nj}\|_n^2$ for $\|\upsilon_j\|^2$. Denote $\sum_{j=1}^{J}p_j=p$. Since $p_j\le K$ uniformly, the dimensionality that can be handled is as high as $\log p=o(n^{1-2\kappa})$. That is, under some mild conditions, our method has the sure screening property and can reduce the exponentially growing dimension $p$ to a relatively moderate scale, to which group variable selection is then applied. We emphasise that $\kappa$ is very important to our screening procedure: the smaller $\kappa$ is, the higher the dimensionality our method can handle.

On the other hand, although we can select the relevant grouped predictors with probability tending to 1, the cardinality of $\hat{\mathcal{M}}_{\kappa}$ may be relatively large compared with the sample size; that is, many unimportant grouped predictors may remain in the selected model. Thus controlling the false-positive rate is also necessary for our method. From a practical point of view, we have proposed an iterative algorithm in Section 2.2 to enhance the performance of our method in terms of false selection rates. Under the same conditions as in Theorem 3.1, we show theoretically that the size of the selected model is at most of order $n^{\kappa}\lambda_{\max}(\Sigma)$ with probability tending to one exponentially fast.

Theorem 3.2

Suppose that conditions (i)–(iv) hold. Then there exists a positive constant $c_0$ such that
$P\{|\hat{\mathcal{M}}_{\kappa}|\le c_0 n^{\kappa}\lambda_{\max}(\Sigma)\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$

Theorem 3.2 shows that, with probability tending to 1, the size of the model selected by our procedure is of polynomial order whenever $\lambda_{\max}(\Sigma)$ is of polynomial order. This is crucial for the subsequent group selection stage and makes our two-stage approach much better than traditional group variable selection methods, because there is no guarantee that existing group selection methods can select the relevant grouped predictors consistently when too many irrelevant grouped predictors are present. Thus this theorem, together with Theorem 3.1, implies that we can select a model that includes all relevant grouped predictors and only a small number of irrelevant ones with high probability.

4. Numerical studies

4.1. Simulation results

In this section, we carry out simulation studies to demonstrate the finite sample performance of the group screening methods described in Section 2. We consider two group-size scenarios: in the first, the group sizes are equal; in the second, the group sizes vary. In our simulations, all groups have size 5 or 3. We set the sample size $n=200$, and three configurations with $J=200$, 400, 1000 groups are considered for generating the covariates $(x_1,x_2,\dots,x_J)$. For example, when $J=200$ and the group size is 5, the final predictor matrix has $p=1000$ variables. To gauge the difficulty of the simulation models, different signal-to-noise ratios (SNR) are used in Examples 4.1–4.4, where $\mathrm{SNR}=\operatorname{Var}\big(\sum_{j=1}^{J}X_j^{T}\beta_j\big)/\operatorname{Var}(\varepsilon)$. Obviously, the larger the SNR, the higher the probability that our group screening method selects the relevant groups. In all examples, the simulation results are based on 200 replications for each parameter setup.
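Since the difficulty of each example is calibrated through the SNR, one simple way to choose $\sigma$ (our own convenience, not prescribed by the paper) is to match a target SNR on a simulated signal:

```python
import numpy as np

def sigma_for_snr(signal, target_snr):
    """Return sigma so that Var(signal) / Var(sigma * eps) equals target_snr,
    where signal holds simulated values of sum_j X_j' beta_j and eps is standard normal."""
    return np.sqrt(np.var(signal) / target_snr)
```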

To further explore the finite sample performance of our methods, we create some unimportant group variables that are highly correlated with the response through their association with the important group variables. The correlation between group variables is specified as follows.

The group vector is $X=(X_1^{T},\dots,X_J^{T})^{T}$, where $X_j=(X_{j1},\dots,X_{jp_j})^{T}$, $j=1,2,\dots,J$. In the following four examples, two different group sizes and three configurations of the number of groups are considered. For illustration, suppose $p_j=5$ for each $j=1,\dots,J$. To generate $X$, we first simulate $5J$ random variables $T_1,\dots,T_{5J}$ independently from $N(0,1)$. Then $Z_1,\dots,Z_{50}$ are simulated from a multivariate normal distribution with mean 0 and $\operatorname{Cov}(Z_{j_1},Z_{j_2})=0.6^{|j_1-j_2|}$. For $k=1,\dots,5$, the group variables $X_{jk}$ are generated as
$X_{jk}=\begin{cases}\big(Z_j+T_{5(j-1)+k}\big)/\sqrt{2}, & j=1,\dots,50,\\ T_{5(j-1)+k}, & j=51,\dots,J.\end{cases}$
Here, only a few of the group variables are relevant, while most of the components are spurious variables that do not appear in the model but are correlated with the relevant group variables. The random error $\varepsilon_i$ is generated from a standard normal distribution; to ensure that the theoretical SNR is neither too weak nor too strong, $\varepsilon_i$ can be multiplied by a constant $\sigma$.
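A sketch of this covariate-generating mechanism for the equal-group-size case ($p_j=5$) is given below; the $1/\sqrt{2}$ scaling reflects our reading of the construction above, and the function name is ours:

```python
import numpy as np

def simulate_group_covariates(n, J, group_size=5, rho=0.6, rng=None):
    """First 50 groups share latent correlated components Z_j; the remaining groups are pure noise."""
    assert J >= 50, "the construction above assumes at least 50 groups"
    rng = np.random.default_rng(rng)
    T = rng.standard_normal((n, group_size * J))              # T_1,...,T_{5J} ~ iid N(0,1)
    idx = np.arange(50)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])        # Cov(Z_j1, Z_j2) = 0.6^{|j1-j2|}
    Z = rng.multivariate_normal(np.zeros(50), Sigma, size=n)  # (n, 50)
    X = T.copy()
    for j in range(50):
        cols = slice(group_size * j, group_size * (j + 1))
        X[:, cols] = (Z[:, [j]] + T[:, cols]) / np.sqrt(2)    # shared component induces correlation
    # return one (n, p_j) block per group
    return [X[:, group_size * j:group_size * (j + 1)] for j in range(J)]
```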

Extensive simulation studies have been conducted to demonstrate the finite sample performance of our group screening method. For comparison purposes, we also examine distance correlation screening (DC-SIS) (Li et al., Citation2012), a model-free screening method that replaces the Pearson correlation in marginal correlation screening with the distance correlation. For fairness, we first apply DC-SIS to reduce the number of groups to $[n/\log n]$ and then run a group-wise variable selection procedure, the group Lasso, to recover the final model; we call DC-SIS followed by the group Lasso DC-SIS-Group-Lasso. To enhance the performance of DC-SIS, Zhong and Zhu (Citation2015) proposed an iterative version of DC-SIS, named DC-ISIS, which is also included in the comparison; again for fairness, the group Lasso is applied after DC-ISIS, and the combination is referred to as DC-ISIS-Group-Lasso. The performance of the group Lasso alone is also examined. Thus, we have five methods under consideration (ISIS-Group-Lasso, g-ISIS-Group-Lasso, Group-Lasso, DC-SIS-Group-Lasso, DC-ISIS-Group-Lasso). In the following four examples, we report five performance measures: true positives (TP), false positives (FP), the median model size (MEDIAN), the percentage of occasions on which exactly the correct groups are selected (CORRECT) and the model error (ME) (Yuan & Lin, Citation2006).

Example 4.1

Each group consists of five variables, and the number of relevant groups is 4. We generate the response from the linear model $Y=X_1^{T}\beta_1+X_2^{T}\beta_2+X_3^{T}\beta_3+X_4^{T}\beta_4+\sigma\varepsilon$, where $\beta_1=(1.5,1.5,1,1,0.5)^{T}$, $\beta_2=(1.5,0.5,0.5,2,0.5)^{T}$, $\beta_3=(2,1,1.5,1.5,2)^{T}$, $\beta_4=(1,2,2,0.5,1)^{T}$ and $\beta_5=\beta_6=\dots=\beta_J=(0,0,0,0,0)^{T}$.
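Continuing the sketch above, the response of Example 4.1 could be generated as follows (coefficient values as printed; $\sigma$ is left to the user so a desired SNR can be matched):

```python
import numpy as np

def example_4_1_response(x_groups, sigma=1.0, rng=None):
    """Y = X_1'b_1 + ... + X_4'b_4 + sigma * eps with the Example 4.1 coefficients."""
    rng = np.random.default_rng(rng)
    betas = [np.array([1.5, 1.5, 1.0, 1.0, 0.5]),
             np.array([1.5, 0.5, 0.5, 2.0, 0.5]),
             np.array([2.0, 1.0, 1.5, 1.5, 2.0]),
             np.array([1.0, 2.0, 2.0, 0.5, 1.0])]  # all remaining groups have zero coefficients
    signal = sum(x_groups[k] @ betas[k] for k in range(4))
    return signal + sigma * rng.standard_normal(len(signal))
```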

Example 4.2

Similar to Example 4.1, the number of relevant groups is 8, with group size 3. We generate the response from the linear model $Y=\sum_{k=1}^{8}X_k^{T}\beta_k+\sigma\varepsilon$, where $\beta_1=(0.5,2,2)^{T}$, $\beta_2=(1,3,1)^{T}$, $\beta_3=2(1.5,0.5,2)^{T}$, $\beta_4=2(1,1.5,2)^{T}$, $\beta_5=(0.5,2,2)^{T}$, $\beta_6=(1,3,1)^{T}$, $\beta_7=2(1.5,0.5,2)^{T}$, $\beta_8=2(1,1.5,2)^{T}$ and $\beta_9=\beta_{10}=\dots=\beta_J=(0,0,0)^{T}$.

Example 4.3

In this example, the group sizes differ across groups: half of the groups have size 5 and the other half have size 3. The group variables are generated in the same way as in the previous examples. The response variable $Y$ is generated from $Y=\sum_{k=1}^{4}X_k^{T}\beta_k+\sigma\varepsilon$, with regression coefficients $\beta_1=(0.5,0.5,0.5,2,1)^{T}$, $\beta_2=(2,0,1,1.5,1)^{T}$, $\beta_3=(0.5,2,2)^{T}$, $\beta_4=(1,3,1)^{T}$, $\beta_5=\dots=\beta_{0.5(J-4)+4}=(0,0,0,0,0)^{T}$ and $\beta_{0.5(J-4)+5}=\dots=\beta_J=(0,0,0)^{T}$.

Example 4.4

In this example, the group sizes also differ across groups. This is a more difficult case than Example 4.3 because it has eight relevant groups with more varied regression coefficients. There are $0.5J$ groups of size 5 and the other groups have size 3. The group variables are generated in the same way as in Example 4.3. The response variable $Y$ is generated from $Y=\sum_{k=1}^{8}X_k^{T}\beta_k+\sigma\varepsilon$, with regression coefficients $\beta_1=(1.5,1.5,1,1,0.5)^{T}$, $\beta_2=(1.5,0.5,0.5,1.5,0.5)^{T}$, $\beta_3=(1.5,1,1,1,2)^{T}$, $\beta_4=(1,1.5,1.5,0.5,1)^{T}$, $\beta_5=(0.5,2,2)^{T}$, $\beta_6=(1,3,1)^{T}$, $\beta_7=2(0.5,2,2)^{T}$, $\beta_8=2(1,3,1)^{T}$, $\beta_9=\dots=\beta_{0.5(J-8)+8}=(0,0,0,0,0)^{T}$ and $\beta_{0.5(J-8)+9}=\dots=\beta_J=(0,0,0)^{T}$.

Detailed simulation results for Examples 4.1–4.4 are given in Tables 1–4, respectively, and boxplots of the average model size are presented in Figure 1. In all four examples, the relevant groups are almost always selected by the four methods other than DC-SIS-Group-Lasso, which misses groups most of the time. In terms of true positives (TP), the iterative version, DC-ISIS-Group-Lasso, performs well, but at the cost of an increased model size, which leads to a large FP and ME. On the other hand, the number of false-positive groups selected by the group Lasso is much larger than for the other four methods. Compared with the distance correlation-based screening methods, our approaches have better finite sample performance; this may be because those methods are built on a model-free framework, while our screening methods take full advantage of the linear model assumption. In particular, for the greedy modification g-ISIS-Group-Lasso, the size of the final selected model is much smaller than for the other methods in all examples, and consequently g-ISIS-Group-Lasso outperforms its competitors in terms of the percentage of correctly selected models. The simulations also show that the SNR has an important impact on the results. In Examples 4.2 and 4.4 there are eight relevant groups, whereas the other two examples have four, so screening the relevant groups in these two examples is more difficult. Thus, to achieve sure screening, the SNR in these two examples must be much larger than in the other two.

Figure 1. Boxplots of average model sizes for Example 4.1 under different group number.


Table 1. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.1.

Table 2. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.2.

Table 3. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.3.

Table 4. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.4.

4.2. Real example

In this section, we compare ISIS-Group-Lasso, g-ISIS-Group-Lasso and Group-Lasso on the colon data (Alon et al., Citation1999). Alon et al. applied a two-way clustering method to analyse a data set consisting of the expression patterns of different cell types. For these data, we are interested in finding the genes that are related to colon tumours. In the original colon data, 62 samples from colon-cancer patients were analysed with an Affymetrix oligonucleotide Hum6000 array, and the data contain the expression of the 2000 genes with the highest minimal intensity across the 62 tissues, with the genes placed in order of descending minimal intensity. That is, the original data have 2000 numerical variables. For each continuous variable in the additive model, we use five B-spline basis functions to represent its effect, an approach used by Yang and Zou (Citation2015) for solving group-lasso penalised learning problems. Thus we obtain 10,000 predictors in 2000 groups after the basis function expansion. The three methods are used to select the relevant additive components.
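As an illustration of this basis-function expansion, the sketch below builds one B-spline block per gene with scikit-learn's SplineTransformer; the particular knot and degree settings are our own small-basis choice and are not guaranteed to reproduce the five basis functions of Yang and Zou (2015):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def expand_bspline_groups(X, n_knots=3, degree=3):
    """Expand each column of the n x 2000 gene matrix X into a group of B-spline
    basis columns, returning the list of per-gene blocks used as grouped predictors."""
    groups = []
    for j in range(X.shape[1]):
        st = SplineTransformer(n_knots=n_knots, degree=degree, include_bias=True)
        groups.append(st.fit_transform(X[:, [j]]))  # (n, n_basis) block for gene j
    return groups
```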

Before the analysis, all data are standardised so that each variable has zero mean and unit sample variance. To evaluate the performance of the three methods, we use cross-validation and compare the average model size (AMS) and the prediction mean squared error (PE). We randomly partition the data into a training set of 50 observations and a test set of 12 observations; that is, we conduct group screening on the 50 training observations and compute the PE on the 12 test observations. Detailed results based on 100 replications are presented in Table 5, and the boxplot of the average model size is shown in Figure 2. As Table 5 clearly shows, ISIS-Group-Lasso and g-ISIS-Group-Lasso select far fewer genes than Group-Lasso, while also achieving a smaller PE. In conclusion, the proposed iterative group screening approach is very useful in high-dimensional scientific studies: it can select a parsimonious model and reveal interesting relationships between group variables.

Figure 2. Boxplot of average model sizes for colon data analysis.


Table 5. Results of AMS and PE for colon data.

5. Concluding remarks

In this article, we have proposed a marginal group sure screening method for ultra-high-dimensional data. Unlike most of the existing literature, we deal with variables that are naturally grouped. Our group screening method respects the grouping structure in the data and is based on working independence. Theoretically, we establish the sure screening property of this group screening approach. To enhance the finite sample performance, a data-driven thresholding rule and an iterative procedure, ISIS-Group-Lasso, are developed; a greedy modification of the iterative procedure, g-ISIS-Group-Lasso, is also proposed to further reduce false positives. Simulation results show that these two methods perform well in terms of the five performance measures.

This article leaves open the problem of extending ISIS-Group-Lasso and g-ISIS-Group-Lasso from the linear model to generalised linear models and other parametric models. A model-free group screening approach may also be appealing for handling ultra-high-dimensional data more generally, since it avoids the difficult task of specifying the form of a statistical model. These problems are beyond the scope of this article and are interesting topics for future research.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This research was supported by the National Natural Science Foundation of China (11571112), the National Social Science Foundation Key Program (17ZDA091), the Natural Science Fund of the Education Department of Anhui Province (KJ2013B233) and the 111 Project of China (B14019).

Notes on contributors

Yong Niu

Yong Niu is a PhD candidate in the College of Statistics, East China Normal University, Shanghai, China. His research interests include high dimensional data, big data analytics and nonparametric statistics.

Riquan Zhang

Riquan Zhang is a professor and chair of School of Statistics in East China Normal University. His research interests include high dimensional data, big data analytics, functional data analysis, statistical machine learning and nonparametric statistics.

Jicai Liu

Jicai Liu is an associate professor of statistics in the department of mathematics at Shanghai Normal University, China. His research interests include high dimensional data, lifetime data analysis and nonparametric statistics.

Huapeng Li

Huapeng Li is an associate professor of statistics in the school of mathematics and statistics at Datong University, China. His research interests include nonparametric and semiparametric statistics based on empirical likelihood, selection biased data and finite mixture models.

References

  • Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745–6750. doi: 10.1073/pnas.96.12.6745
  • Bakin, S. (1999). Adaptive regression and model selection in data mining problems (Ph.D. thesis). Australian National University, Canberra.
  • Breheny, P., & Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and its Interface, 2(3), 369. doi: 10.4310/SII.2009.v2.n3.a10
  • Breheny, P., & Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25(2), 173–187. doi: 10.1007/s11222-013-9424-2
  • Fan, J., Feng, Y., & Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494), 544–557. doi: 10.1198/jasa.2011.tm09779
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 70(5), 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
  • Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional variable selection: Beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.
  • Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38(6), 3567–3604. doi: 10.1214/10-AOS798
  • He, X., Wang, L., & Hong, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics, 41(1), 342–369. doi: 10.1214/13-AOS1087
  • Huang, J., Ma, S., Xie, H., & Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355. doi: 10.1093/biomet/asp020
  • Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499), 1129–1139. doi: 10.1080/01621459.2012.695654
  • Shao, X., & Zhang, J. (2014). Martingale difference correlation and its use in high-dimensional variable screening. Journal of the American Statistical Association, 109(507), 1302–1318. doi: 10.1080/01621459.2014.887012
  • Van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer.
  • Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. doi: 10.1198/jasa.2008.tm08516
  • Wei, F., & Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16(4), 1369–1384. doi: 10.3150/10-BEJ252
  • Yang, Y., & Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129–1141. doi: 10.1007/s11222-014-9498-5
  • Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 68(1), 49–67. doi: 10.1111/j.1467-9868.2005.00532.x
  • Zhang, C. H., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4), 1567–1594. doi: 10.1214/07-AOS520
  • Zhao, S. D., & Li, Y. (2012). Principled sure independence screening for Cox models with ultra-high-dimensional covariates. Journal of Multivariate Analysis, 105(1), 397–411. doi: 10.1016/j.jmva.2011.08.002
  • Zhao, P., Rocha, G., & Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A), 3468–3497. doi: 10.1214/07-AOS584
  • Zhong, W., & Zhu, L.-P. (2015). An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85(11), 2331–2345. doi: 10.1080/00949655.2014.928820

Appendix 1.

Three lemmas

Next, we state some lemmas which will be used in the proof of Theorems 3.1 and 3.2.

Lemma A.1

Under conditions (i)–(iii), for any $\delta>0$ and $j=1,2,\dots,J$, we have
$P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\ge\frac{\delta}{n}\Big)\le 4p_j\exp\{-\delta^2/(c_2 n+c_3\delta)\},$
where $c_2=\max(8M_0^2M_1^2,16M_2)$ and $c_3=\max\big(\frac{1}{3}M_0M_1,1\big)$.

Proof.

Using Bonferroni's inequality, we can easily show that
$P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\ge\frac{\delta}{n}\Big)\le P\Big(\sum_{k=1}^{p_j}\Big(\frac{1}{n}\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big)^2\ge\frac{\delta^2}{n^2}\Big)\le\sum_{k=1}^{p_j}P\Big(\Big|\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big|\ge\delta\Big).$
Thus it suffices to show that
$P\Big(\Big|\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big|\ge\delta\Big)\le 4\exp\{-\delta^2/(c_2 n+c_3\delta)\}$
for every $k=1,2,\dots,p_j$. Recall that the support of $X_{jk}$ is $[a,b]$ for $j=1,2,\dots,J$ and $k=1,2,\dots,p_j$, and denote $M_0=\max(|a|,|b|)$. Because $y_i=\sum_{l=1}^{J}x_{il}^{T}\beta_l+\varepsilon_i$, we can write
$x_{ijk}y_i-EX_{jk}Y=\Big(x_{ijk}\sum_{l=1}^{J}x_{il}^{T}\beta_l-E\Big[X_{jk}\sum_{l=1}^{J}X_l^{T}\beta_l\Big]\Big)+x_{ijk}\varepsilon_i\ \hat{=}\ S_{ijk1}+S_{ijk2}.$
Next we bound the tail probabilities of $|S_{ijk1}|$ and $|S_{ijk2}|$ respectively. By conditions (ii)–(iii), it is easy to see that
$|S_{ijk1}|\le M_0M_1,\qquad \operatorname{Var}(S_{ijk1})\le M_0^2M_1^2,\qquad E|S_{ijk2}|^m\le E\big[M_0^m|\varepsilon_i|^m\big]\le m!\,E\exp\{M_0|\varepsilon_i|\}\le M_2\,m!\quad(m\ge 2).$
Using Bernstein's inequality (Van der Vaart & Wellner, Citation1996, Lemmas 2.2.9 and 2.2.11), we conclude that
(A1) $P\Big(\Big|\sum_{i=1}^{n}S_{ijk1}\Big|>\frac{\delta}{2}\Big)\le 2\exp\Big\{-\frac{\delta^2}{8}\,\frac{1}{nM_0^2M_1^2+M_0M_1\delta/6}\Big\},$
(A2) $P\Big(\Big|\sum_{i=1}^{n}S_{ijk2}\Big|>\frac{\delta}{2}\Big)\le 2\exp\Big\{-\frac{\delta^2}{8}\,\frac{1}{2nM_2+\delta/2}\Big\}.$
Combining (A1) and (A2) with $c_2=\max(8M_0^2M_1^2,16M_2)$ and $c_3=\max\big(\frac{1}{3}M_0M_1,1\big)$, we obtain
$P\Big(\Big|\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big|\ge\delta\Big)\le 4\exp\{-\delta^2/(c_2 n+c_3\delta)\}.$
This concludes the proof of the lemma.

Lemma A.2

Under conditions (ii)–(iv), for any $\delta>0$ and $j=1,2,\dots,J$, we have
$P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|\ge\frac{\delta}{n}\Big)\le 2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\},$
where $c_4=2M_0^4$ and $c_5=\frac{4M_0^2}{3}$.

Proof.

For $s,t=1,2,\dots,p_j$, let $T_j=\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}$ and let $T_j(s,t)$ denote its $(s,t)$ entry, so that $T_j(s,t)=\frac{1}{n}\sum_{i=1}^{n}(x_{ijs}x_{ijt}-EX_{js}X_{jt})$.

By the fact that $\|A\|\le p\|A\|_\infty$, we have
(A3) $P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|\ge\frac{\delta}{n}\Big)\le P\Big(\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|_\infty\ge\frac{\delta}{n}\Big)\le\sum_{s=1}^{p_j}\sum_{t=1}^{p_j}P\Big(|T_j(s,t)|\ge\frac{\delta}{n}\Big).$
Next we again use Bernstein's inequality to bound the tail probability of $T_j(s,t)$. By conditions (ii)–(iii), we easily obtain $|x_{ijs}x_{ijt}-EX_{js}X_{jt}|\le 2M_0^2$ and $\operatorname{Var}(x_{ijs}x_{ijt})\le M_0^4$. Using Bernstein's inequality, it follows that
(A4) $P\Big(|T_j(s,t)|\ge\frac{\delta}{n}\Big)\le 2\exp\Big\{-\frac{1}{2}\,\frac{\delta^2}{nM_0^4+2M_0^2\delta/3}\Big\}.$
Thus the desired result follows from (A3) and (A4) by taking $c_4=2M_0^4$ and $c_5=\frac{4M_0^2}{3}$.

Remark A.1

If $A$ and $B$ are two symmetric matrices of order $p$, we have the following two results (Fan, Feng, & Song, Citation2011; He et al., Citation2013):
$|\lambda_{\min}(A)-\lambda_{\min}(B)|\le\max\{|\lambda_{\min}(A-B)|,|\lambda_{\min}(B-A)|\},\qquad |\lambda_{\max}(A)-\lambda_{\max}(B)|\le\max\{|\lambda_{\max}(A-B)|,|\lambda_{\max}(B-A)|\}.$
In addition, note that $\max\{|\lambda_{\min}(A-B)|,|\lambda_{\max}(A-B)|\}\le p\|A-B\|_\infty$. These results, together with Lemma A.2, imply that
(A5) $P\Big(\Big|\lambda_{\min}\Big(\frac{1}{n}x_j^{T}x_j\Big)-\lambda_{\min}(EX_jX_j^{T})\Big|\ge\frac{p_j}{n}\delta\Big)\le 2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\},$
(A6) $P\Big(\Big|\lambda_{\max}\Big(\frac{1}{n}x_j^{T}x_j\Big)-\lambda_{\max}(EX_jX_j^{T})\Big|\ge\frac{p_j}{n}\delta\Big)\le 2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\}$
for $j=1,2,\dots,J$.

Lemma A.3

Suppose conditions (ii)–(iv) hold. Then there exist positive constants $\tau_3$ and $\tau_4$ such that
(A7) $P\Big(\tau_3\le\lambda_{\min}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\lambda_{\max}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\tau_4\Big)\ge 1-2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\}.$
That is, with probability approaching 1, we have $0<\tau_3\le\lambda_{\min}\big(\frac{1}{n}x_j^{T}x_j\big)\le\lambda_{\max}\big(\frac{1}{n}x_j^{T}x_j\big)\le\tau_4<\infty$.

Proof.

Combining condition (iv) with (A5)–(A6), it is easy to obtain (A7).

Appendix 2.

Proof of Theorem 3.1

Proof of Theorem 3.1.

The key idea of the proof is to establish the uniform consistency of $\|\hat{\upsilon}_{nj}\|_n^2$ under conditions (ii)–(iv); as in the existing literature, the sure screening property is typically established in this way. Recall that
$\|\hat{\upsilon}_{nj}\|_n^2=\Big(\frac{1}{n}x_j^{T}y\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(\frac{1}{n}x_j^{T}y\Big)\quad\text{and}\quad\|\upsilon_j\|^2=(EX_jY)^{T}(EX_jX_j^{T})^{-1}(EX_jY).$
Thus we need to evaluate
$\frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2-\frac{1}{p_j}\|\upsilon_j\|^2=\frac{1}{p_j}\Big[\Big(\frac{1}{n}x_j^{T}y\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(\frac{1}{n}x_j^{T}y\Big)-(EX_jY)^{T}(EX_jX_j^{T})^{-1}(EX_jY)\Big].$
By some algebra, we decompose the difference into three parts, $\|\hat{\upsilon}_{nj}\|_n^2-\|\upsilon_j\|^2=\lambda_1+\lambda_2+\lambda_3$, in which
$\lambda_1=\Big(\frac{1}{n}x_j^{T}y-EX_jY\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(\frac{1}{n}x_j^{T}y-EX_jY\Big),$
$\lambda_2=2\Big(\frac{1}{n}x_j^{T}y-EX_jY\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}EX_jY,$
$\lambda_3=(EX_jY)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(EX_jX_j^{T}-\frac{1}{n}x_j^{T}x_j\Big)(EX_jX_j^{T})^{-1}EX_jY.$
Now define the event $\Omega_\delta$ on which
$\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\le\frac{\delta}{n},\qquad\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|\le\frac{\delta}{n},\qquad\tau_3\le\lambda_{\min}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\lambda_{\max}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\tau_4$
for $j=1,2,\dots,J$.

Then the above three lemmas indicate that
(A8) $P(\Omega_\delta)\ge 1-4\sum_{j=1}^{J}p_j\exp\{-\delta^2/(c_2 n+c_3\delta)\}-4\sum_{j=1}^{J}p_j^2\exp\{-\delta^2/(c_4 n+c_5\delta)\}.$
By the fact that $\|AB\|\le\|A\|\,\|B\|$, we have on $\Omega_\delta$
$\frac{1}{p_j}|\lambda_1|\le\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|^2\Big\|\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big\|\le\frac{\delta^2}{n^2}\,\frac{1}{\tau_3},$
$\frac{1}{p_j}|\lambda_2|\le 2\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\,\Big\|\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big\|\,\|EX_jY\|\le\frac{\delta}{n}\,\frac{2M_0M_1}{\tau_3},$
$\frac{1}{p_j}|\lambda_3|\le\Big\|\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big\|\,\big\|(EX_jX_j^{T})^{-1}\big\|\,\Big\|EX_jX_j^{T}-\frac{1}{n}x_j^{T}x_j\Big\|\,\|EX_jY\|^2\le\frac{\delta}{n}\,\frac{M_0M_1}{\tau_1\tau_3}.$
Take $\delta=c_6 n^{1-\kappa}$; then there exists a constant $c_7$ such that
$\frac{\delta^2}{n^2}\,\frac{1}{\tau_3}+\frac{\delta}{n}\,\frac{2M_0M_1}{\tau_3}+\frac{\delta}{n}\,\frac{M_0M_1}{\tau_1\tau_3}\le c_6c_7 n^{-\kappa}.$
Choosing $c_6$ such that $c_6c_7\le c/2$, we easily obtain
$\Big|\frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2-\frac{1}{p_j}\|\upsilon_j\|^2\Big|\le\frac{c}{2}\,n^{-\kappa}.$
Invoking condition (i), we have on $\Omega_\delta$, for sufficiently large $n$,
$\frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge\frac{c}{2}\,n^{-\kappa}\quad\text{for every } j\in\mathcal{M}_{*}.$
If we choose $\pi_n\le\frac{c}{2}\,n^{-\kappa}$, it follows that $\mathcal{M}_{*}\subset\hat{\mathcal{M}}_{\kappa}$ on $\Omega_\delta$. This, together with (A8), indicates that there exists a constant $c_1$ such that
$P\{\mathcal{M}_{*}\subset\hat{\mathcal{M}}_{\kappa}\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$

Appendix 3.

Proof of Theorem 3.2

Proof of Theorem 3.2.

Following arguments similar to those in the proof of Theorem 3.1, we have on $\Omega_\delta$
$\Big|\Big\{1\le j\le J:\ \frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge 2c\,n^{-\kappa}\Big\}\Big|\le\Big|\Big\{1\le j\le J:\ \frac{1}{p_j}\|\upsilon_j\|^2\ge c\,n^{-\kappa}\Big\}\Big|,$
where $|\cdot|$ denotes the cardinality of a set. This implies that
$\sum_{j\in\hat{\mathcal{M}}_{\kappa}}\frac{1}{p_j}\|\upsilon_j\|^2\ge c\,n^{-\kappa}|\hat{\mathcal{M}}_{\kappa}|.$
By some algebra, it follows that
$|\hat{\mathcal{M}}_{\kappa}|\le O\Big(n^{\kappa}\sum_{j=1}^{J}\|EX_jY\|^2\Big)=O\big(n^{\kappa}\|EXY\|^2\big).$
That is, we have
(A9) $P\big\{|\hat{\mathcal{M}}_{\kappa}|\le O\big(n^{\kappa}\|EXY\|^2\big)\big\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$
Thus the key point is to show that $\|EXY\|^2=O(\lambda_{\max}(\Sigma))$. For this purpose, we consider the following linear regression:
$\min_{\alpha}E(Y-X^{T}\alpha)^2$

with respect to $\alpha\in\mathbb{R}^{\sum_{j=1}^{J}p_j}$. By least squares, we obtain
$\|EXY\|^2=\hat{\alpha}^{T}[E(XX^{T})]^2\hat{\alpha}\le\lambda_{\max}(\Sigma)\,\hat{\alpha}^{T}E(XX^{T})\hat{\alpha},$
in which $\hat{\alpha}$ is the least-squares solution. On the other hand, the orthogonal decomposition of least squares implies that $\operatorname{Var}(Y)=\operatorname{Var}(X^{T}\hat{\alpha})+\operatorname{Var}(Y-X^{T}\hat{\alpha})$. Because $\operatorname{Var}(Y)=O(1)$, we conclude that
(A10) $\|EXY\|^2\le\lambda_{\max}(\Sigma)\,O(1).$
Combining (A9) and (A10), the desired result follows.
