
Group screening for ultra-high-dimensional feature under linear model

Pages 43-54 | Received 18 Jul 2018, Accepted 17 Jun 2019, Published online: 04 Jul 2019

Abstract

Ultra-high-dimensional data with grouping structures arise naturally in many contemporary statistical problems, such as genome-wide association studies and multi-factor analysis of variance (ANOVA). To address this issue, we propose a group screening method to perform variable selection on groups of variables in linear models. The method is based on working independence, and a sure screening property is established for our approach. To enhance the finite sample performance, a data-driven thresholding rule and a two-stage iterative procedure are developed. To the best of our knowledge, screening for grouped variables has rarely appeared in the literature, and the proposed method can be regarded as an important and non-trivial extension of screening for individual variables. An extensive simulation study and a real data analysis demonstrate its finite sample performance.

AMS 2000 Subject Classifications:

1. Introduction

Nowadays, grouped predictors arise naturally in many regression problems. That is, we are interested in finding the predictors relevant to modelling the response variable, where each predictor may be represented by a group of indicator variables or a set of basis functions. Grouping structures can be introduced into a regression model in the hope that prior knowledge about the predictors is used to the full. Thus, grouping structure problems have become increasingly important in various research fields. One common example is the representation of a multi-level analysis-of-variance (ANOVA) model as a regression with groups of derived input variables; the aim of ANOVA is often to select relevant main factors and interactions, that is, to select groups of derived input variables. Another example is the additive model with nonparametric components, where each component can be expressed as a linear combination of a set of basis functions of the original predictors. In both cases, variable selection amounts to the selection of groups of variables rather than individual derived variables.

Using penalised methods, many researchers have considered group selection problems in various parametric and nonparametric regression models. These articles include, but are not limited to, the following. Bakin (Citation1999) proposed the group LASSO in his doctoral dissertation. Yuan and Lin (Citation2006) further studied the group LASSO and related group selection methods, such as the group LARS and the group Garrote, and proposed the corresponding algorithms, but they did not give any asymptotic properties of the group LASSO. Wei and Huang (Citation2010) showed that, under a generalised sparsity condition and the sparse Riesz condition proposed by Zhang and Huang (Citation2008), together with some regularity conditions, the group LASSO selects a model of the same order as the underlying model. They also established the asymptotic properties of the adaptive group LASSO, which can correctly select groups with probability tending to one. Under generalised linear models, Breheny and Huang (Citation2009) established a general framework for simultaneous group and individual variable selection, or bi-level selection, together with the corresponding local coordinate descent algorithm. In addition to the group LASSO, many authors have proposed other methods for various parametric models. For example, Huang, Ma, Xie, and Zhang (Citation2009) showed that simultaneous group and individual variable selection can be conducted by a group bridge method, which correctly selects relevant groups with probability tending to one. Moreover, Zhao, Rocha, and Yu (Citation2009) introduced a quite general composite penalty for group selection by combining different norms to form an intelligent penalty. All these methods are very useful when the number of predictors is moderate, that is, smaller than or comparable with the sample size. However, with the rapid progress of computing power and modern technology for data collection, massive amounts of ultra-high-dimensional data are frequently seen in diverse fields of scientific research. Due to the 'curse of dimensionality', in terms of simultaneous challenges to computational expediency, statistical accuracy and algorithmic stability, the above methods are limited in handling ultra-high-dimensional problems.

In the seminal work of Fan and Lv (Citation2008), a new framework for sure independence screening (SIS) was established. They showed that the method based on Pearson correlation learning possesses a sure screening property for linear regressions. That is, all relevant predictors can be selected with probability tending to one even if the number of predictors $p$ grows much faster than the number of observations $n$, with $\log p = O(n^{\alpha})$ for some $\alpha \in (0, \frac{1}{2})$. Following Fan and Lv (Citation2008), we call this non-polynomial dimensionality or ultra-high dimensionality. Since then, various screening methods, model-based or model-free, have been developed (Fan & Lv, Citation2008; Fan, Samworth, & Wu, Citation2009; Fan & Song, Citation2010; He, Wang, & Hong, Citation2013; Li, Zhong, & Zhu, Citation2012; Shao & Zhang, Citation2014; Wang, Citation2009; Zhao & Li, Citation2012). However, all these screening methods deal with individual variables rather than grouped predictors. To the best of our knowledge, screening methods for grouped predictors are quite limited in the existing literature. Thus it is very important to propose a new screening method based on grouped predictors.

Motivated by the theory of SIS, we consider how to deal with ultra-high-dimensional grouped predictors under the assumption of a linear regression model. With grouping structures, the linear regression model is
(1) $Y=\sum_{j=1}^{J}X_j^{T}\beta_j+\varepsilon$,
where $Y$ is the response variable, $X_j=(X_{j1},\dots,X_{jp_j})^{T}$ is a $p_j\times 1$ random vector representing the $j$th group, $\beta_j=(\beta_{j1},\dots,\beta_{jp_j})^{T}$ is the $p_j\times 1$ parameter vector corresponding to the $j$th group of predictors, and $\varepsilon$ is the random error with mean 0.

Our method is a two-stage approach. First, an efficient screening procedure is employed to reduce the number of group predictors to a moderate order below the sample size, and then existing group selection methods can be used to recover the final sparse model. For the sake of speed and efficiency in screening group predictors, we consider an independence screening method that ranks the magnitudes of marginal estimators based on each grouped predictor. That is, we fit $J$ marginal linear regressions of the response $Y$ against the variables of the $j$th group respectively, and then select the relevant group predictors by a measure of the goodness of fit of each marginal linear regression. Under some mild conditions, we show that there is a significant difference between relevant and irrelevant group predictors in the strength of these marginal utilities, so we can distinguish the active group predictors from the much larger set of inactive ones. Next, existing group selection methods, such as the group LASSO, the group SCAD (Breheny & Huang, Citation2015) and the group MCP (Breheny & Huang, Citation2015), can be used to obtain the final sparse model. We refer to our screening procedure as the Group-SIS, and theoretically establish the sure screening property of our approach. In order to further reduce the false-positive rate, we propose an iterative version of the algorithm, named ISIS-Group-Lasso. To enhance performance and speed up the computation of ISIS-Group-Lasso, a greedy modification of the iterative algorithm, named g-ISIS-Group-Lasso, is also developed. Our simulation studies indicate that ISIS-Group-Lasso and g-ISIS-Group-Lasso significantly outperform competing group selection methods, such as distance correlation-based screening and the group LASSO, especially when the dimensionality is ultra-high.

The rest of the article is organised as follows. In Section 2, we introduce a marginal group SIS in linear regression models. Under some mild conditions, the sure screening property and model selection consistency of the Group-SIS will be established in Section 3. In Section 4, simulation studies and a real data analysis are carried out to assess the performance of our method. Concluding remarks are given in Section 5. All technical proofs for the main theoretical results are given in the Appendix.

2. Group SIS and iterative algorithm

2.1. Marginal linear regression based on grouped predictors

Suppose that we have $n$ random samples from model (1) of the form
(2) $y_i=\sum_{j=1}^{J}x_{ij}^{T}\beta_j+\varepsilon_i,\quad i=1,2,\dots,n$,
in which $x_{ij}=(x_{ij1},\dots,x_{ijp_j})^{T}$, and $y_i$ and $\varepsilon_i$ are scalars. Let $y=(y_1,\dots,y_n)^{T}$, $x_j=(x_{1j},\dots,x_{nj})^{T}$ and $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^{T}$, where $x_j$ is the $n\times p_j$ design matrix corresponding to the $j$th group for each $j=1,2,\dots,J$. Then model (2) can be rewritten as
(3) $y=\sum_{j=1}^{J}x_j\beta_j+\varepsilon$.
For simplicity, the number of variables in each group is uniformly bounded; that is, there exists a positive constant $K$ such that $p_j\le K$ for $j=1,2,\dots,J$. To rapidly select the relevant grouped predictors, we consider the following $J$ marginal linear regressions against the grouped predictors:
(4) $\min_{\gamma_j\in\mathbb{R}^{p_j}}\sum_{i=1}^{n}(y_i-x_{ij}^{T}\gamma_j)^2$,
where $\gamma_j=(\gamma_{j1},\dots,\gamma_{jp_j})^{T}$ is a $p_j$-dimensional vector for each $j=1,2,\dots,J$.

It is easy to see that the minimiser of (4) is $\hat{\gamma}_j=(x_j^{T}x_j)^{-1}x_j^{T}y$. We then define the marginal utility of the $j$th grouped predictor as
(5) $\|\hat{\upsilon}_{nj}\|_n^2\equiv\frac{1}{n}\sum_{i=1}^{n}(\hat{\gamma}_j^{T}x_{ij})^2=\frac{1}{n}y^{T}x_j(x_j^{T}x_j)^{-1}x_j^{T}y.$
We now select a set of relevant grouped predictors as
$\hat{\mathcal{M}}_{\kappa}=\Big\{1\le j\le J:\ \frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge\pi_n\Big\},$
where $\kappa$ is a positive constant and $\pi_n$ is a pre-specified threshold value which will be given later.
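To make the marginal utility concrete, here is a minimal computational sketch in NumPy (the function names group_utility and group_sis, and the use of a least-squares solver, are our own illustrative choices, not part of the paper):

```python
import numpy as np

def group_utility(y, x_j):
    """Marginal utility of one group: (1/n) * y' x_j (x_j' x_j)^{-1} x_j' y."""
    n = len(y)
    # gamma_hat = (x_j' x_j)^{-1} x_j' y, computed via a numerically stable least-squares solve
    gamma_hat, *_ = np.linalg.lstsq(x_j, y, rcond=None)
    fitted = x_j @ gamma_hat
    return fitted @ fitted / n

def group_sis(y, x_groups, pi_n):
    """Group-SIS: keep the groups whose size-adjusted utility exceeds the threshold pi_n."""
    utils = np.array([group_utility(y, x_j) / x_j.shape[1] for x_j in x_groups])
    return np.where(utils >= pi_n)[0], utils
```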

Equivalently, one can define the screening criterion by ranking the residual sum of squares of the corresponding marginal linear regressions. Either way reduces the group dimensionality from $J$ to a moderate size $|\hat{\mathcal{M}}_{\kappa}|$. Here, the pre-specified threshold value is crucial to the screening procedure. If we choose it too small, we may retain many irrelevant grouped predictors in the final model; if we choose it too large, we risk losing some important variables. In short, we should select all of the relevant grouped predictors and control the selected model size simultaneously. In Section 3, we show theoretically that this group screening approach possesses a sure screening property and that the final model size is only of polynomial order. Note that our method reduces to traditional feature screening when each group contains a single variable. In this sense, it can be regarded as a non-trivial extension of individual feature screening.

2.2. Iterative group-SIS algorithm

For ultra-high-dimensional group variable selection problems, we propose a two-stage procedure. First, we apply a sure screening method such as Group-SIS to reduce the number of groups from $J$ to a relatively large scale $d$, where the dimensionality of the selected $d$ groups is below the sample size $n$. Then we can use a lower-dimensional group-wise variable selection procedure, such as the group Lasso, group SCAD or group MCP. In this article, we use the group Lasso penalty as our group selection strategy; other group variable selection methods would also work.

However, as Fan and Lv (Citation2008) point out, such a marginal independence screening method can still suffer from false negatives (i.e., missing some important group predictors that are marginally uncorrelated, but jointly correlated, with the response) and false positives (i.e., selecting some unimportant group predictors that have higher marginal correlation with the response than some important group predictors). Therefore, we propose an iterative framework to enhance the finite sample performance of this screening method. That is, we iteratively alternate between large-scale group screening and moderate-scale group variable selection.

To obtain a data-driven threshold for independence group screening, we extend the random permutation idea of Zhao and Li (Citation2012), which allows a small proportion of inactive variables to enter the model at each screening step. Let $x=(x_1,x_2,\dots,x_J)$ and randomly permute the rows of $x$ to obtain the decoupled data $(\tilde{x},y)$. Based on the randomly decoupled data $(\tilde{x},y)$, in which there is no relationship between the group variables and the response, we compute the values $\|\hat{\upsilon}^{*}_{nj}\|_n^2$, defined analogously to $\|\hat{\upsilon}_{nj}\|_n^2$, for $j=1,2,\dots,J$. These values serve as a baseline for the marginal group screening utilities under the null model (no relationship between group variables and response). To obtain the screening threshold, we choose $\omega_q$ as the $q$th largest of $\{\|\hat{\upsilon}^{*}_{nj}\|_n^2,\ j=1,2,\dots,J\}$. In our simulations, we use $q=1$, namely the largest marginal group screening utility under the null model. For the sake of completeness, our ISIS-Group-Lasso algorithm proceeds as follows.
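A sketch of this permutation-based threshold is given below; it reuses the group_utility helper from the earlier sketch, and the function name permutation_threshold is ours:

```python
import numpy as np

def permutation_threshold(y, x_groups, q=1, rng=None):
    """Data-driven threshold omega_q: the q-th largest size-adjusted marginal group
    utility computed on row-permuted (decoupled) data."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(y))  # one permutation applied to every group decouples X from y
    null_utils = np.array([
        group_utility(y, x_j[perm]) / x_j.shape[1]  # group_utility() as in the sketch above
        for x_j in x_groups
    ])
    return np.sort(null_utils)[-q]  # q-th largest null utility
```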

Step 1. Compute the $J$ marginal utilities $\|\hat{\upsilon}_{nj}\|_n^2$, and choose the initial index subset as $\mathcal{A}_1=\{1\le j\le J:\ \frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge\omega_q\}$.

Step 2. Apply the group Lasso (Breheny & Huang, Citation2009) on the index subset $\mathcal{A}_1$ to obtain a subset $\mathcal{M}_1$. In this step, we choose the regularisation parameter by the Bayesian Information Criterion (BIC).

Step 3. Conditioning on $\mathcal{M}_1$, compute the marginal regression
$\hat{\upsilon}_{nj}=\min_{\gamma_j\in\mathbb{R}^{p_j}}\sum_{i=1}^{n}\Big(y_i-\sum_{\tilde{k}\in\mathcal{M}_1}x_{i\tilde{k}}^{T}\gamma_{\tilde{k}}-x_{ij}^{T}\gamma_j\Big)^2$
for each $j\in\mathcal{M}_1^{c}$. By randomly permuting only the groups not in $\mathcal{M}_1$, we obtain a new index subset $\mathcal{A}_2$ as in Step 1. Apply the group LASSO on the index subset $\mathcal{A}_2\cup\mathcal{M}_1$ to obtain a new subset $\mathcal{M}_2$.

Step 4. Repeat the process until we obtain a final index set $\mathcal{A}_k$ such that $|\mathcal{A}_k|\ge k_0$ or $\mathcal{A}_k=\mathcal{A}_l$ for some $l<k$. A schematic implementation of Steps 1–4 is sketched below.
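The following skeleton, which reuses group_utility and permutation_threshold from the sketches above, shows one possible arrangement of Steps 1–4. The group-wise selector is passed in as a callable because any BIC-tuned group Lasso/SCAD/MCP implementation could be plugged in, and the conditional utility of Step 3 is approximated here by screening the least-squares residuals given the current model; both are our simplifications, not the authors' exact implementation:

```python
import numpy as np

def residual_given(y, x_groups, idx):
    """Least-squares residual of y on the concatenated groups indexed by idx."""
    if not idx:
        return y
    X = np.hstack([x_groups[j] for j in idx])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def isis_group_screen(y, x_groups, group_select, q=1, k0=50, max_iter=10):
    """Skeleton of Steps 1-4; group_select(y, x_groups, idx) -> subset of idx
    stands in for a group Lasso tuned by BIC."""
    J = len(x_groups)

    def screen(r, idx):
        # marginal screening of the groups in idx against the working response r
        thr = permutation_threshold(r, [x_groups[j] for j in idx], q=q)
        u = np.array([group_utility(r, x_groups[j]) / x_groups[j].shape[1] for j in idx])
        return {idx[i] for i in np.where(u >= thr)[0]}

    A = screen(y, list(range(J)))                      # Step 1: initial index subset A_1
    M = set(group_select(y, x_groups, sorted(A)))      # Step 2: group selection gives M_1
    history = [A]
    for _ in range(max_iter):                          # Steps 3-4
        rest = [j for j in range(J) if j not in M]
        if not rest:
            break
        A = screen(residual_given(y, x_groups, sorted(M)), rest) | M
        M = set(group_select(y, x_groups, sorted(A)))
        if len(A) >= k0 or any(A == prev for prev in history):
            break                                      # Step 4: stop if too large or repeating
        history.append(A)
    return sorted(M)
```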

In order to further reduce false positives and speed up computation, we propose a greedy modification to enhance the finite sample performance of the above algorithm. Specifically, we restrict the number of groups recruited in each iterative screening step to be at most $J_0$, a small positive integer, and the procedure stops when no group predictor is recruited. In our simulations, we set $J_0=1$. This greedy version of the ISIS-Group-Lasso algorithm is called g-ISIS-Group-Lasso. When $J_0=1$, the method is connected with forward regression screening (Wang, Citation2009), which selects at most one new group predictor into the model at a time. However, there is a key difference between the two methods: our method includes a deletion step via group selection that can remove multiple group predictors. This makes our procedure more effective, because it is more flexible in recruiting and deleting group predictors. Based on the simulation results in Section 4, g-ISIS-Group-Lasso outperforms the other methods in terms of a lower false-positive rate, a higher percentage of correctly selected models and a smaller model error.

3. Theoretical properties

Before establishing the sure screening property of our method for linear models, we introduce some notation. Denote the Euclidean norm and the sup norm of a vector $\alpha$ by $\|\alpha\|$ and $\|\alpha\|_\infty$, respectively. For any symmetric matrix $A$, let $\|A\|_\infty=\max_{i,j}|A_{ij}|$ be the infinity norm and $\|A\|$ the operator norm, and let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ be the minimum and maximum eigenvalues of $A$. Let $X=(X_1^{T},X_2^{T},\dots,X_J^{T})^{T}$ and $E(XX^{T})=\Sigma$. For each $j=1,2,\dots,J$ and $k=1,2,\dots,p_j$, let $[a,b]$ be the support of $X_{jk}$. Define the index set of the true model $\mathcal{M}_{*}$ by $\mathcal{M}_{*}=\{1\le j\le J:\ \|\beta_j\|\neq 0\}$. To gain theoretical insight into the Group-SIS, we define $\upsilon_j$, the population version of (4), through the minimisation problem $\min E(Y-\upsilon_j)^2\equiv\min_{\gamma_j\in\mathbb{R}^{p_j}}E(Y-X_j^{T}\gamma_j)^2$ with respect to $\gamma_j\in\mathbb{R}^{p_j}$. Then $\upsilon_j=X_j^{T}(EX_jX_j^{T})^{-1}EX_jY$. Similar to (5), we define $\|\upsilon_j\|^2\equiv E\big(X_j^{T}(EX_jX_j^{T})^{-1}EX_jY\big)^2=(EX_jY)^{T}(EX_jX_j^{T})^{-1}(EX_jY)$. Next we collect the technical assumptions needed to establish the sure screening property of our group screening method.

  (i) $\frac{1}{p_j}\min_{j\in\mathcal{M}_{*}}\|\upsilon_j\|^2\ge c\,n^{-\kappa}$ for some $0<\kappa<\frac{1}{2}$ and $c>0$.

  (ii) $\big|\sum_{j=1}^{J}X_j^{T}\beta_j\big|<M_1$ for some constant $M_1>0$.

  (iii) For any $B>0$ and $i=1,2,\dots,n$, there is a positive constant $M_2$ such that $E[\exp\{B|\varepsilon_i|\}]<M_2$.

  (iv) For $j=1,2,\dots,J$, the eigenvalues of $\Sigma_j=EX_jX_j^{T}$ are bounded away from zero and infinity; that is, there are positive constants $\tau_1$ and $\tau_2$ such that $0<\tau_1\le\lambda_{\min}(\Sigma_j)\le\lambda_{\max}(\Sigma_j)\le\tau_2<\infty$.

Condition (i) specifies the minimum signal of the relevant grouped predictors; that is, the magnitudes of the marginal utilities of the grouped predictors preserve the non-sparsity signal of the true model, with the rate $n^{-\kappa}$ controlling how fast the minimum signal may vanish. Such a condition is common in the screening literature and is important because it guarantees that the marginal utilities carry information about the relevant covariates in the active set. Conditions (ii) and (iii) are two mild conditions needed for applying Bernstein's inequality (Van der Vaart & Wellner, Citation1996). Because we allow $|\mathcal{M}_{*}|$ to increase with $n$, condition (ii) ensures the boundedness of $\sum_{j=1}^{J}X_j^{T}\beta_j$. Condition (iv) is also easy to satisfy, since the number of variables in each grouped predictor is small.

Remark 3.1

The above assumptions serve only to help us understand the new group screening methodology. These conditions are imposed to facilitate the technical proofs, and relaxing them is an interesting topic for future research.

The following Theorem 3.1 provides the sure screening property of our group screening method.

Theorem 3.1

Suppose that conditions (i)–(iv) hold. Then there exists a constant $c_1$ such that
$P\{\mathcal{M}_{*}\subset\hat{\mathcal{M}}_{\kappa}\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$

Theorem 3.1 indicates that we can select all the relevant grouped predictors with probability tending to 1; the key to its proof is the uniform consistency of $\|\hat{\upsilon}_{nj}\|_n^2$ for $\|\upsilon_j\|^2$. Denote $\sum_{j=1}^{J}p_j=p$. Since $p_j\le K$ uniformly, the dimensionality that can be handled is as high as $\log p=o(n^{1-2\kappa})$. That is, under some mild conditions, our method has the sure screening property and can reduce the exponentially growing dimension $p$ to a relatively moderate scale, to which group variable selection is then applied. We emphasise that $\kappa$ is very important to our screening procedure: the smaller $\kappa$ is, the higher the dimensionality our method can handle.

On the other hand, although we can select the relevant grouped predictors with probability tending to 1, the cardinality of $\hat{\mathcal{M}}_{\kappa}$ may be relatively large compared with the sample size; that is, many unimportant grouped predictors may remain in the selected model. Thus controlling the false-positive rate is also necessary for our method. From a practical point of view, we have proposed an iterative algorithm in Section 2.2 to enhance the performance of our method in terms of false selection rates. Under the same conditions as in Theorem 3.1, we show theoretically that the size of the selected model is at most of order $n^{\kappa}\lambda_{\max}(\Sigma)$ with probability tending to one exponentially fast.

Theorem 3.2

Suppose that conditions (i)–(iv) hold. Then there exists a positive constant $c_0$ such that
$P\{|\hat{\mathcal{M}}_{\kappa}|\le c_0 n^{\kappa}\lambda_{\max}(\Sigma)\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$

Theorem 3.2 shows that, with probability tending to 1, the size of the model selected by our procedure is of polynomial order whenever $\lambda_{\max}(\Sigma)$ is of polynomial order. This is crucial for the subsequent group selection stage and makes our two-stage approach much better than traditional group variable selection methods, because there is no guarantee that existing group selection methods can select the relevant grouped predictors consistently when too many irrelevant grouped predictors are present. Thus this theorem, together with Theorem 3.1, implies that we can select a model that includes all relevant grouped predictors and only a small number of irrelevant ones with high probability.

4. Numerical studies

4.1. Simulation results

In this section, we carry out simulation studies to demonstrate the finite sample performance of the group screening methods described in Section 2. We consider two group-size scenarios: in the first, the group sizes are equal; in the second, the group sizes vary. In our simulations, all groups have size 5 or 3. We set the sample size $n=200$, and three configurations with $J=200$, 400, 1000 groups are considered for generating the covariates $(x_1,x_2,\dots,x_J)$. For example, when $J=200$ and the group size is 5, the final predictor matrix has $p=1000$ variables. To gauge the difficulty of the simulation models, different signal-to-noise ratios (SNR) are used in Examples 4.1–4.4, where $\mathrm{SNR}=\operatorname{Var}\big(\sum_{j=1}^{J}X_j^{T}\beta_j\big)/\operatorname{Var}(\varepsilon)$. Obviously, the larger the SNR, the higher the probability that our group screening method selects the relevant groups. In all examples, the simulation results are based on 200 replications for each parameter setup.
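Since the difficulty of each example is calibrated through the SNR, one simple way to choose $\sigma$ (our own convenience, not prescribed by the paper) is to match a target SNR on a simulated signal:

```python
import numpy as np

def sigma_for_snr(signal, target_snr):
    """Return sigma so that Var(signal) / Var(sigma * eps) equals target_snr,
    where signal holds simulated values of sum_j X_j' beta_j and eps is standard normal."""
    return np.sqrt(np.var(signal) / target_snr)
```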

To further explore the finite sample performance of our methods, we create some unimportant group variables that are highly correlated with the response through their association with the important group variables. The correlation between group variables is specified as follows.

The group vector is $X=(X_1^{T},\dots,X_J^{T})^{T}$, where $X_j=(X_{j1},\dots,X_{jp_j})^{T}$, $j=1,2,\dots,J$. In the following four examples, two different group sizes and three configurations of the number of groups are considered. For illustration, suppose $p_j=5$ for each $j=1,\dots,J$. To generate $X$, we first simulate $5J$ random variables $T_1,\dots,T_{5J}$ independently from $N(0,1)$. Then $Z_1,\dots,Z_{50}$ are simulated from a multivariate normal distribution with mean 0 and $\operatorname{Cov}(Z_{j_1},Z_{j_2})=0.6^{|j_1-j_2|}$. For $k=1,\dots,5$, the group variables $X_{jk}$ are generated as
$X_{jk}=\begin{cases}\big(Z_j+T_{5(j-1)+k}\big)/\sqrt{2}, & j=1,\dots,50,\\ T_{5(j-1)+k}, & j=51,\dots,J.\end{cases}$
Here, only a few of the group variables are relevant, while most of the components are spurious variables that do not appear in the model but are correlated with the relevant group variables. The random error $\varepsilon_i$ is generated from a standard normal distribution; to ensure that the theoretical SNR is neither too weak nor too strong, $\varepsilon_i$ can be multiplied by a constant $\sigma$.
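A sketch of this covariate-generating mechanism for the equal-group-size case ($p_j=5$) is given below; the $1/\sqrt{2}$ scaling reflects our reading of the construction above, and the function name is ours:

```python
import numpy as np

def simulate_group_covariates(n, J, group_size=5, rho=0.6, rng=None):
    """First 50 groups share latent correlated components Z_j; the remaining groups are pure noise."""
    assert J >= 50, "the construction above assumes at least 50 groups"
    rng = np.random.default_rng(rng)
    T = rng.standard_normal((n, group_size * J))              # T_1,...,T_{5J} ~ iid N(0,1)
    idx = np.arange(50)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])        # Cov(Z_j1, Z_j2) = 0.6^{|j1-j2|}
    Z = rng.multivariate_normal(np.zeros(50), Sigma, size=n)  # (n, 50)
    X = T.copy()
    for j in range(50):
        cols = slice(group_size * j, group_size * (j + 1))
        X[:, cols] = (Z[:, [j]] + T[:, cols]) / np.sqrt(2)    # shared component induces correlation
    # return one (n, p_j) block per group
    return [X[:, group_size * j:group_size * (j + 1)] for j in range(J)]
```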

Extensive simulation studies have been conducted to demonstrate the finite sample performance of our group screening method. For comparison purposes, we also examine distance correlation screening (DC-SIS) (Li et al., Citation2012), a model-free screening method that replaces the Pearson correlation in marginal correlation screening with the distance correlation. For fairness, we first apply DC-SIS to reduce the number of groups to $[n/\log n]$ and then run a group-wise variable selection procedure, the group Lasso, to recover the final model; we call DC-SIS followed by the group Lasso DC-SIS-Group-Lasso. To enhance the performance of DC-SIS, Zhong and Zhu (Citation2015) proposed an iterative version of DC-SIS, named DC-ISIS, which is also included in the comparison; again for fairness, the group Lasso is applied after DC-ISIS, and the combination is referred to as DC-ISIS-Group-Lasso. The performance of the group Lasso alone is also examined. Thus, we have five methods under consideration (ISIS-Group-Lasso, g-ISIS-Group-Lasso, Group-Lasso, DC-SIS-Group-Lasso, DC-ISIS-Group-Lasso). In the following four examples, we report five performance measures: true positives (TP), false positives (FP), the median model size (MEDIAN), the percentage of occasions on which exactly the correct groups are selected (CORRECT) and the model error (ME) (Yuan & Lin, Citation2006).

Example 4.1

Each group consists of five variables, and the number of relevant groups is 4. We generate the response from the linear model $Y=X_1^{T}\beta_1+X_2^{T}\beta_2+X_3^{T}\beta_3+X_4^{T}\beta_4+\sigma\varepsilon$, where $\beta_1=(1.5,1.5,1,1,0.5)^{T}$, $\beta_2=(1.5,0.5,0.5,2,0.5)^{T}$, $\beta_3=(2,1,1.5,1.5,2)^{T}$, $\beta_4=(1,2,2,0.5,1)^{T}$ and $\beta_5=\beta_6=\dots=\beta_J=(0,0,0,0,0)^{T}$.
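Continuing the sketch above, the response of Example 4.1 could be generated as follows (coefficient values as printed; $\sigma$ is left to the user so a desired SNR can be matched):

```python
import numpy as np

def example_4_1_response(x_groups, sigma=1.0, rng=None):
    """Y = X_1'b_1 + ... + X_4'b_4 + sigma * eps with the Example 4.1 coefficients."""
    rng = np.random.default_rng(rng)
    betas = [np.array([1.5, 1.5, 1.0, 1.0, 0.5]),
             np.array([1.5, 0.5, 0.5, 2.0, 0.5]),
             np.array([2.0, 1.0, 1.5, 1.5, 2.0]),
             np.array([1.0, 2.0, 2.0, 0.5, 1.0])]  # all remaining groups have zero coefficients
    signal = sum(x_groups[k] @ betas[k] for k in range(4))
    return signal + sigma * rng.standard_normal(len(signal))
```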

Example 4.2

Similar to Example 4.1, the number of relevant groups is 8, with group size 3. We generate the response from the linear model $Y=\sum_{k=1}^{8}X_k^{T}\beta_k+\sigma\varepsilon$, where $\beta_1=(0.5,2,2)^{T}$, $\beta_2=(1,3,1)^{T}$, $\beta_3=2(1.5,0.5,2)^{T}$, $\beta_4=2(1,1.5,2)^{T}$, $\beta_5=(0.5,2,2)^{T}$, $\beta_6=(1,3,1)^{T}$, $\beta_7=2(1.5,0.5,2)^{T}$, $\beta_8=2(1,1.5,2)^{T}$ and $\beta_9=\beta_{10}=\dots=\beta_J=(0,0,0)^{T}$.

Example 4.3

In this example, the group sizes differ across groups: half of the groups have size 5 and the other half have size 3. The group variables are generated in the same way as in the previous examples. The response variable $Y$ is generated from $Y=\sum_{k=1}^{4}X_k^{T}\beta_k+\sigma\varepsilon$, with regression coefficients $\beta_1=(0.5,0.5,0.5,2,1)^{T}$, $\beta_2=(2,0,1,1.5,1)^{T}$, $\beta_3=(0.5,2,2)^{T}$, $\beta_4=(1,3,1)^{T}$, $\beta_5=\dots=\beta_{0.5(J-4)+4}=(0,0,0,0,0)^{T}$ and $\beta_{0.5(J-4)+5}=\dots=\beta_J=(0,0,0)^{T}$.

Example 4.4

In this example, the group sizes also differ across groups. This is a more difficult case than Example 4.3 because it has eight relevant groups with more varied regression coefficients. There are $0.5J$ groups of size 5 and the other groups have size 3. The group variables are generated in the same way as in Example 4.3. The response variable $Y$ is generated from $Y=\sum_{k=1}^{8}X_k^{T}\beta_k+\sigma\varepsilon$, with regression coefficients $\beta_1=(1.5,1.5,1,1,0.5)^{T}$, $\beta_2=(1.5,0.5,0.5,1.5,0.5)^{T}$, $\beta_3=(1.5,1,1,1,2)^{T}$, $\beta_4=(1,1.5,1.5,0.5,1)^{T}$, $\beta_5=(0.5,2,2)^{T}$, $\beta_6=(1,3,1)^{T}$, $\beta_7=2(0.5,2,2)^{T}$, $\beta_8=2(1,3,1)^{T}$, $\beta_9=\dots=\beta_{0.5(J-8)+8}=(0,0,0,0,0)^{T}$ and $\beta_{0.5(J-8)+9}=\dots=\beta_J=(0,0,0)^{T}$.

Detailed simulation results for Examples 4.1–4.4 are given in Tables 1–4, respectively, and boxplots of the average model size are presented in Figure 1. In all four examples, the relevant groups are almost always selected by the four methods other than DC-SIS-Group-Lasso, which misses groups most of the time. In terms of true positives (TP), the iterative version, DC-ISIS-Group-Lasso, performs well, but at the cost of an increased model size, which leads to a large FP and ME. On the other hand, the number of false-positive groups selected by the group Lasso is much larger than for the other four methods. Compared with the distance correlation-based screening methods, our approaches have better finite sample performance; this may be because those methods are built on a model-free framework, while our screening methods take full advantage of the linear model assumption. In particular, for the greedy modification g-ISIS-Group-Lasso, the size of the final selected model is much smaller than for the other methods in all examples, and consequently g-ISIS-Group-Lasso outperforms its competitors in terms of the percentage of correctly selected models. The simulations also show that the SNR has an important impact on the results. In Examples 4.2 and 4.4 there are eight relevant groups, whereas the other two examples have four, so screening the relevant groups in these two examples is more difficult. Thus, to achieve sure screening, the SNR in these two examples must be much larger than in the other two.

Figure 1. Boxplots of average model sizes for Example 4.1 under different group number.


Table 1. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.1.

Table 2. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.2.

Table 3. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.3.

Table 4. Simulation results of MEDIAN, TP, FP, CORRECT and ME for Example 4.4.

4.2. Real example

In this section, we compare ISIS-Group-Lasso, g-ISIS-Group-Lasso and Group-Lasso on the colon data (Alon et al., Citation1999). Alon et al. applied a two-way clustering method to analyse a data set consisting of the expression patterns of different cell types. For these data, we are interested in finding the genes that are related to colon tumours. In the original colon data, 62 samples from colon-cancer patients were analysed with an Affymetrix oligonucleotide Hum6000 array, and the data contain the expression of the 2000 genes with the highest minimal intensity across the 62 tissues, with the genes placed in order of descending minimal intensity. That is, the original data have 2000 numerical variables. For each continuous variable in the additive model, we use five B-spline basis functions to represent its effect, an approach used by Yang and Zou (Citation2015) for solving group-lasso penalised learning problems. Thus we obtain 10,000 predictors in 2000 groups after the basis function expansion. The three methods are used to select the relevant additive components.
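As an illustration of this basis-function expansion, the sketch below builds one B-spline block per gene with scikit-learn's SplineTransformer; the particular knot and degree settings are our own small-basis choice and are not guaranteed to reproduce the five basis functions of Yang and Zou (2015):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def expand_bspline_groups(X, n_knots=3, degree=3):
    """Expand each column of the n x 2000 gene matrix X into a group of B-spline
    basis columns, returning the list of per-gene blocks used as grouped predictors."""
    groups = []
    for j in range(X.shape[1]):
        st = SplineTransformer(n_knots=n_knots, degree=degree, include_bias=True)
        groups.append(st.fit_transform(X[:, [j]]))  # (n, n_basis) block for gene j
    return groups
```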

Before the analysis, all data are standardised so that each variable has zero mean and unit sample variance. To evaluate the performance of the three methods, we use cross-validation and compare the average model size (AMS) and the prediction mean squared error (PE). We randomly partition the data into a training set of 50 observations and a test set of 12 observations; that is, we conduct group screening on the 50 training observations and compute the PE on the 12 test observations. Detailed results based on 100 replications are presented in Table 5, and the boxplot of the average model size is shown in Figure 2. As Table 5 clearly shows, ISIS-Group-Lasso and g-ISIS-Group-Lasso select far fewer genes than Group-Lasso, while also achieving a smaller PE. In conclusion, the proposed iterative group screening approach is very useful in high-dimensional scientific studies: it can select a parsimonious model and reveal interesting relationships between group variables.

Figure 2. Boxplot of average model sizes for colon data analysis.


Table 5. Results of AMS and PE for colon data.

5. Concluding remarks

In this article, we have proposed a marginal group sure screening method for ultra-high-dimensional data. Unlike most of the existing literature, we deal with variables that are naturally grouped. Our group screening method respects the grouping structure in the data and is based on working independence. Theoretically, we establish the sure screening property of this group screening approach. To enhance the finite sample performance, a data-driven thresholding rule and an iterative procedure, ISIS-Group-Lasso, are developed; a greedy modification of the iterative procedure, g-ISIS-Group-Lasso, is also proposed to further reduce false positives. Simulation results show that these two methods perform well in terms of the five performance measures.

This article leaves open the problem of extending ISIS-Group-Lasso and g-ISIS-Group-Lasso from the linear model to generalised linear models and other parametric models. A model-free group screening approach may also be appealing for handling ultra-high-dimensional data more generally, since it avoids the difficult task of specifying the form of a statistical model. These problems are beyond the scope of this article and are interesting topics for future research.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This research was supported by the National Natural Science Foundation of China (11571112), the National Social Science Foundation Key Program (17ZDA091), the Natural Science Fund of the Education Department of Anhui Province (KJ2013B233) and the 111 Project of China (B14019).

Notes on contributors

Yong Niu

Yong Niu is a PhD candidate in the College of Statistics, East China Normal University, Shanghai, China. His research interests include high dimensional data, big data analytics and nonparametric statistics.

Riquan Zhang

Riquan Zhang is a professor and chair of School of Statistics in East China Normal University. His research interests include high dimensional data, big data analytics, functional data analysis, statistical machine learning and nonparametric statistics.

Jicai Liu

Jicai Liu is an associate professor of statistics in the department of mathematics at Shanghai Normal University, China. His research interests include high dimensional data, lifetime data analysis and nonparametric statistics.

Huapeng Li

Huapeng Li is an associate professor of statistics in the school of mathematics and statistics at Datong University, China. His research interests include nonparametric and semiparametric statistics based on empirical likelihood, selection biased data and finite mixture models.

References

  • Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745–6750. doi: 10.1073/pnas.96.12.6745
  • Bakin, S. (1999). Adaptive regression and model selection in data mining problems (Ph.D. thesis). Australian National University, Canberra.
  • Breheny, P., & Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and its Interface, 2(3), 369. doi: 10.4310/SII.2009.v2.n3.a10
  • Breheny, P., & Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25(2), 173–187. doi: 10.1007/s11222-013-9424-2
  • Fan, J., Feng, Y., & Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494), 544–557. doi: 10.1198/jasa.2011.tm09779
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 70(5), 849–911. doi: 10.1111/j.1467-9868.2008.00674.x
  • Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional variable selection: Beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.
  • Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38(6), 3567–3604. doi: 10.1214/10-AOS798
  • He, X., Wang, L., & Hong, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics, 41(1), 342–369. doi: 10.1214/13-AOS1087
  • Huang, J., Ma, S., Xie, H., & Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355. doi: 10.1093/biomet/asp020
  • Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499), 1129–1139. doi: 10.1080/01621459.2012.695654
  • Shao, X., & Zhang, J. (2014). Martingale difference correlation and its use in high-dimensional variable screening. Journal of the American Statistical Association, 109(507), 1302–1318. doi: 10.1080/01621459.2014.887012
  • Van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer.
  • Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. doi: 10.1198/jasa.2008.tm08516
  • Wei, F., & Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16(4), 1369–1384. doi: 10.3150/10-BEJ252
  • Yang, Y., & Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129–1141. doi: 10.1007/s11222-014-9498-5
  • Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 68(1), 49–67. doi: 10.1111/j.1467-9868.2005.00532.x
  • Zhang, C. H., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4), 1567–1594. doi: 10.1214/07-AOS520
  • Zhao, S. D., & Li, Y. (2012). Principled sure independence screening for Cox models with ultra-high-dimensional covariates. Journal of Multivariate Analysis, 105(1), 397–411. doi: 10.1016/j.jmva.2011.08.002
  • Zhao, P., Rocha, G., & Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A), 3468–3497. doi: 10.1214/07-AOS584
  • Zhong, W., & Zhu, L.-P. (2015). An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85(11), 2331–2345. doi: 10.1080/00949655.2014.928820

Appendix 1.

Three lemmas

Next, we state some lemmas which will be used in the proof of Theorems 3.1 and 3.2.

Lemma A.1

Under conditions (i)–(iii), for any $\delta>0$ and $j=1,2,\dots,J$, we have
$P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\ge\frac{\delta}{n}\Big)\le 4p_j\exp\{-\delta^2/(c_2 n+c_3\delta)\},$
where $c_2=\max(8M_0^2M_1^2,16M_2)$ and $c_3=\max\big(\frac{1}{3}M_0M_1,1\big)$.

Proof.

Using Bonferroni's inequality, we can easily show that
$P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\ge\frac{\delta}{n}\Big)\le P\Big(\sum_{k=1}^{p_j}\Big(\frac{1}{n}\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big)^2\ge\frac{\delta^2}{n^2}\Big)\le\sum_{k=1}^{p_j}P\Big(\Big|\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big|\ge\delta\Big).$
Thus it suffices to show that
$P\Big(\Big|\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big|\ge\delta\Big)\le 4\exp\{-\delta^2/(c_2 n+c_3\delta)\}$
for every $k=1,2,\dots,p_j$. Recall that the support of $X_{jk}$ is $[a,b]$ for $j=1,2,\dots,J$ and $k=1,2,\dots,p_j$, and denote $M_0=\max(|a|,|b|)$. Because $y_i=\sum_{l=1}^{J}x_{il}^{T}\beta_l+\varepsilon_i$, we can write
$x_{ijk}y_i-EX_{jk}Y=\Big(x_{ijk}\sum_{l=1}^{J}x_{il}^{T}\beta_l-E\Big[X_{jk}\sum_{l=1}^{J}X_l^{T}\beta_l\Big]\Big)+x_{ijk}\varepsilon_i\ \hat{=}\ S_{ijk1}+S_{ijk2}.$
Next we bound the tail probabilities of $|S_{ijk1}|$ and $|S_{ijk2}|$ respectively. By conditions (ii)–(iii), it is easy to see that
$|S_{ijk1}|\le M_0M_1,\qquad \operatorname{Var}(S_{ijk1})\le M_0^2M_1^2,\qquad E|S_{ijk2}|^m\le E\big[M_0^m|\varepsilon_i|^m\big]\le m!\,E\exp\{M_0|\varepsilon_i|\}\le M_2\,m!\quad(m\ge 2).$
Using Bernstein's inequality (Van der Vaart & Wellner, Citation1996, Lemmas 2.2.9 and 2.2.11), we conclude that
(A1) $P\Big(\Big|\sum_{i=1}^{n}S_{ijk1}\Big|>\frac{\delta}{2}\Big)\le 2\exp\Big\{-\frac{\delta^2}{8}\,\frac{1}{nM_0^2M_1^2+M_0M_1\delta/6}\Big\},$
(A2) $P\Big(\Big|\sum_{i=1}^{n}S_{ijk2}\Big|>\frac{\delta}{2}\Big)\le 2\exp\Big\{-\frac{\delta^2}{8}\,\frac{1}{2nM_2+\delta/2}\Big\}.$
Combining (A1) and (A2) with $c_2=\max(8M_0^2M_1^2,16M_2)$ and $c_3=\max\big(\frac{1}{3}M_0M_1,1\big)$, we obtain
$P\Big(\Big|\sum_{i=1}^{n}(x_{ijk}y_i-EX_{jk}Y)\Big|\ge\delta\Big)\le 4\exp\{-\delta^2/(c_2 n+c_3\delta)\}.$
This concludes the proof of the lemma.

Lemma A.2

Under conditions (ii)–(iv), for any $\delta>0$ and $j=1,2,\dots,J$, we have
$P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|\ge\frac{\delta}{n}\Big)\le 2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\},$
where $c_4=2M_0^4$ and $c_5=\frac{4M_0^2}{3}$.

Proof.

For $s,t=1,2,\dots,p_j$, let $T_j=\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}$ and let $T_j(s,t)$ denote its $(s,t)$ entry, so that $T_j(s,t)=\frac{1}{n}\sum_{i=1}^{n}(x_{ijs}x_{ijt}-EX_{js}X_{jt})$.

By the fact that $\|A\|\le p\|A\|_\infty$, we have
(A3) $P\Big(\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|\ge\frac{\delta}{n}\Big)\le P\Big(\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|_\infty\ge\frac{\delta}{n}\Big)\le\sum_{s=1}^{p_j}\sum_{t=1}^{p_j}P\Big(|T_j(s,t)|\ge\frac{\delta}{n}\Big).$
Next we again use Bernstein's inequality to bound the tail probability of $T_j(s,t)$. By conditions (ii)–(iii), we easily obtain $|x_{ijs}x_{ijt}-EX_{js}X_{jt}|\le 2M_0^2$ and $\operatorname{Var}(x_{ijs}x_{ijt})\le M_0^4$. Using Bernstein's inequality, it follows that
(A4) $P\Big(|T_j(s,t)|\ge\frac{\delta}{n}\Big)\le 2\exp\Big\{-\frac{1}{2}\,\frac{\delta^2}{nM_0^4+2M_0^2\delta/3}\Big\}.$
Thus the desired result follows from (A3) and (A4) by taking $c_4=2M_0^4$ and $c_5=\frac{4M_0^2}{3}$.

Remark A.1

If $A$ and $B$ are two symmetric matrices of order $p$, we have the following two results (Fan, Feng, & Song, Citation2011; He et al., Citation2013):
$|\lambda_{\min}(A)-\lambda_{\min}(B)|\le\max\{|\lambda_{\min}(A-B)|,|\lambda_{\min}(B-A)|\},\qquad |\lambda_{\max}(A)-\lambda_{\max}(B)|\le\max\{|\lambda_{\max}(A-B)|,|\lambda_{\max}(B-A)|\}.$
In addition, note that $\max\{|\lambda_{\min}(A-B)|,|\lambda_{\max}(A-B)|\}\le p\|A-B\|_\infty$. These results, together with Lemma A.2, imply that
(A5) $P\Big(\Big|\lambda_{\min}\Big(\frac{1}{n}x_j^{T}x_j\Big)-\lambda_{\min}(EX_jX_j^{T})\Big|\ge\frac{p_j}{n}\delta\Big)\le 2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\},$
(A6) $P\Big(\Big|\lambda_{\max}\Big(\frac{1}{n}x_j^{T}x_j\Big)-\lambda_{\max}(EX_jX_j^{T})\Big|\ge\frac{p_j}{n}\delta\Big)\le 2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\}$
for $j=1,2,\dots,J$.

Lemma A.3

Suppose conditions (ii)–(iv) hold. Then there exist positive constants $\tau_3$ and $\tau_4$ such that
(A7) $P\Big(\tau_3\le\lambda_{\min}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\lambda_{\max}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\tau_4\Big)\ge 1-2p_j^2\exp\Big\{-\frac{\delta^2}{c_4 n+c_5\delta}\Big\}.$
That is, with probability approaching 1, we have $0<\tau_3\le\lambda_{\min}\big(\frac{1}{n}x_j^{T}x_j\big)\le\lambda_{\max}\big(\frac{1}{n}x_j^{T}x_j\big)\le\tau_4<\infty$.

Proof.

Combining condition (iv) with (A5)–(A6), it is easy to obtain (A7).

Appendix 2.

Proof of Theorem 3.1

Proof of Theorem 3.1.

The key idea of the proof is to establish the uniform consistency of $\|\hat{\upsilon}_{nj}\|_n^2$ under conditions (ii)–(iv); as in the existing literature, the sure screening property is typically established in this way. Recall that
$\|\hat{\upsilon}_{nj}\|_n^2=\Big(\frac{1}{n}x_j^{T}y\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(\frac{1}{n}x_j^{T}y\Big)\quad\text{and}\quad\|\upsilon_j\|^2=(EX_jY)^{T}(EX_jX_j^{T})^{-1}(EX_jY).$
Thus we need to evaluate
$\frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2-\frac{1}{p_j}\|\upsilon_j\|^2=\frac{1}{p_j}\Big[\Big(\frac{1}{n}x_j^{T}y\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(\frac{1}{n}x_j^{T}y\Big)-(EX_jY)^{T}(EX_jX_j^{T})^{-1}(EX_jY)\Big].$
By some algebra, we decompose the difference into three parts, $\|\hat{\upsilon}_{nj}\|_n^2-\|\upsilon_j\|^2=\lambda_1+\lambda_2+\lambda_3$, in which
$\lambda_1=\Big(\frac{1}{n}x_j^{T}y-EX_jY\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(\frac{1}{n}x_j^{T}y-EX_jY\Big),$
$\lambda_2=2\Big(\frac{1}{n}x_j^{T}y-EX_jY\Big)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}EX_jY,$
$\lambda_3=(EX_jY)^{T}\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big(EX_jX_j^{T}-\frac{1}{n}x_j^{T}x_j\Big)(EX_jX_j^{T})^{-1}EX_jY.$
Now define the event $\Omega_\delta$ on which
$\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\le\frac{\delta}{n},\qquad\frac{1}{p_j}\Big\|\frac{1}{n}x_j^{T}x_j-EX_jX_j^{T}\Big\|\le\frac{\delta}{n},\qquad\tau_3\le\lambda_{\min}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\lambda_{\max}\Big(\frac{1}{n}x_j^{T}x_j\Big)\le\tau_4$
for $j=1,2,\dots,J$.

Then the above three lemmas indicate that
(A8) $P(\Omega_\delta)\ge 1-4\sum_{j=1}^{J}p_j\exp\{-\delta^2/(c_2 n+c_3\delta)\}-4\sum_{j=1}^{J}p_j^2\exp\{-\delta^2/(c_4 n+c_5\delta)\}.$
By the fact that $\|AB\|\le\|A\|\,\|B\|$, we have on $\Omega_\delta$
$\frac{1}{p_j}|\lambda_1|\le\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|^2\Big\|\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big\|\le\frac{\delta^2}{n^2}\,\frac{1}{\tau_3},$
$\frac{1}{p_j}|\lambda_2|\le 2\Big\|\frac{1}{n}x_j^{T}y-EX_jY\Big\|\,\Big\|\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big\|\,\|EX_jY\|\le\frac{\delta}{n}\,\frac{2M_0M_1}{\tau_3},$
$\frac{1}{p_j}|\lambda_3|\le\Big\|\Big(\frac{1}{n}x_j^{T}x_j\Big)^{-1}\Big\|\,\big\|(EX_jX_j^{T})^{-1}\big\|\,\Big\|EX_jX_j^{T}-\frac{1}{n}x_j^{T}x_j\Big\|\,\|EX_jY\|^2\le\frac{\delta}{n}\,\frac{M_0M_1}{\tau_1\tau_3}.$
Take $\delta=c_6 n^{1-\kappa}$; then there exists a constant $c_7$ such that
$\frac{\delta^2}{n^2}\,\frac{1}{\tau_3}+\frac{\delta}{n}\,\frac{2M_0M_1}{\tau_3}+\frac{\delta}{n}\,\frac{M_0M_1}{\tau_1\tau_3}\le c_6c_7 n^{-\kappa}.$
Choosing $c_6$ such that $c_6c_7\le c/2$, we easily obtain
$\Big|\frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2-\frac{1}{p_j}\|\upsilon_j\|^2\Big|\le\frac{c}{2}\,n^{-\kappa}.$
Invoking condition (i), we have on $\Omega_\delta$, for sufficiently large $n$,
$\frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge\frac{c}{2}\,n^{-\kappa}\quad\text{for every } j\in\mathcal{M}_{*}.$
If we choose $\pi_n\le\frac{c}{2}\,n^{-\kappa}$, it follows that $\mathcal{M}_{*}\subset\hat{\mathcal{M}}_{\kappa}$ on $\Omega_\delta$. This, together with (A8), indicates that there exists a constant $c_1$ such that
$P\{\mathcal{M}_{*}\subset\hat{\mathcal{M}}_{\kappa}\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$

Appendix 3.

Proof of Theorem 3.2

Proof of Theorem 3.2.

Following arguments similar to those in the proof of Theorem 3.1, we have on $\Omega_\delta$
$\Big|\Big\{1\le j\le J:\ \frac{1}{p_j}\|\hat{\upsilon}_{nj}\|_n^2\ge 2c\,n^{-\kappa}\Big\}\Big|\le\Big|\Big\{1\le j\le J:\ \frac{1}{p_j}\|\upsilon_j\|^2\ge c\,n^{-\kappa}\Big\}\Big|,$
where $|\cdot|$ denotes the cardinality of a set. This implies that
$\sum_{j\in\hat{\mathcal{M}}_{\kappa}}\frac{1}{p_j}\|\upsilon_j\|^2\ge c\,n^{-\kappa}|\hat{\mathcal{M}}_{\kappa}|.$
By some algebra, it follows that
$|\hat{\mathcal{M}}_{\kappa}|\le O\Big(n^{\kappa}\sum_{j=1}^{J}\|EX_jY\|^2\Big)=O\big(n^{\kappa}\|EXY\|^2\big).$
That is, we have
(A9) $P\big\{|\hat{\mathcal{M}}_{\kappa}|\le O\big(n^{\kappa}\|EXY\|^2\big)\big\}\ge 1-4\sum_{j=1}^{J}(p_j+p_j^2)\exp\{-c_1 n^{1-2\kappa}\}.$
Thus the key point is to show that $\|EXY\|^2=O(\lambda_{\max}(\Sigma))$. For this purpose, we consider the following linear regression:
$\min_{\alpha}E(Y-X^{T}\alpha)^2$

with respect to $\alpha\in\mathbb{R}^{\sum_{j=1}^{J}p_j}$. By least squares, we obtain
$\|EXY\|^2=\hat{\alpha}^{T}[E(XX^{T})]^2\hat{\alpha}\le\lambda_{\max}(\Sigma)\,\hat{\alpha}^{T}E(XX^{T})\hat{\alpha},$
in which $\hat{\alpha}$ is the least-squares solution. On the other hand, the orthogonal decomposition of least squares implies that $\operatorname{Var}(Y)=\operatorname{Var}(X^{T}\hat{\alpha})+\operatorname{Var}(Y-X^{T}\hat{\alpha})$. Because $\operatorname{Var}(Y)=O(1)$, we conclude that
(A10) $\|EXY\|^2\le\lambda_{\max}(\Sigma)\,O(1).$
Combining (A9) and (A10), the desired result follows.
