
Variable screening in multivariate linear regression with high-dimensional covariates

Pages 241-253 | Received 01 Jan 2021, Accepted 09 Sep 2021, Published online: 06 Oct 2021

Abstract

We propose two variable selection methods for multivariate linear regression with high-dimensional covariates. The first method uses a multiple correlation coefficient to rapidly reduce the dimension of the relevant predictors to a moderate or low level. The second method extends the univariate forward regression of Wang [(2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516] in a unified way so that variable selection and model estimation are obtained simultaneously. We establish the sure screening property for both methods. Simulation and real data applications are presented to show the finite sample performance of the proposed methods in comparison with naive methods.

1. Introduction

High-dimensional multivariate regression has been widely applied in bioinformatics, chemometrics, and medical image analysis, where many of the response variables are highly correlated (Cai et al., Citation2013; Ferte et al., Citation2013; Jia et al., Citation2017; Peng et al., Citation2010; Smith & Fahrmeir, Citation2007). For instance, in genetic studies we are interested in the association between correlated phenotypes (involved in biological pathways) and genotypes, as genetic effects and their possible interactions have been recognized as an important component of the genetic architecture of each complex phenotype (Yi, Citation2010). For this kind of problem, the number of covariates or explanatory variables is much larger than the number of observations or samples. Traditional methods of subset selection and stepwise procedures become infeasible when confronted with such high dimensionality (Breiman, Citation1995).

Statistical methods and theories have been developed to address this problem through various approaches such as network-based regularization (C. Li & Li, Citation2008; Ren et al., Citation2017, Citation2019), graphical models (B. Li, Chun et al., Citation2012; Yin & Li, Citation2011), correlation-based screening (B. Li, Chun et al., Citation2012; Song et al., Citation2016) and the group lasso (Y. Li et al., Citation2015; J. Wang et al., Citation2019; Yang & Zou, Citation2015).

Variable selection methods for regression models with a univariate response have been proposed in the past. Some popular methods include the bridge regression (Frank & Friedman, Citation1993; Fu, Citation1998), LASSO (Tibshirani, Citation1996), SCAD (Fan & Li, Citation2001), LARS (Efron et al., Citation2004), elastic net (Zou & Hastie, Citation2005), adaptive LASSO (H. H. Zhang & Lu, Citation2007; Zou, Citation2006), and Dantzig selector (Candes & Tao, Citation2007; Y. Kong et al., Citation2016), among others. On the other hand, variable screening procedures have been developed to reduce the dimensionality from an ultrahigh dimension to a lower dimension which is smaller than the sample size (Fan & Lv, Citation2008; X. Kong et al., Citation2017; G. Li, Peng et al., Citation2012; H. Wang, Citation2009; Zhu et al., Citation2011).

For variable selection under multivariate regression models, one simple approach is to apply a variable selection method to the univariate regression of each response separately. Such an approach may produce sub-optimal results since it does not utilize the joint information among the responses (Breiman & Friedman, Citation1997; Kim et al., Citation2009). To improve the estimation, various attempts have been made. One approach is to use dimension reduction techniques such as reduced rank regression (Chen & Huang, Citation2012; He et al., Citation2018; Zhao et al., Citation2017) and sliced inverse regression (Setdji & Cook, Citation2004; N. Zhang et al., Citation2019). Another approach is to use a block-structured regularization method to select a subset of covariates that serve as predictors for all outcome variables (Obozinski et al., Citation2011; Peng et al., Citation2010; Turlach et al., Citation2005). The latter approach assumes that a covariate affects either all or none of the responses. However, this assumption may be too strong when each response variable is affected by a different set of predictors. Rothman et al. (Citation2010) proposed a penalized framework to estimate the multivariate regression coefficients and the covariance matrix simultaneously under an $\ell_1$ penalty. Lee and Liu (Citation2012) further improved Rothman et al.'s (Citation2010) work by using a weighted $\ell_1$ regularization. Cai et al. (Citation2013) proposed a method that first estimates the regression coefficients in a column-wise fashion with the Dantzig selector and then estimates the precision matrix by solving a constrained $\ell_1$ minimization problem.

In the high-dimensional setting, most of the aforementioned multivariate regression methods use regularization to estimate the regression coefficient matrix (Obozinski et al., Citation2011; Peng et al., Citation2010; Turlach et al., Citation2005). However, a well-chosen penalty requires an efficient exploration of the correlation structure of the responses. It has been reported that simultaneously estimating the covariance and selecting variables via joint optimization can be numerically unstable in high-dimensional cases (Deshpande et al., Citation2019; Pecanka et al., Citation2019; Ren et al., Citation2019).

In this study, we propose two methods in parallel for variable screening and variable selection, namely the multiple correlation coefficient (MCC) screening (Section 3) and the unified forward regression (UFR) (Section 4). The first method is for dimension reduction: it filters out covariates that have weak correlation with the response variables and reduces the feature space to a moderate or low dimension that covers the set of relevant predictors with probability tending to one. The second method is for variable selection: it uses an extended forward regression (FR) (H. Wang, Citation2009) to identify all relevant predictors consistently under mild conditions. MCC screens in all relevant predictors, whereas UFR delivers variable selection and model estimation simultaneously. We illustrate the finite sample performance of the proposed methods in comparison with naive methods by simulation (Section 5) and a real data application (Section 6). We conclude the paper in Section 7 and defer the technical proofs to the Appendix.

2. Notation and assumptions

Let $y=(y_1,y_2,\ldots,y_q)^\top$ denote the $q$-dimensional response vector of interest and let $x=(x_1,x_2,\ldots,x_p)^\top$ denote the $p$-dimensional vector of covariates or predictors. Denote the covariance matrices of $y$ and $x$ by $\Sigma_y$ and $\Sigma_x=(\sigma_{ij})$, respectively. Without loss of generality, assume that $E(x_k)=0$ and $\operatorname{var}(x_k)=1$ for $k=1,\ldots,p$ and that $E(y_j)=0$ for $j=1,\ldots,q$. In practice, these can be achieved by centring and standardization.

Consider the multivariate linear regression model (1) $y=B^\top x+\epsilon$, where $B$ is a $p\times q$ matrix of coefficients and $\epsilon$ is a random error vector independent of $x$. For $j=1,\ldots,q$ and $k=1,\ldots,p$, denote by $\beta_j$ the $j$th column vector of $B$ and by $\beta_{(k)}$ the $k$th row vector of $B$. If $\beta_{(k)}\neq 0$, $x_k$ is referred to as a relevant predictor.

Let $F=\{1,\ldots,p\}$ denote the full model of predictors and let $S=\{k:\beta_{(k)}\neq 0\}$ denote the true model. Denote the complement of $S$ by $S^c$, and the cardinalities of $F$ and $S$ by $|F|=p$ and $|S|=p_0$, respectively. Throughout, let $\|\cdot\|$ denote the Euclidean norm of a vector.

Let $\{(y_i,x_i): i=1,\ldots,n\}$ denote independent and identically distributed samples of $(y,x)$. Denote $X_{n\times p}=(x_1,\ldots,x_n)^\top$ and $Y_{n\times q}=(y_1,\ldots,y_n)^\top$. For $j=1,\ldots,q$, let $y_{(j)}$ denote the $j$th column of $Y$.

Assume that $x$ is high dimensional with $p$ being much larger than the sample size $n$ (in the sense of Cai & Lv, Citation2007). Assume that the response vector is associated with only a small portion of the predictors, i.e., $p_0/p$ is small and $p_0$ is $O(n)$ (Fan & Lv, Citation2008). This sparsity principle is frequently adopted and deemed useful in analysis.

3. Multiple correlation coefficient

We first propose to use a multiple correlation coefficient (MCC) to identify $S$. The multiple correlation coefficient between $y$ and $x_k$ is defined as $\rho_k=\max_{\alpha\in\mathbb{R}^q}\operatorname{corr}(\alpha^\top y,x_k)$ and its square can be further expressed as (2) $\rho_k^2=E(\gamma_k^\top y\,x_k)$, where $\gamma_k=\Sigma_y^{-1}E(yx_k)$ (Anderson, Citation2003, Section 12.2).

Given the standardized samples, we estimate $\rho_k^2$ by (3) $\hat\rho_k^2=\frac1n\sum_{i=1}^{n}\hat\gamma_k^\top y_i x_{ik}$, where $\hat\gamma_k=\big(\frac1n\sum_{i=1}^{n}y_i y_i^\top\big)^{-1}\frac1n\sum_{i=1}^{n}y_i x_{ik}$. Note that the computation of $\hat\rho_k^2$ is simple and fast through matrix algebra and does not involve any iteration. We then estimate $S$ by $\widehat S_{\mathrm{MCC}}=\{k:\hat\rho_k^2\ge\tau\}$, where $\tau$ is a threshold that determines the number of retained predictors. Here we adopt the threshold of Fan and Lv (Citation2008) by choosing $\tau=\hat\rho_{(p-d_n+1)}^2$, where $\hat\rho_{(1)}^2\le\cdots\le\hat\rho_{(p)}^2$ are the order statistics and $d_n=\lceil n/\log(n)\rceil$ ($\lceil\cdot\rceil$ is the ceiling function), so that the $d_n$ predictors with the largest values of $\hat\rho_k^2$ are retained. The naive correlation coefficient (NCC) method of Fan and Lv (Citation2008) estimates $S$ by $\widehat S_{\mathrm{NCC}}=\bigcup_{j=1}^{q}\{k:\hat\rho_{k,j}^2\ge\tau_j\}$, where $\hat\rho_{k,j}$ is the sample correlation coefficient between $y_j$ and $x_k$ and $\tau_j$ is determined in the same way as in MCC with respect to the $j$th response.
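For illustration, the following minimal Python sketch (our own, not the authors' code) carries out MCC screening on centred and standardized data matrices X (n by p) and Y (n by q); the function name mcc_screen is a placeholder of ours.

```python
import numpy as np

def mcc_screen(X, Y):
    """Minimal sketch of MCC screening: rank covariates by the squared
    multiple correlation between x_k and the response vector y, then
    keep the d_n = ceil(n / log(n)) covariates with the largest values."""
    n, p = X.shape
    Syy = (Y.T @ Y) / n                  # estimate of Sigma_y (data assumed centred)
    Syy_inv = np.linalg.inv(Syy)
    rho2 = np.empty(p)
    for k in range(p):
        syx = (Y.T @ X[:, k]) / n        # estimate of E(y x_k)
        rho2[k] = syx @ Syy_inv @ syx    # estimated squared multiple correlation
    d_n = int(np.ceil(n / np.log(n)))    # hard threshold of Fan and Lv (2008)
    keep = np.argsort(rho2)[::-1][:d_n]  # indices of the d_n largest rho^2_k
    return np.sort(keep), rho2
```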

We now show that the MCC-based screening procedure has the sure screening property (i.e., the probability of selecting all true relevant predictors tends to one) and reduces the dimensionality of predictors below the sample size.

We state some assumptions first.

Assumption 3.1

Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues of a positive definite matrix $A$, respectively. Assume that there exist two positive constants $\tau_{\min}<\tau_{\max}$ such that $2\tau_{\min}<\lambda_{\min}(\Sigma_y^{-1})\le\lambda_{\max}(\Sigma_y^{-1})<2^{-1}\tau_{\max}$ and $2\tau_{\min}<\lambda_{\min}(\Sigma_x)\le\lambda_{\max}(\Sigma_x)<2^{-1}\tau_{\max}$.

Assumption 3.2

Assume that (i) for $j=1,\ldots,q$, $\|\beta_j\|\le C_B$ for some positive constant $C_B$, and that (ii) $\beta_{\min}=\min_{k\in S}\min_{j}|\beta_{kj}|\ge\nu_B n^{-\xi_{\min}}$ for some positive constants $\xi_{\min}$ and $\nu_B$.

Assumption 3.3

Assume that there exist positive constants $0<\eta<4^{-1}$ and $K$ such that (i) $n^{-1}\log(pq)\le\eta$, and (ii) $E(e^{tx_k^2})\le K$ for $|t|\le\eta$ and all $k=1,\ldots,p$.

Assumption 3.1 requires the eigenvalues of $\Sigma_y^{-1}$ and $\Sigma_x$ to be bounded away from zero and infinity, so that the design is well behaved. Assumption 3.2 requires that the smallest nonzero regression coefficient not converge to zero too fast; otherwise it cannot be consistently identified (see Fan & Peng, Citation2004 for more discussion). Assumption 3.3 ensures an exponential convergence rate for arbitrary-order moments of $x$ and $\epsilon$ (Cai et al., Citation2011), which is superior to the polynomial-type counterpart (Ravikumar et al., Citation2010).

Theorem 3.1

Under Assumptions 3.1–3.3, if $\rho_k^2\ge\tau$ for all $k\in S$, then $P(S\subseteq\widehat S_{\mathrm{MCC}})\to1$ as $n\to\infty$.

Theorem 3.1 reveals that for a properly chosen threshold τ, the probability that MCC detects all relevant predictors tends to one.

4. Unified forward regression

In this section, we propose a unified forward regression (UFR) for variable selection. It extends Wang's (Citation2009) forward regression method to the multivariate response case.

Let $M=\{k_1,\ldots,k_t\}$ denote a generic subset of $F$ with $|M|=t$. Denote $x_{(M)}=(x_{k_1},\ldots,x_{k_t})^\top$ and let $X_{(M)}=(x_{1(M)},\ldots,x_{n(M)})^\top$ denote the submatrix of $X$ corresponding to $M$. We first describe a naive forward regression (NFR) method that combines the variables selected by applying Wang's (Citation2009) forward regression separately to the univariate regression of each response. The procedure is summarized as follows. Initially, set $S_{(j)}^{(0)}=\emptyset$ for $j=1,\ldots,q$. Perform forward regression with respect to the $j$th response by iterating the following two steps for $\ell=1,\ldots,n$.

  1. For every $k\in F\setminus S_{(j)}^{(\ell-1)}$, let $M_{kj}^{(\ell-1)}=S_{(j)}^{(\ell-1)}\cup\{k\}$. Compute the residual sum of squares $\mathrm{RSS}_{kj}^{(\ell-1)}=y_{(j)}^\top(I_n-\tilde H_{kj}^{(\ell-1)})y_{(j)}$, where $\tilde H_{kj}^{(\ell-1)}=X_{(M_{kj}^{(\ell-1)})}\big(X_{(M_{kj}^{(\ell-1)})}^\top X_{(M_{kj}^{(\ell-1)})}\big)^{-1}X_{(M_{kj}^{(\ell-1)})}^\top$. Let $a_{j\ell}=\arg\min_{k\in F\setminus S_{(j)}^{(\ell-1)}}\mathrm{RSS}_{kj}^{(\ell-1)}$.

  2. Update $S_{(j)}^{(\ell)}=S_{(j)}^{(\ell-1)}\cup\{a_{j\ell}\}$.

The solution path of NFR is obtained as $\{S_{\mathrm{NFR}}^{(\ell)}=\bigcup_{j=1}^{q}S_{(j)}^{(\ell)}:\ell=1,\ldots,n\}$.

Next, we propose the unified forward regression, which selects predictors by a modified forward regression algorithm that makes use of all response variables simultaneously. The procedure is modified from the previous one as follows. Initially, set $S^{(0)}=\emptyset$. Perform the modified forward regression by iterating the following two steps for $\ell=1,\ldots,n$.

  1. For every $k\in F\setminus S^{(\ell-1)}$, let $M_k^{(\ell-1)}=S^{(\ell-1)}\cup\{k\}$. Compute the residual sum of squares $\mathrm{RSS}_k^{(\ell-1)}=\operatorname{tr}\{Y^\top(I_n-\tilde H_k^{(\ell-1)})Y\}$, where $\tilde H_k^{(\ell-1)}=X_{(M_k^{(\ell-1)})}\big(X_{(M_k^{(\ell-1)})}^\top X_{(M_k^{(\ell-1)})}\big)^{-1}X_{(M_k^{(\ell-1)})}^\top$. Let $a_\ell=\arg\min_{k\in F\setminus S^{(\ell-1)}}\mathrm{RSS}_k^{(\ell-1)}$.

  2. Update $S^{(\ell)}=S^{(\ell-1)}\cup\{a_\ell\}$.

The solution path of UFR is obtained as $\{S_{\mathrm{UFR}}^{(\ell)}=S^{(\ell)}:\ell=1,\ldots,n\}$. Notice that both NFR and UFR terminate automatically after $n$ iterations. The UFR algorithm makes use of all response variables simultaneously through the trace operator, and its computational cost is roughly one $q$th of that of NFR.
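For illustration, a minimal Python sketch of the UFR iteration is given below; the function name ufr_path and the use of np.linalg.lstsq for the least squares fits are our own choices, not part of the paper. Restricting Y to a single column gives the univariate forward regression that NFR applies to each response separately.

```python
import numpy as np

def ufr_path(X, Y, n_steps=None):
    """Minimal sketch of unified forward regression (UFR).
    At each step, add the covariate that most reduces the trace of the
    residual sum of squares tr{Y'(I - H_M)Y} over the candidate set M."""
    n, p = X.shape
    n_steps = n if n_steps is None else min(n_steps, n, p)
    selected, path = [], []
    for _ in range(n_steps):
        best_k, best_rss = None, np.inf
        for k in range(p):
            if k in selected:
                continue
            XM = X[:, selected + [k]]
            # least squares fit of Y on the candidate model M = selected + {k}
            coef, *_ = np.linalg.lstsq(XM, Y, rcond=None)
            resid = Y - XM @ coef
            rss = np.trace(resid.T @ resid)
            if rss < best_rss:
                best_k, best_rss = k, rss
        selected = selected + [best_k]
        path.append(list(selected))   # S^(l) for l = 1, 2, ...
    return path
```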

We now show that the proposed UFR method also possesses the sure screening property. To facilitate the development of the theory, we add a few more assumptions.

Assumption 4.1

Assume that (i) $x$ follows an elliptically contoured distribution whose density admits the form $|\Sigma_x|^{-1/2}g\{(x-\mu)^\top\Sigma_x^{-1}(x-\mu)\}$ with $\mu=E(x)$ and $g(\cdot)>0$, denoted by $EC(\mu,\Sigma_x,g)$, and that (ii) the distribution of $\epsilon$ is normal.

Assumption 4.2

There exist positive constants $\xi$, $\xi_0$ and $\nu$ such that (i) $\log(p)\le\nu n^{\xi}$, (ii) $p_0\le\nu n^{\xi_0}$, and (iii) $\xi+6\xi_0+12\xi_{\min}<1$.

Assumption 4.3

The row vectors of $B$, i.e., $\beta_{(k)}$, $k=1,\ldots,p$, have the same 'all-or-nothing' structure; that is, the entries of $\beta_{(k)}$ are either all zero or all nonzero.

Usually, a normality assumption on $x$ is imposed to facilitate the theoretical development (Fan & Lv, Citation2008; H. Wang, Citation2009). In Assumption 4.1, we relax it to an elliptically contoured distribution and show in the Appendix that it suffices to obtain Lemma 1 of H. Wang (Citation2009). Assumption 4.1, together with Assumption 3.1, ensures the sparse Riesz condition (C. Zhang & Huang, Citation2008) needed to derive some key inequalities in the proof of Theorem 4.1. Assumption 4.2 is popular in the literature on ultra-high dimensional inference (Fan & Lv, Citation2008; H. Wang, Citation2009); it allows the dimension of the covariates to diverge to infinity at an exponential rate (Fan & Lv, Citation2008). Assumption 4.3 implies that all responses are associated with the same covariates (Turlach et al., Citation2005). It warrants the row-wise selection of $B$ by UFR, in contrast to the element-wise selection by NFR, which enables UFR to reach the sure screening property in fewer steps than NFR.

Define $K=2\tau_{\max}\nu C_B^{2}\tau_{\min}^{-2}\nu_B^{-4}$, where the factors are defined in Assumptions 3.1, 3.2 and 4.2. Applying Theorem 1 of H. Wang (Citation2009), we readily get $P\big(S\subseteq S_{\mathrm{NFR}}^{(qK\nu n^{2\xi_0+4\xi_{\min}})}\big)\to1$, i.e., NFR selects all relevant predictors with high probability after $qK\nu n^{2\xi_0+4\xi_{\min}}$ steps in the multivariate regression setting. The following theorem shows that UFR can do the job in one $q$th as many steps.

Theorem 4.1

Under Assumptions 3.1–4.3, $P\big(S\subseteq S_{\mathrm{UFR}}^{(K\nu n^{2\xi_0+4\xi_{\min}})}\big)\to1$ as $n\to\infty$.

We adopt the BIC criterion to select the best subset of variables from a solution path (Liang et al., Citation2012; H. Wang, Citation2009). Let (4) $\mathrm{BIC}(M)=\log\big[n^{-1}\operatorname{tr}\{Y^\top(I_n-H_{(M)})Y\}\big]+n^{-1}|M|(\log n+2\log p)$, where $H_{(M)}=X_{(M)}\{X_{(M)}^\top X_{(M)}\}^{-1}X_{(M)}^\top$. We then choose the subset of variables on the solution path that minimizes $\mathrm{BIC}(M)$. The selection consistency was shown by H. Wang (Citation2009) and Sofer et al. (Citation2014).
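The sketch below shows how criterion (4) could be evaluated along a solution path; bic and select_by_bic are illustrative names of ours, not the authors' implementation.

```python
import numpy as np

def bic(X, Y, model):
    """Sketch of criterion (4):
    BIC(M) = log[ tr{Y'(I - H_M)Y} / n ] + |M| (log n + 2 log p) / n."""
    n, p = X.shape
    model = list(model)
    if len(model) == 0:
        rss = np.trace(Y.T @ Y)
    else:
        XM = X[:, model]
        coef, *_ = np.linalg.lstsq(XM, Y, rcond=None)
        resid = Y - XM @ coef
        rss = np.trace(resid.T @ resid)
    return np.log(rss / n) + len(model) * (np.log(n) + 2 * np.log(p)) / n

def select_by_bic(X, Y, path):
    """Return the model on the solution path with the smallest BIC."""
    return min(path, key=lambda model: bic(X, Y, model))
```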

Note that the UFR method is consistent if $p=O(n^{\alpha})$ for some $\alpha>0$, while the MCC method works with $\log(p)=O(n^{\alpha})$. In this sense, the MCC method can handle higher-dimensional variable screening than the UFR method. Secondly, the UFR method is computationally more expensive than the MCC method, as the former involves $n-1$ additional rounds of matrix inversions compared with the latter. On the other hand, when $p$ and $n$ are of the same order, both MCC and UFR perform well in terms of coverage probability, and UFR performs better in yielding a parsimonious model with high specificity in terms of model size and correct fit (defined later), as seen in the simulations.

5. Simulation

We conduct numerical studies to investigate the finite sample performance of the proposed methods, i.e., MCC and UFR, in comparison with the naive correlation coefficient (NCC) method and the naive forward regression (NFR).

5.1. Models

Consider five models for generating the $p$-dimensional covariates $x$ in Table 1, which are adopted from Examples 1 and 2 of Fan and Lv (Citation2008), Example 1 of Tibshirani (Citation1996), and Examples 4 and 5 of H. Wang (Citation2009), respectively.

Table 1. Five models.

For Models 1 to 3, $x$ follows a multivariate normal distribution with zero mean vector and covariance matrix $\Sigma_x$ of identity, autoregressive and compound symmetry structure, respectively. In Model 4, $x$ is generated by $x_r=(z_r+w_r)/2$ for $r=1,\ldots,p_0$ and $x_r=(z_r+\sum_{s=1}^{p_0}z_s)/2$ for $r=p_0+1,\ldots,p$, where the $z_r$ and $w_r$ are independent standard normal variables (H. Wang, Citation2009). Note that Model 4 is a challenging case, as the correlation coefficients between the relevant predictors and the response variables are much smaller than those between the irrelevant predictors and the response variables. The details of $\Sigma_x$ are provided in Table 1. In Model 5, we generate the components of both $x$ and $\epsilon$ independently as $e-1$, where $e$ is an exponential random variable with parameter one. This model is used to examine the robustness of the proposed methods against departure from the normality assumption. Consider the number of predictors $p$ to be 1000, 5000, and 10,000, respectively, which are all much larger than the sample sizes considered in the five models. Recall that $p_0$ is the number of relevant predictors. Denote the first $p_0$ rows of $B$ by $B_0$. We generate independent entries of $B_0$ from the distributions given in the last column of Table 1, where $N(4\log(n)/n,1)$ is a normal random variable with mean $4\log(n)/n$ and variance 1, $\Gamma(2,1)$ denotes a gamma random variable with shape parameter 2 and scale parameter 1, and $\exp(9)$ is an exponential random variable with parameter 9. They are all independent of $x$. Set the remaining entries (the last $p-p_0$ rows) of $B$ to zero.
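As a concrete illustration, data under Model 1 (identity covariance for $x$ and normal entries for $B_0$) could be generated as in the following sketch; the function name and the default error scale are placeholders of ours, since $\sigma^2$ is in fact chosen from the target signal-to-noise ratio discussed next.

```python
import numpy as np

def generate_model1(n, p, q, p0, sigma=1.0, seed=0):
    """Sketch of data generation for Model 1: x ~ N(0, I_p), the first p0
    rows of B drawn from N(4*log(n)/n, 1), remaining rows zero, normal errors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                       # Sigma_x = I_p
    B = np.zeros((p, q))
    B[:p0, :] = rng.normal(4 * np.log(n) / n, 1.0, size=(p0, q))
    E = sigma * rng.standard_normal((n, q))
    Y = X @ B + E                                         # y_i = B' x_i + eps_i
    return X, Y, B
```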

For the multivariate response case, the signal-to-noise ratio is given by $R^2=\frac{\operatorname{tr}\{\operatorname{var}(B^\top x)\}}{\operatorname{tr}\{\operatorname{var}(y)\}}=\frac{\operatorname{tr}(B^\top\Sigma_x B)}{q\sigma^2+\operatorname{tr}(B^\top\Sigma_x B)}$. We choose the values of $\sigma^2$ such that the signal-to-noise ratios are 30%, 60%, and 90%, respectively.
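To make the choice of $\sigma^2$ explicit, solving $R^2=T/(q\sigma^2+T)$ with $T=\operatorname{tr}(B^\top\Sigma_x B)$ gives $\sigma^2=T(1-R^2)/(qR^2)$. The sketch below is our own rendering of this algebra, under the assumption that the error covariance is $\sigma^2 I_q$.

```python
import numpy as np

def noise_variance_for_r2(B, Sigma_x, r2):
    """Solve R^2 = T / (q * sigma^2 + T), T = tr(B' Sigma_x B),
    for sigma^2 given a target signal-to-noise ratio r2."""
    q = B.shape[1]
    T = np.trace(B.T @ Sigma_x @ B)
    return T * (1.0 - r2) / (q * r2)
```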

Throughout, set the number of replications N to be 1000.

5.2. Evaluation criteria

For MCC screening, we use Fan and Lv's (Citation2008) hard threshold method to retain the relevant predictors. For both NFR and UFR, we use the BIC criterion (4) to determine the relevant predictors.

We adopt five measures, described in Table 2, to evaluate the finite sample performance of the proposed methods: the model size (MS) is the number of selected predictors, the coverage probability (CP) measures how likely all the relevant predictors are to be identified, the percentage of correctly fitted models (CF) measures the capability of identifying the true model exactly, the correct zero (CZ) characterizes the capability of producing a sparse solution, and the incorrect zero (IZ) characterizes a method's under-fitting. Ideally, we wish a method to have MS close to $p_0$, CP, CF and CZ all close to 100%, and IZ close to zero.

Table 2. Measures for the finite sample performance of variable selection.

For $b=1,\ldots,N$, let $\widehat B^{(b)}$ denote the estimate of $B$ in the $b$th replication. The corresponding selected model is denoted by $\widehat S^{(b)}=\{k:\hat\beta_{(k)}^{(b)}\neq 0,\ k=1,\ldots,p\}$. The empirical MS is computed as $\mathrm{MS}=N^{-1}\sum_{b=1}^{N}|\widehat S^{(b)}|$, and the empirical values of the other measures are computed similarly.
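For concreteness, the empirical measures could be computed from the replicated selected sets as in the following sketch; this reflects our reading of the measures in Table 2, with illustrative function and variable names.

```python
import numpy as np

def selection_measures(selected_sets, true_set, p):
    """Average model size (MS), coverage probability (CP), correct fit (CF),
    correct zero (CZ) and incorrect zero (IZ) over replications, following
    our reading of the measures in Table 2."""
    S = set(true_set)
    ms = np.mean([len(set(s)) for s in selected_sets])
    cp = np.mean([S.issubset(s) for s in selected_sets])    # all relevant covered
    cf = np.mean([set(s) == S for s in selected_sets])      # exactly the true model
    cz = np.mean([1 - len(set(s) - S) / (p - len(S)) for s in selected_sets])  # irrelevant left out
    iz = np.mean([len(S - set(s)) / len(S) for s in selected_sets])            # relevant missed
    return {"MS": ms, "CP": cp, "CF": cf, "CZ": cz, "IZ": iz}
```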

5.3. Results

Tables 3–7 report the finite sample performance of the competing methods in terms of the measures given in Table 2 under various numbers of covariates $p$ and signal strengths $R^2$. We summarize the findings as follows. (i) The MCC method is uniformly superior to the NCC method, with larger coverage probability (CP) and better estimation of sparsity (larger CZ and smaller IZ), as expected. (ii) As we adopted the fixed threshold procedure for MCC and NCC, these two methods produce conservatively large coverage of predictors at the cost of a large model size. For the same reason, their percentage of incorrect zeros is larger than that of the two regression-based methods (UFR and NFR), and the resulting percentages of correctly fitted models for MCC and NCC are zero. (iii) When comparing UFR with NFR, UFR demonstrates its superiority over NFR uniformly in all five measures across all five models (including Model 5 with the non-normal distribution). This corroborates the advantage of UFR over NFR in utilizing the correlation within the responses. When comparing UFR with PWL, both methods perform comparably when the signal strength is as small as 30%; when the signal strength is as large as 60% or 90%, UFR outperforms PWL in all five measures in general. (iv) The UFR method performs worse than the MCC method in cases of ultra-high dimensional covariates, especially under low signal strength, as pointed out earlier. For instance, in Model 1 of Table 3, the coverage probability of UFR drops by 83%, while that of MCC drops by 33%, when the dimension of predictors $p$ increases from 1000 to 10,000 at the signal strength of 30%. (v) As for the impact of the signal strength, the percentage of incorrect zeros is higher under weak signal strength than under strong signal strength, which is consistent with the findings for the univariate case in H. Wang (Citation2009) and Y. Li et al. (Citation2017). As the signal strength increases (e.g., from 30% to 90%), the coverage probability (CP) and the percentage of correct fit (CF) increase significantly (e.g., from 61.9% to 98.3% and from 28.8% to 58.8%, respectively, with p = 5000) and the percentage of incorrect zeros (IZ) drops quickly (e.g., from 53.7% to 2.35% with p = 5000) for both NFR and UFR. (vi) To examine the impact of the sample size, Table 8 reports the performance of the proposed methods under Model 1 with the number of covariates $p$ fixed at 5000 and the sample size $n$ varying over 100, 200, and 400. The measures of model size (MS), coverage probability (CP), correct fit (CF) and incorrect zero (IZ) are sensitive to the sample size, and the improvement with larger samples is substantial. For instance, when the sample size increases from n = 100 to n = 200 with signal strength $R^2=60\%$, the CP increases from 52.2% to 80.4% on average and the percentage of incorrect zeros drops from 47.8% to 19.7% on average.

Table 3. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 1 in Table 1 with $(n,q,p_0)=(200,4,8)$.

Table 4. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 2 in Table 1 with $(n,q,p_0)=(75,5,3)$.

Table 5. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 3 in Table 1 with $(n,q,p_0)=(200,3,3)$.

Table 6. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 4 in Table 1 with $(n,q,p_0)=(300,2,5)$.

Table 7. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 5 in Table 1 with $(n,q,p_0)=(200,6,10)$.

Table 8. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various sample sizes ($n$) and signal-to-noise ratios ($R^2$) for Model 1 in Table 1 with $(p,q,p_0)=(5000,4,8)$.

In conclusion, the MCC method performs better when the dimension of the covariates is ultra-high ($\log(p)=O(n^{\alpha})$) relative to the sample size, and the UFR method outperforms the MCC method when the dimension of the covariates is of polynomial order ($p=O(n^{\alpha})$).

6. Real data application

We apply the proposed methods to a real data set on bone mineral density (BMD) (Reppe et al., Citation2010). The data were collected from 84 post-menopausal Caucasian women aged from 50 to 86. For each subject, there are two responses, namely the body mass index and the total hip z-score (a measure of bone strength in the hip), and 8649 gene expression levels from trans-iliacal bone biopsies serve as covariates. It is known that low bone mineral density is usually related to fragile bones, osteoporosis and a progressive reduction of bone strength, which leads to increased susceptibility to bone fractures (Cooper, Citation1997; Reppe et al., Citation2010). The goal of the study is to identify the genes that are related to BMD.

Table 9 reports the genes identified by the five competing methods. The MCC method identified 19 genes, which include all of the 13 genes identified by NFR except gene TNK2. The PWL method identified 12 genes, all of which were also identified by NFR except PAIP1. The UFR method found 10 significant genes, all of which are contained in the set identified by NFR.

Table 9. Selected genes for the BMD data.

To examine the quality of the variable selection of these methods, we compare the prediction mean square errors $E\|y-\widehat B_{(\widehat S)}^\top x_{(\widehat S)}\|^2$ obtained by the competing methods. To this end, we randomly split the data into a training set of 60 samples and a testing set of the remaining 24 samples. The average prediction mean square errors over 100 random splits for MCC, NCC, NFR and UFR are 273.7, 293.5, 271.0 and 241.6, respectively. Clearly, the UFR method is the winner. All eight genes (ACSL3, NIPSNAP3B, DLEU2, C1ORF61, DKK1, SOST, ABCA8, and AFFX-M27830-M-at) identified by Reppe et al. (Citation2010) were selected by the four competing methods. The UFR method discovered two more genes, RNF216 and PLIN5, and achieved the smallest prediction mean square error. Similar to the findings in the simulations, both MCC and NCC selected more genes than NFR and UFR, but with larger prediction errors.
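A minimal sketch of the random-split evaluation described above follows; the split sizes mirror the text (60 training, 24 testing samples), while select_fn stands in for any of the competing selection methods, and the least squares refit on the selected genes is our own choice.

```python
import numpy as np

def split_prediction_mse(X, Y, select_fn, n_train=60, n_splits=100, seed=0):
    """Average prediction mean square error over random training/testing splits.
    `select_fn(X, Y)` should return the indices of the selected covariates;
    prediction uses least squares on the selected covariates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        sel = list(select_fn(X[tr], Y[tr]))
        coef, *_ = np.linalg.lstsq(X[tr][:, sel], Y[tr], rcond=None)
        pred = X[te][:, sel] @ coef
        errors.append(np.mean(np.sum((Y[te] - pred) ** 2, axis=1)))
    return float(np.mean(errors))
```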

7. Conclusion

We propose two methods for variable screening in high-dimensional multivariate linear regression. The MCC method has the advantage of computational ease and provides fast variable screening that yields an accurate subset with dimension below the sample size. The proposed UFR method has the feature of discovering all relevant predictors consistently at nearly the same computational cost as the univariate forward regression. The performance of UFR is sensitive to the dimensionality and the signal strength. Although our theory assumes a Gaussian distribution for the random errors, the numerical study shows the robustness of the proposed methods against non-normality. It is of interest to investigate the problem under more general, non-homogeneous sparsity assumptions and under nonlinear models.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Anderson, T. W. (2003). An introduction to multivariate statistical analysis (3rd ed.). Wiley.
  • Bickel, P. J., & Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227. https://doi.org/10.1214/009053607000000758
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384. https://doi.org/10.1080/00401706.1995.10484371
  • Breiman, L., & Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1), 3–54. https://doi.org/10.1111/rssb.1997.59.issue-1
  • Cai, T., Li, H., Liu, W., & Xie, J. (2013). Covariate–adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100(1), 139–156. https://doi.org/10.1093/biomet/ass058
  • Cai, T., Liu, W., & Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607. https://doi.org/10.1198/jasa.2011.tm10155
  • Cai, T., & Lv, J. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2365–2369. https://doi.org/10.1214/009053607000000442
  • Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351. https://doi.org/10.1214/009053606000001523
  • Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545. https://doi.org/10.1080/01621459.2012.734178
  • Cooper, C. (1997). The crippling consequences of fractures and their impact on quality of life. American Journal of Medicine, 103(2), 12–19. https://doi.org/10.1016/S0002-9343(97)90022-X
  • Deshpande, S., Rockova, V., & George, E. (2019). Simultaneous variable and covariance selection with the multivariate Spike- and Slab lasso. Journal of Computational and Graphical Statistics, 28(4), 921–931. https://doi.org/10.1080/10618600.2019.1593179
  • Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499. https://doi.org/10.1214/009053604000000067
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B, 70(5), 849–911. https://doi.org/10.1111/rssb.2008.70.issue-5
  • Fan, J., & Peng, H. (2004). Non-concave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32(3), 928–961. https://doi.org/10.1214/009053604000000256
  • Fang, K.-T., Kotz, S., & Ng, K. W. (2018). Symmetric multivariate and related distributions. Chapman and Hall/CRC.
  • Ferte, C., Trister, A. D., Erich, H., & Bot, B. (2013). Impact of bioinformatic procedures in the development and translation of high-throughput molecular classifiers in oncology. Clinical Cancer Research, 19(16), 4315–4325. https://doi.org/10.1158/1078-0432.CCR-12-3937
  • Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135. https://doi.org/10.1080/00401706.1993.10485033
  • Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3), 397–416. https://doi.org/10.1080/10618600.1998.10474784
  • He, K., Lian, H., Ma, S., & Huang, J. Z. (2018). Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. Journal of the American Statistical Association, 113(522), 746–754. https://doi.org/10.1080/01621459.2017.1285774
  • Jia, B., Xu, S., Xiao, G., & Lambda, V. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics, 73(4), 1221–1230. https://doi.org/10.1111/biom.v73.4
  • Kim, S., Sohn, K., & Xing, E. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics (Oxford, England), 25(12), 204–212. https://doi.org/10.1093/bioinformatics/btp218
  • Kong, X., Liu, Z., Yao, Y., & Zhou, W. (2017). Sure screening by ranking the canonical correlations. Test, 26(1), 46–70. https://doi.org/10.1007/s11749-016-0497-z
  • Kong, Y., Zheng, Z., & Lv, J. (2016). The constrained Dantzig selector with enhanced consistency. Journal of Machine Learning Research, 17(123), 1–22.
  • Lee, W., & Liu, Y. (2012). Simultaneous multiple response regression and inverse covariate matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111, 241–255. https://doi.org/10.1016/j.jmva.2012.03.013
  • Li, B., Chun, H., & Zhao, H. (2012). Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association, 107(497), 152–167. https://doi.org/10.1080/01621459.2011.644498
  • Li, C., & Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics (Oxford, England), 24(9), 1175–1182. https://doi.org/10.1093/bioinformatics/btn081
  • Li, G., Peng, H., Zhang, J., & Zhu, L. (2012). Robust rank correlation based screening. Annals of Statistics, 40(3), 1846–1877. https://doi.org/10.1214/12-AOS1024
  • Li, Y., Li, G., Lian, H., & Tong, T. (2017). Profile forward regression screening for ultra-high dimensional semiparametric varying coefficient partially linear models. Journal of Multivariate Analysis, 155, 133–150. https://doi.org/10.1016/j.jmva.2016.12.006
  • Li, Y., Nan, B., & Zhu, J. (2015). Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2), 354–363. https://doi.org/10.1111/biom.v71.2
  • Liang, H., Wang, H., & Tsai, C.-L. (2012). Profiled forward regression for ultrahigh dimensional variable screening in semiparametric partially linear model. Statistica Sinica, 22(2), 531–554. https://doi.org/10.5705/ss.2010.134
  • Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39(1), 1–47. https://doi.org/10.1214/09-AOS776
  • Pecanka, J., van der Vaart, A. W., & Marianne, J. (2019). Modeling association between multivariate correlated outcomes and high-dimensional sparse covariates: The adaptive SVS method. Journal of Applied Statistics, 46(5), 893–913. https://doi.org/10.1080/02664763.2018.1523377
  • Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., & Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Annals of Applied Statistics, 4(1), 53–77. https://doi.org/10.1214/09-AOAS271
  • Ravikumar, P., Wainwright, M., & Lafferty, J. (2010). High-dimensional Ising model selection using l1 regularized logistic regression. Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691
  • Ren, J., Du, Y., Li, S., Ma, S., Jiang, Y., & Wu, C. (2019). Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genetic Epidemiology, 43(3), 276–291. https://doi.org/10.1002/gepi.2018.43.issue-3
  • Ren, J., He, T., Li, Y., Liu, S., Du, Y., Jiang, Y., & Wu, C. (2017). Network-based regularization for high dimensional SNP data in the case-control study of type 2 diabetes. BMC Genetics, 18(1), 44. https://doi.org/10.1186/s12863-017-0495-5
  • Reppe, S., Refvem, H., Gautvik, V. T., Olstad, O. K., Høvring, P. I., Reinholt, F. P., Holden, M., Frigessi, A., Jemtland, R., & Gautvik, K. M. (2010). Eight genes are highly associated with BMD variation in postmenopausal Caucasian women. Bone, 46(3), 604–612. https://doi.org/10.1016/j.bone.2009.11.007
  • Rothman, A. J., Levina, E., & Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4), 947–962. https://doi.org/10.1198/jcgs.2010.09188
  • Saulis, L., & Statulevicius, V. (1991). Limit theorems for large deviations (Vol. 73). Springer Science & Business Media.
  • Setdji, C. M., & Cook, R. D. (2004). K-means inverse regression. Technometrics, 46(4), 421–429. https://doi.org/10.1198/004017004000000437
  • Smith, M., & Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. Journal of the American Statistical Association, 102(478), 417–431. https://doi.org/10.1198/016214506000001031
  • Sofer, T., Dicker, L., & Lin, X. (2014). Variable selection for high dimensional multivariate outcomes. Statistica Sinica, 24(4), 1633–1654. https://doi.org/10.5705/ss.2013.019
  • Song, Y., Schreier, P. J., Ramirez, D., & Hasija, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing, 128, 449–458. https://doi.org/10.1016/j.sigpro.2016.05.020
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
  • Turlach, B., Venables, W., & Wright, S. (2005). Simultaneous variable selection. Technometrics, 47(3), 349–363. https://doi.org/10.1198/004017005000000139
  • Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516
  • Wang, J., Zhang, Z., & Ye, J. (2019). Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. Journal of Machine Learning Research, 20(163), 1–42.
  • Yang, Y., & Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129–1141. https://doi.org/10.1007/s11222-014-9498-5
  • Yi, N. (2010). Statistical analysis of genetic interactions. Genetics Research, 92(5–6), 443–459. https://doi.org/10.1017/S0016672310000595
  • Yin, J., & Li, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics, 5(4), 2630–2650. https://doi.org/10.1214/11-AOAS494
  • Zhang, C., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4), 1567–1594. https://doi.org/10.1214/07-AOS520
  • Zhang, H. H., & Lu, W. (2007). Adaptive lasso for Cox's proportional hazards model. Biometrika, 94(3), 691–703. https://doi.org/10.1093/biomet/asm037
  • Zhang, N., Yu, Z., & Wu, Q. (2019). Overlapping sliced inverse regression for dimension reduction. Analysis and Applications, 17(5), 715–736. https://doi.org/10.1142/S0219530519400013
  • Zhao, W., Lian, H., & Ma, S. (2017). Robust reduced–rank modeling via rank regression. Journal of Statistical Planning and Inference, 180, 1–12. https://doi.org/10.1016/j.jspi.2016.08.009
  • Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320. https://doi.org/10.1111/rssb.2005.67.issue-2

Appendix

A.1. Proof of Theorem 3.1

Let $C_K=\tau+\eta^{-1}K$ and $\vartheta=3\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}$. When $\rho_k^2\ge\tau$ for all $k\in S$, it suffices to show that under Assumptions 3.1–3.3,
(A1) $P\Big(\max_{1\le k\le p}|\hat\rho_k^2-\rho_k^2|\ge\vartheta\Big)=P\Big(\max_{1\le k\le p}\Big|\frac1n\sum_{i=1}^{n}\hat\gamma_k^\top y_i x_{ik}-E\gamma_k^\top yx_k\Big|\ge\vartheta\Big)<3p^{-\tau}q^{-\tau}.$
Observe that
(A2) $|\hat\rho_k^2-\rho_k^2|=\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}+\gamma_k^\top\Big(\frac1n\sum_{i=1}^{n}y_i x_{ik}-Eyx_k\Big)\Big|\le\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|+\Big|\gamma_k^\top\Big(\frac1n\sum_{i=1}^{n}y_i x_{ik}-Eyx_k\Big)\Big|\equiv A_1+A_2.$
First, we compute the rate of convergence for $A_2$. For $k=1,\ldots,p$, let $\omega_k=n^{-1}\sum_{i=1}^{n}\gamma_k^\top(y_i x_{ik}-Eyx_k)$, so that $A_2=|\omega_k|$. Let $t_1=\eta\sqrt{n\log(pq)}$. Applying the inequalities $P(|U|\ge V)\le e^{-t_1V}Ee^{t_1|U|}$ for any $V>0$ and $|e^{s}-1-s|\le s^2e^{\max(s,0)}$ for any $s\in\mathbb{R}$, we obtain
(A3) $P\big(A_2\ge\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\big)\le e^{-C_K\log(pq)}E\exp(t_1|\omega_k|)\le\exp\{-C_K\log(pq)+t_1^2E\omega_k^2\exp(t_1|\omega_k|)\}\le\exp\{-C_K\log(pq)+\eta^{-1}K\log(pq)\}=p^{-\tau}q^{-\tau}.$
Second, we compute the rate of convergence for $A_1$. Notice that
(A4) $\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\le\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|+\Big|\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|.$
Parallel to (A3), we have
(A5) $P\Big(\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\ge\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\Big)\le e^{-C_K\log(pq)}E\exp\Big(t_1\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\Big)\le\exp\Big\{-C_K\log(pq)+t_1^2E\Big(\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big)^2\exp\Big(t_1\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\Big)\Big\}\le\exp\{-C_K\log(pq)+\eta^{-1}K\log(pq)\}=p^{-\tau}q^{-\tau}.$
Similarly to (A5),
(A6) $P\Big(\Big|\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\ge\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\Big)\le p^{-\tau}q^{-\tau}.$
Combining (A4), (A5) and (A6),
(A7) $P\Big(A_1=\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\ge2\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\Big)\le2p^{-\tau}q^{-\tau}.$
Finally, combining (A3) and (A7), we get $P\big(A_1+A_2\ge3\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\big)\le3p^{-\tau}q^{-\tau}$, which, coupled with (A2), implies (A1). This completes the proof.

A.2. A lemma used to prove Theorem 4.1

Lemma A.1

Suppose Assumptions 3.1, 4.1 and 4.2 hold. Let $\Sigma_{x(M)}$ denote the submatrix of $\Sigma_x$ corresponding to $M$, and let $\widehat\Sigma_{x(M)}$ and $\widehat\Sigma_x$ denote the corresponding sample estimators. Suppose $m=O(n^{2\xi_0+4\xi_{\min}})$, where $\xi_0$ and $\xi_{\min}$ are defined in Assumptions 4.2 and 3.2. Then, with probability tending to one, we have
(A8) $\tau_{\min}\le\min_{|M|\le m}\lambda_{\min}(\widehat\Sigma_{x(M)})\le\max_{|M|\le m}\lambda_{\max}(\widehat\Sigma_{x(M)})\le\tau_{\max}.$

Proof.

The proof is similar to that for Lemma 1 of H. Wang (Citation2009). Here, we relax the normality assumption of x to the elliptically contoured distribution.

First, for $i=1,\ldots,n$ and $j,k=1,\ldots,p$, let $U_i=(z_{ij}+z_{ik})/\sqrt{2(1+\rho_{jk})}$ and $V_i=(z_{ij}-z_{ik})/\sqrt{2(1-\rho_{jk})}$. By Assumption 4.1 and the additive property of elliptically contoured distributions (Fang et al., Citation2018), $(z_{ij},z_{ik})\sim EC_2(0,0,1,1,\rho_{jk},g)$, $U_i\overset{\text{i.i.d.}}{\sim}EC(0,1,g)$ and $V_i\overset{\text{i.i.d.}}{\sim}EC(0,1,g)$.

Second, observe that
$\sum_{i=1}^{n}(z_{ij}z_{ik}-\rho_{jk})=\frac14\Big[\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}-\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big].$
Then, following Lemma A.3 of Bickel and Levina (Citation2008), we have
(A9) $P(|\hat\sigma_{jk}-\sigma_{jk}|\ge v)=P\Big(\Big|\sum_{i=1}^{n}(x_{ij}x_{ik}-\sigma_{jk})\Big|\ge nv\Big)=P\Big(\Big|\sum_{i=1}^{n}(z_{ij}z_{ik}-\rho_{jk})\Big|\ge nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big)=P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}-\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big|\ge4nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]\le P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}\Big|+\Big|\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big|\ge4nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]\le P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}\Big|\ge2nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]+P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big|\ge2nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]=P\Big\{\Big|\sum_{i=1}^{n}(U_i^2-1)\Big|\ge nv(1+\rho_{jk})^{-1}(\sigma_{jj}\sigma_{kk})^{-1/2}\Big\}+P\Big\{\Big|\sum_{i=1}^{n}(V_i^2-1)\Big|\ge nv(1-\rho_{jk})^{-1}(\sigma_{jj}\sigma_{kk})^{-1/2}\Big\}.$
Further, for $i=1,\ldots,n$, let $W_i=U_i^2-1$ and $B_n^2=\sum_{i=1}^{n}\operatorname{var}(W_i)$. Observe that by Jensen's inequality there exist positive constants $c_1,\ldots,c_n$, $C_3$ and $C_4$ such that
$\big|\ln\{E\exp(\zeta W_i)\}\big|\le\frac{\zeta^2c_i^2}{2}\quad\text{for }|\zeta|<C_3\qquad\text{and}\qquad\varlimsup_{n\to\infty}\frac1{B_n^2}\sum_{i=1}^{n}c_i^2\le C_4,$
satisfying condition (P) of Saulis and Statulevicius (Citation1991). The same result holds when $W_i=V_i^2-1$. By Theorem 3.2 of Saulis and Statulevicius (Citation1991), the first and second terms of (A9) are bounded, respectively, by
$2\exp\Big\{-\frac{2nv^2}{(1+\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1+\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\}\quad\text{and}\quad2\exp\Big\{-\frac{2nv^2}{(1-\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1-\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\}.$
Therefore,
$P(|\hat\sigma_{jk}-\sigma_{jk}|\ge v)\le4\max\Big[\exp\Big\{-\frac{2nv^2}{(1+\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1+\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\},\ \exp\Big\{-\frac{2nv^2}{(1-\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1-\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\}\Big].$

This, together with $\lambda_{\max}(\Sigma_x)<2^{-1}\tau_{\max}$, implies that
(A10) $P(|\hat\sigma_{jk}-\sigma_{jk}|\ge v)\le C_1\exp(-C_2nv^2)\quad\text{for }|v|\le\delta,$
where the positive constants $C_1$, $C_2$ and $\delta$ depend on $\tau_{\max}$ alone (Bickel & Levina, Citation2008, Lemma A.3).

The rest of the proof proceeds exactly as that of Lemma 1 of H. Wang (Citation2009).

A.3. Proof of Theorem 4.1

Our proof follows similar arguments as in the proof of Theorem 1 of H. Wang (Citation2009).

Assume that no relevant predictor has been discovered in the first $\ell$ iterations, i.e., $S\cap S^{(\ell)}=\emptyset$. We evaluate the probability that at least one relevant predictor is identified in the $(\ell+1)$th iteration or, equivalently, its complementary probability that the predictor selected in the $(\ell+1)$th iteration is still an irrelevant one.

Let $X_{(S)}=(x_{1(S)},\ldots,x_{n(S)})^\top$ denote the submatrix of $X$ corresponding to $S$, and let $B_{(S)}$ denote the coefficient matrix under the true model.

Denote
$H_{(S^{(\ell)})}=X_{(S^{(\ell)})}\{X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})}\}^{-1}X_{(S^{(\ell)})}^\top,\quad \tilde H_k^{(\ell)}=X_{(M_k^{(\ell)})}\big(X_{(M_k^{(\ell)})}^\top X_{(M_k^{(\ell)})}\big)^{-1}X_{(M_k^{(\ell)})}^\top,\quad x_k^{(\ell)}=(I_n-H_{(S^{(\ell)})})x_k,\quad H_k^{(\ell)}=\frac{x_k^{(\ell)}x_k^{(\ell)\top}}{\|x_k^{(\ell)}\|^2}.$
Observe that
(A11) $\Omega^{(\ell)}=\mathrm{RSS}(S^{(\ell)})-\mathrm{RSS}(S^{(\ell+1)})=\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})Y\}-\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell+1)})})Y\}=\operatorname{tr}\{Y^\top(H_{(S^{(\ell+1)})}-H_{(S^{(\ell)})})Y\}=\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})H_{a_{\ell+1}}^{(\ell)}H_{a_{\ell+1}}^{(\ell)}(I_n-H_{(S^{(\ell)})})Y\}.$
Assume $a_{\ell+1}\notin S$. We have
(A12) $\Omega^{(\ell)}\ge\max_{k\in S}\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})Y\}\ge\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})H_{\hat k}^{(\ell)}H_{\hat k}^{(\ell)}(I_n-H_{(S^{(\ell)})})Y\},$
where $\hat k=\arg\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})X_{(S)}B_{(S)}\}$. Further, observe that the last term of (A12) is no less than
(A13) $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top(I_n-H_{(S^{(\ell)})})H_{\hat k}^{(\ell)}H_{\hat k}^{(\ell)}(I_n-H_{(S^{(\ell)})})X_{(S)}B_{(S)}\}-\operatorname{tr}\{\epsilon^\top(I_n-H_{(S^{(\ell)})})H_{\hat k}^{(\ell)}H_{\hat k}^{(\ell)}(I_n-H_{(S^{(\ell)})})\epsilon\}\ge\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})X_{(S)}B_{(S)}\}-\max_{k\in S}\operatorname{tr}\{\epsilon^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})\epsilon\},$

where $\epsilon=(\epsilon_1,\ldots,\epsilon_n)^\top\in\mathbb{R}^{n\times q}$.

In what follows, we study the two terms in (EquationA13) separately.

Step 1: The first term of (A13). Define $Q_{(S^{(\ell)})}=I_n-H_{(S^{(\ell)})}$, so that $x_k^{(\ell)}=Q_{(S^{(\ell)})}x_k$. Then the first term in (A13) can be expressed as
(A14) $\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}H_k^{(\ell)}H_k^{(\ell)}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\max_{k\in S}\operatorname{tr}\Big\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}\frac{x_k^{(\ell)}x_k^{(\ell)\top}x_k^{(\ell)}x_k^{(\ell)\top}}{\|x_k^{(\ell)}\|^4}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\Big\}=\max_{k\in S}\operatorname{tr}\Big\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}\frac{x_k^{(\ell)}x_k^{(\ell)\top}}{\|x_k^{(\ell)}\|^2}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\Big\}.$
Define $k^{*}=\arg\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}$. Thus, the right-hand side of (A14) is no less than
(A15) $\|x_{k^{*}}^{(\ell)}\|^{-2}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_{k^{*}}^{(\ell)}x_{k^{*}}^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge\Big(\max_{k\in S}\|x_k^{(\ell)}\|^2\Big)^{-1}\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge\Big(\max_{k\in S}\|x_k\|^2\Big)^{-1}\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\},$
where the last inequality follows from the fact that $\|x_k\|\ge\|x_k^{(\ell)}\|$.

On the other hand, observe that
(A16) $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\sum_{j=1}^{q}\Big(\sum_{k\in S}\beta_{kj}x_k^\top Q_{(S^{(\ell)})}X_{(S)}\beta_j\Big)\le\sum_{j=1}^{q}\Big(\sum_{k\in S}\beta_{kj}^2\Big)^{1/2}\Big\{\sum_{k\in S}\big(x_k^\top Q_{(S^{(\ell)})}X_{(S)}\beta_j\big)^2\Big\}^{1/2}\le q\max_j\|\beta_j\|\Big[\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\Big]^{1/2}\sqrt{p_0}.$
Applying (A16) to (A15) and (A14), using the fact that $\max_{k\in S}\|x_k\|^2/n\le\tau_{\max}$ with probability tending to one, and by Assumptions 3.1 and 4.1 and Lemma 1 of H. Wang (Citation2009), we obtain
(A17) $\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}H_k^{(\ell)}H_k^{(\ell)}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge\frac{\big[\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\big]^2}{q^2n\tau_{\max}p_0\max_j\|\beta_j\|^2}.$
Define $\zeta_{(S^{(\ell)})}=(X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})})^{-1}X_{(S^{(\ell)})}^\top X_{(S)}B_{(S)}$. We have $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\operatorname{tr}\big(B_{(S)}^\top X_{(S)}^\top X_{(S)}B_{(S)}-\zeta_{(S^{(\ell)})}^\top X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})}\zeta_{(S^{(\ell)})}\big)$. Recall the assumption at the beginning of the proof that $S\cap S^{(\ell)}=\emptyset$. Then, by Assumptions 3.1, 3.2 and 4.1 and Lemma 1 of H. Wang (Citation2009), we get
(A18) $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge qn\tau_{\min}\beta_{\min}^2$
with probability tending to one. Applying (A18) to (A17) and using Assumptions 3.2, 4.1 and 4.2, we obtain
(A19) $\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}H_k^{(\ell)}H_k^{(\ell)}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge n\tau_{\max}^{-1}p_0^{-1}\big(\max_j\|\beta_j\|\big)^{-2}\tau_{\min}^{2}\beta_{\min}^{4}\ge\tau_{\max}^{-1}\nu^{-1}C_B^{-2}\tau_{\min}^{2}\nu_B^{4}n^{1-\xi_0-4\xi_{\min}}.$
Step 2: The second term of (A13). Recall that $x_k^{(\ell)}=x_k-H_{(S^{(\ell)})}x_k=x_k-X_{(S^{(\ell)})}\theta_{k(S^{(\ell)})}$, where $\theta_{k(S^{(\ell)})}=(X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})})^{-1}X_{(S^{(\ell)})}^\top x_k$. Then, by Assumptions 3.1 and 3.2 and Lemma 1 of H. Wang (Citation2009), we have $\|x_k^{(\ell)}\|^2\ge n\tau_{\min}$.

Moreover, since $x_k^{(\ell)}=(I_n-H_{(S^{(\ell)})})x_k$, we have
(A20) $\operatorname{tr}\{\epsilon^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})\epsilon\}=\|x_k^{(\ell)}\|^{-2}\operatorname{tr}\big(\epsilon^\top x_kx_k^\top\epsilon-\epsilon^\top H_{(S^{(\ell)})}x_kx_k^\top H_{(S^{(\ell)})}\epsilon\big)\le\tau_{\min}^{-1}n^{-1}\operatorname{tr}\big(\epsilon^\top x_kx_k^\top\epsilon-\epsilon^\top H_{(S^{(\ell)})}x_kx_k^\top H_{(S^{(\ell)})}\epsilon\big)=\tau_{\min}^{-1}n^{-1}\operatorname{tr}\big(\epsilon^\top Q_{(S^{(\ell)})}x_kx_k^\top Q_{(S^{(\ell)})}\epsilon\big)\le\tau_{\min}^{-1}n^{-1}\max_{k\in S}\max_{|M|\le m}\operatorname{tr}\big(\epsilon^\top Q_{(M)}x_kx_k^\top Q_{(M)}\epsilon\big)=\tau_{\min}^{-1}n^{-1}\max_{k\in S}\max_{|M|\le m}\sum_{j=1}^{q}\big(\epsilon_{(j)}^\top Q_{(M)}x_kx_k^\top Q_{(M)}\epsilon_{(j)}\big),$
where $m=K\nu n^{2\xi_0+4\xi_{\min}}$.

Notice that $x_k^\top Q_{(M)}\epsilon_{(j)}$ is a normal random variable with mean zero and variance $\|Q_{(M)}x_k\|^2\le\|x_k\|^2$. Thus, the right-hand side of (A20) is no greater than
(A21) $q\tau_{\min}^{-1}n^{-1}\max_{k\in S}\|x_k\|^2\max_{k\in S}\max_{|M|\le m}\chi_1^2,$
where $\chi_1^2$ stands for a chi-square random variable with one degree of freedom. By Assumptions 3.1 and 4.1 and Lemma 1 of H. Wang (Citation2009), we get that $n^{-1}\max_{k\in S}\|x_k\|^2\le\tau_{\max}$ with probability tending to one.

On the other hand, the total number of combinations of $k\in S$ and $|M|\le m$ is no more than $p^{m+2}$. Then, by Assumption 4.2, we get
$\max_{k\in S}\max_{|M|\le m}\chi_1^2\le2(m+2)\log(p)\le3K\nu n^{2\xi_0+4\xi_{\min}}\times\nu n^{\xi}=3K\nu^2n^{\xi+2\xi_0+4\xi_{\min}}$
with probability tending to one. Therefore, (A21) is bounded by
(A22) $q\tau_{\min}^{-1}\tau_{\max}3K\nu^2n^{\xi+2\xi_0+4\xi_{\min}-1}.$
Combining (A13), (A19) and (A22), we have
(A23) $n^{-1}\Omega^{(\ell)}\ge\tau_{\max}^{-1}\nu^{-1}C_B^{-2}\tau_{\min}^{2}\nu_B^{4}n^{-\xi_0-4\xi_{\min}}-q\tau_{\min}^{-1}\tau_{\max}3K\nu^2n^{\xi+2\xi_0+4\xi_{\min}-1}=\tau_{\max}^{-1}\nu^{-1}C_B^{-2}\tau_{\min}^{2}\nu_B^{4}n^{-\xi_0-4\xi_{\min}}\times\big(1-3Kq\tau_{\max}^{2}\nu^{3}C_B^{2}\tau_{\min}^{-3}\nu_B^{-4}n^{\xi+3\xi_0+8\xi_{\min}-1}\big)$
uniformly for every $\ell\le Kn^{\xi_0+4\xi_{\min}}$. Recall $K=2\tau_{\max}\nu C_B^{2}\tau_{\min}^{-2}\nu_B^{-4}$ defined in Section 4. Then, by Assumptions 4.2 and 4.3, we have
(A24) $n^{-1}\|Y\|_F^2\ge n^{-1}\sum_{\ell=1}^{Kn^{\xi_0+4\xi_{\min}}}\Omega^{(\ell)}\ge2\big(1-3Kq\tau_{\max}^{2}\nu^{3}C_B^{2}\tau_{\min}^{-3}\nu_B^{-4}n^{\xi+3\xi_0+8\xi_{\min}-1}\big)\to_{p}2,$
where $\|\cdot\|_F$ is the Frobenius norm.

Without loss of generality, we can assume $\operatorname{var}(y_{i1})+\cdots+\operatorname{var}(y_{iq})=1$, so that $n^{-1}\|Y\|_F^2\to_{p}1$. (Otherwise, we can standardize by letting $y_{ik}=y_{ik}/\sqrt{\operatorname{var}(y_{i1})+\cdots+\operatorname{var}(y_{iq})}$ for $i=1,\ldots,n$ and $k=1,\ldots,q$.) Thus, it is impossible to have $S\cap S^{(\ell)}=\emptyset$ for every $1\le\ell\le Kn^{\xi_0+4\xi_{\min}}$, which implies that at least one relevant variable will be discovered within $Kn^{\xi_0+4\xi_{\min}}$ steps. This completes the proof.