
Variable screening in multivariate linear regression with high-dimensional covariates

Pages 241-253 | Received 01 Jan 2021, Accepted 09 Sep 2021, Published online: 06 Oct 2021

Abstract

We propose two variable selection methods for multivariate linear regression with high-dimensional covariates. The first method uses a multiple correlation coefficient to rapidly reduce the dimension of the relevant predictors to a moderate or low level. The second method extends the univariate forward regression of Wang [(2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516] in a unified way so that variable selection and model estimation are obtained simultaneously. We establish the sure screening property for both methods. Simulation and real data applications are presented to show the finite sample performance of the proposed methods in comparison with naive methods.

1. Introduction

High-dimensional multivariate regression has been widely applied in bioinformatics, chemometrics, and medical image analysis, where many of the response variables are highly correlated (Cai et al., Citation2013; Ferte et al., Citation2013; Jia et al., Citation2017; Peng et al., Citation2010; Smith & Fahrmeir, Citation2007). For instance, in genetic studies we are interested in the association between correlated phenotypes (involved in biological pathways) and genotypes, as genetic effects and their possible interactions have been recognized as an important component of the genetic architecture of each complex phenotype (Yi, Citation2010). For this kind of problem, the number of covariates or explanatory variables is much larger than the number of observations or samples. Traditional methods of subset selection and stepwise procedures become infeasible when confronted with such high dimensionality (Breiman, Citation1995).

Statistical methods and theories have been developed to address this problem through various approaches such as network-based regularization (C. Li & Li, Citation2008; Ren et al., Citation2017, Citation2019), graphical models (B. Li, Chun et al., Citation2012; Yin & Li, Citation2011), correlation-based screening (B. Li, Chun et al., Citation2012; Song et al., Citation2016) and the group lasso (Y. Li et al., Citation2015; J. Wang et al., Citation2019; Yang & Zou, Citation2015).

Variable selection methods for regression models with a univariate response have been proposed in the past. Some popular methods include the bridge regression (Frank & Friedman, Citation1993; Fu, Citation1998), LASSO (Tibshirani, Citation1996), SCAD (Fan & Li, Citation2001), LARS (Efron et al., Citation2004), elastic net (Zou & Hastie, Citation2005), adaptive LASSO (H. H. Zhang & Lu, Citation2007; Zou, Citation2006), and Dantzig selector (Candes & Tao, Citation2007; Y. Kong et al., Citation2016), among others. On the other hand, variable screening procedures have been developed to reduce the dimensionality from an ultrahigh dimension to a lower dimension which is smaller than the sample size (Fan & Lv, Citation2008; X. Kong et al., Citation2017; G. Li, Peng et al., Citation2012; H. Wang, Citation2009; Zhu et al., Citation2011).

For variable selection under multivariate regression models, one simple approach is to apply a variable selection method to the univariate regression of each response separately. Such an approach may produce sub-optimal results since it does not utilize the joint information among the responses (Breiman & Friedman, Citation1997; Kim et al., Citation2009). To improve the estimation, various attempts have been made. One approach is to use dimension reduction techniques such as reduced rank regression (Chen & Huang, Citation2012; He et al., Citation2018; Zhao et al., Citation2017) and sliced inverse regression (Setdji & Cook, Citation2004; N. Zhang et al., Citation2019). Another approach is to use a block-structured regularization method to select a subset of covariates that serve as predictors for all outcome variables (Obozinski et al., Citation2011; Peng et al., Citation2010; Turlach et al., Citation2005). The latter approach assumes that a covariate affects either all or none of the responses. However, this assumption may be too strong when each response variable is affected by a different set of predictors. Rothman et al. (Citation2010) proposed a penalized framework to estimate the multivariate regression coefficients and the covariance matrix simultaneously under an $\ell_1$ penalty. Lee and Liu (Citation2012) further improved Rothman et al.'s (Citation2010) work by using a weighted $\ell_1$ regularization. Cai et al. (Citation2013) proposed a method that first estimates the regression coefficients in a column-wise fashion with the Dantzig selector and then estimates the precision matrix by solving a constrained $\ell_1$ minimization problem.

In the high-dimensional setting, most of the aforementioned multivariate regression methods use regularization to estimate the regression coefficient matrix (Obozinski et al., Citation2011; Peng et al., Citation2010; Turlach et al., Citation2005). However, a well-chosen penalty requires an efficient exploration of the correlation structure of the responses. It has been reported that simultaneously estimating the covariance and selecting variables via joint optimization can be numerically unstable in high-dimensional cases (Deshpande et al., Citation2019; Pecanka et al., Citation2019; Ren et al., Citation2019).

In this study, we propose two methods in parallel for variable screening and variable selection, namely the multiple correlation coefficient (MCC) screening (Section 3) and the unified forward regression (UFR) (Section 4). The first method is for dimension reduction: it filters out covariates that have weak correlation with the response variables and reduces the feature space to a moderate or low dimension that covers the set of relevant predictors with probability tending to one. The second method is for variable selection: it uses an extended forward regression (FR) (H. Wang, Citation2009) to identify all relevant predictors consistently under mild conditions. MCC screens in all relevant predictors, whereas UFR delivers variable selection and model estimation simultaneously. We illustrate the finite sample performance of the proposed methods in comparison with naive methods by simulation (Section 5) and a real data application (Section 6). We conclude the paper in Section 7 and defer the technical proofs to the Appendix.

2. Notation and assumptions

Let $y=(y_1,y_2,\ldots,y_q)^\top$ denote the $q$-dimensional response vector of interest and let $x=(x_1,x_2,\ldots,x_p)^\top$ denote the $p$-dimensional vector of covariates or predictors. Denote the covariance matrices of $y$ and $x$ by $\Sigma_y$ and $\Sigma_x=(\sigma_{ij})$, respectively. Without loss of generality, assume that $E(x_k)=0$ and $\operatorname{var}(x_k)=1$ for $k=1,\ldots,p$ and that $E(y_j)=0$ for $j=1,\ldots,q$. In practice, these can be achieved by centring and standardization.

Consider the multivariate linear regression model (1) $y=B^\top x+\epsilon$, where $B$ is a $p\times q$ matrix of coefficients and $\epsilon$ is a random error vector independent of $x$. For $j=1,\ldots,q$ and $k=1,\ldots,p$, denote by $\beta_j$ the $j$th column vector of $B$ and by $\beta_{(k)}$ the $k$th row vector of $B$. If $\beta_{(k)}\neq 0$, $x_k$ is referred to as a relevant predictor.

Let $F=\{1,\ldots,p\}$ denote the full model of predictors and let $S=\{k:\beta_{(k)}\neq 0\}$ denote the true model. Denote the complement of $S$ by $S^c$, and the cardinalities of $F$ and $S$ by $|F|=p$ and $|S|=p_0$, respectively. Throughout, let $\|\cdot\|$ denote the Euclidean norm of a vector.

Let $\{(y_i,x_i): i=1,\ldots,n\}$ denote independent and identically distributed samples of $(y,x)$. Denote $X_{n\times p}=(x_1,\ldots,x_n)^\top$ and $Y_{n\times q}=(y_1,\ldots,y_n)^\top$. For $j=1,\ldots,q$, let $y_{(j)}$ denote the $j$th column of $Y$.

Assume that $x$ is high dimensional with $p$ being much larger than the sample size $n$ (in the sense of Cai & Lv, Citation2007). Assume that the response vector is associated with only a small portion of the predictors, i.e., $p_0/p$ is small and $p_0$ is $O(n)$ (Fan & Lv, Citation2008). This sparsity principle is frequently adopted and deemed useful in analysis.

3. Multiple correlation coefficient

We first propose to use a multiple correlation coefficient (MCC) to identify $S$. The multiple correlation coefficient between $y$ and $x_k$ is defined as $\rho_k=\max_{\alpha\in\mathbb{R}^q}\operatorname{corr}(\alpha^\top y,x_k)$ and its square can be further expressed as (2) $\rho_k^2=E(\gamma_k^\top y\,x_k)$, where $\gamma_k=\Sigma_y^{-1}E(yx_k)$ (Anderson, Citation2003, Section 12.2).

Given the standardized samples, we estimate $\rho_k^2$ by (3) $\hat\rho_k^2=\frac1n\sum_{i=1}^{n}\hat\gamma_k^\top y_i x_{ik}$, where $\hat\gamma_k=\big(\frac1n\sum_{i=1}^{n}y_i y_i^\top\big)^{-1}\frac1n\sum_{i=1}^{n}y_i x_{ik}$. Note that the computation of $\hat\rho_k^2$ is simple and fast through matrix algebra and does not involve any iteration. We then estimate $S$ by $\widehat S_{\mathrm{MCC}}=\{k:\hat\rho_k^2\ge\tau\}$, where $\tau$ is a threshold that determines the number of retained predictors. Here we adopt the threshold of Fan and Lv (Citation2008) by choosing $\tau=\hat\rho_{(p-d_n+1)}^2$, where $\hat\rho_{(1)}^2\le\cdots\le\hat\rho_{(p)}^2$ are the order statistics and $d_n=\lceil n/\log(n)\rceil$ ($\lceil\cdot\rceil$ is the ceiling function), so that the $d_n$ predictors with the largest values of $\hat\rho_k^2$ are retained. The naive correlation coefficient (NCC) method of Fan and Lv (Citation2008) estimates $S$ by $\widehat S_{\mathrm{NCC}}=\bigcup_{j=1}^{q}\{k:\hat\rho_{k,j}^2\ge\tau_j\}$, where $\hat\rho_{k,j}$ is the sample correlation coefficient between $y_j$ and $x_k$ and $\tau_j$ is determined in the same way as in MCC with respect to the $j$th response.
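For illustration, the following minimal Python sketch (our own, not the authors' code) carries out MCC screening on centred and standardized data matrices X (n by p) and Y (n by q); the function name mcc_screen is a placeholder of ours.

```python
import numpy as np

def mcc_screen(X, Y):
    """Minimal sketch of MCC screening: rank covariates by the squared
    multiple correlation between x_k and the response vector y, then
    keep the d_n = ceil(n / log(n)) covariates with the largest values."""
    n, p = X.shape
    Syy = (Y.T @ Y) / n                  # estimate of Sigma_y (data assumed centred)
    Syy_inv = np.linalg.inv(Syy)
    rho2 = np.empty(p)
    for k in range(p):
        syx = (Y.T @ X[:, k]) / n        # estimate of E(y x_k)
        rho2[k] = syx @ Syy_inv @ syx    # estimated squared multiple correlation
    d_n = int(np.ceil(n / np.log(n)))    # hard threshold of Fan and Lv (2008)
    keep = np.argsort(rho2)[::-1][:d_n]  # indices of the d_n largest rho^2_k
    return np.sort(keep), rho2
```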

We now show that the MCC-based screening procedure has the sure screening property (i.e., the probability of selecting all true relevant predictors tends to one) and reduces the dimensionality of predictors below the sample size.

We state some assumptions first.

Assumption 3.1

Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues of a positive definite matrix $A$, respectively. Assume that there exist two positive constants $\tau_{\min}<\tau_{\max}$ such that $2\tau_{\min}<\lambda_{\min}(\Sigma_y^{-1})\le\lambda_{\max}(\Sigma_y^{-1})<2^{-1}\tau_{\max}$ and $2\tau_{\min}<\lambda_{\min}(\Sigma_x)\le\lambda_{\max}(\Sigma_x)<2^{-1}\tau_{\max}$.

Assumption 3.2

Assume that (i) for $j=1,\ldots,q$, $\|\beta_j\|\le C_B$ for some positive constant $C_B$, and that (ii) $\beta_{\min}=\min_{k\in S}\min_{j}|\beta_{kj}|\ge\nu_B n^{-\xi_{\min}}$ for some positive constants $\xi_{\min}$ and $\nu_B$.

Assumption 3.3

Assume that there exist positive constants $0<\eta<4^{-1}$ and $K$ such that (i) $n^{-1}\log(pq)\le\eta$, and (ii) $E(e^{tx_k^2})\le K$ for $|t|\le\eta$ and all $k=1,\ldots,p$.

Assumption 3.1 requires the eigenvalues of $\Sigma_y^{-1}$ and $\Sigma_x$ to be bounded away from zero and infinity, so that the design is well behaved. Assumption 3.2 requires that the smallest nonzero regression coefficient not converge to zero too fast; otherwise it cannot be consistently identified (see Fan & Peng, Citation2004 for more discussion). Assumption 3.3 ensures an exponential convergence rate for arbitrary-order moments of $x$ and $\epsilon$ (Cai et al., Citation2011), which is superior to the polynomial-type counterpart (Ravikumar et al., Citation2010).

Theorem 3.1

Under Assumptions 3.1–3.3, if $\rho_k^2\ge\tau$ for all $k\in S$, then $P(S\subseteq\widehat S_{\mathrm{MCC}})\to1$ as $n\to\infty$.

Theorem 3.1 reveals that for a properly chosen threshold τ, the probability that MCC detects all relevant predictors tends to one.

4. Unified forward regression

In this section, we propose a unified forward regression (UFR) for variable selection. It extends Wang's (Citation2009) forward regression method to the multivariate response case.

Let $M=\{k_1,\ldots,k_t\}$ denote a generic subset of $F$ with $|M|=t$. Denote $x_{(M)}=(x_{k_1},\ldots,x_{k_t})^\top$ and let $X_{(M)}=(x_{1(M)},\ldots,x_{n(M)})^\top$ denote the submatrix of $X$ corresponding to $M$. We first describe a naive forward regression (NFR) method that combines the variables selected by applying Wang's (Citation2009) forward regression separately to the univariate regression of each response. The procedure is summarized as follows. Initially, set $S_{(j)}^{(0)}=\emptyset$ for $j=1,\ldots,q$. Perform forward regression with respect to the $j$th response by iterating the following two steps for $\ell=1,\ldots,n$.

  1. For every $k\in F\setminus S_{(j)}^{(\ell-1)}$, let $M_{kj}^{(\ell-1)}=S_{(j)}^{(\ell-1)}\cup\{k\}$. Compute the residual sum of squares $\mathrm{RSS}_{kj}^{(\ell-1)}=y_{(j)}^\top(I_n-\tilde H_{kj}^{(\ell-1)})y_{(j)}$, where $\tilde H_{kj}^{(\ell-1)}=X_{(M_{kj}^{(\ell-1)})}\big(X_{(M_{kj}^{(\ell-1)})}^\top X_{(M_{kj}^{(\ell-1)})}\big)^{-1}X_{(M_{kj}^{(\ell-1)})}^\top$. Let $a_{j\ell}=\arg\min_{k\in F\setminus S_{(j)}^{(\ell-1)}}\mathrm{RSS}_{kj}^{(\ell-1)}$.

  2. Update $S_{(j)}^{(\ell)}=S_{(j)}^{(\ell-1)}\cup\{a_{j\ell}\}$.

The solution path of NFR is obtained as $\{S_{\mathrm{NFR}}^{(\ell)}=\bigcup_{j=1}^{q}S_{(j)}^{(\ell)}:\ell=1,\ldots,n\}$.

Next, we propose the unified forward regression, which selects predictors by a modified forward regression algorithm that makes use of all response variables simultaneously. The procedure is modified from the previous one as follows. Initially, set $S^{(0)}=\emptyset$. Perform the modified forward regression by iterating the following two steps for $\ell=1,\ldots,n$.

  1. For every $k\in F\setminus S^{(\ell-1)}$, let $M_k^{(\ell-1)}=S^{(\ell-1)}\cup\{k\}$. Compute the residual sum of squares $\mathrm{RSS}_k^{(\ell-1)}=\operatorname{tr}\{Y^\top(I_n-\tilde H_k^{(\ell-1)})Y\}$, where $\tilde H_k^{(\ell-1)}=X_{(M_k^{(\ell-1)})}\big(X_{(M_k^{(\ell-1)})}^\top X_{(M_k^{(\ell-1)})}\big)^{-1}X_{(M_k^{(\ell-1)})}^\top$. Let $a_\ell=\arg\min_{k\in F\setminus S^{(\ell-1)}}\mathrm{RSS}_k^{(\ell-1)}$.

  2. Update $S^{(\ell)}=S^{(\ell-1)}\cup\{a_\ell\}$.

The solution path of UFR is obtained as $\{S_{\mathrm{UFR}}^{(\ell)}=S^{(\ell)}:\ell=1,\ldots,n\}$. Notice that both NFR and UFR terminate automatically after $n$ iterations. The UFR algorithm makes use of all response variables simultaneously through the trace operator, and its computational cost is roughly one $q$th of that of NFR.
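For illustration, a minimal Python sketch of the UFR iteration is given below; the function name ufr_path and the use of np.linalg.lstsq for the least squares fits are our own choices, not part of the paper. Restricting Y to a single column gives the univariate forward regression that NFR applies to each response separately.

```python
import numpy as np

def ufr_path(X, Y, n_steps=None):
    """Minimal sketch of unified forward regression (UFR).
    At each step, add the covariate that most reduces the trace of the
    residual sum of squares tr{Y'(I - H_M)Y} over the candidate set M."""
    n, p = X.shape
    n_steps = n if n_steps is None else min(n_steps, n, p)
    selected, path = [], []
    for _ in range(n_steps):
        best_k, best_rss = None, np.inf
        for k in range(p):
            if k in selected:
                continue
            XM = X[:, selected + [k]]
            # least squares fit of Y on the candidate model M = selected + {k}
            coef, *_ = np.linalg.lstsq(XM, Y, rcond=None)
            resid = Y - XM @ coef
            rss = np.trace(resid.T @ resid)
            if rss < best_rss:
                best_k, best_rss = k, rss
        selected = selected + [best_k]
        path.append(list(selected))   # S^(l) for l = 1, 2, ...
    return path
```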

We now show that the proposed UFR method also possesses the sure screening property. To facilitate the development of the theory, we add a few more assumptions.

Assumption 4.1

Assume that (i) $x$ follows an elliptically contoured distribution whose density admits the form $|\Sigma_x|^{-1/2}g\{(x-\mu)^\top\Sigma_x^{-1}(x-\mu)\}$ with $\mu=E(x)$ and $g(\cdot)>0$, denoted by $EC(\mu,\Sigma_x,g)$, and that (ii) the distribution of $\epsilon$ is normal.

Assumption 4.2

There exist positive constants $\xi$, $\xi_0$ and $\nu$ such that (i) $\log(p)\le\nu n^{\xi}$, (ii) $p_0\le\nu n^{\xi_0}$, and (iii) $\xi+6\xi_0+12\xi_{\min}<1$.

Assumption 4.3

The row vectors of $B$, i.e., $\beta_{(k)}$, $k=1,\ldots,p$, have the same 'all-or-nothing' structure; that is, the entries of $\beta_{(k)}$ are either all zero or all nonzero.

Usually, a normality assumption on $x$ is imposed to facilitate the theoretical development (Fan & Lv, Citation2008; H. Wang, Citation2009). In Assumption 4.1, we relax it to an elliptically contoured distribution and show in the Appendix that it suffices to obtain Lemma 1 of H. Wang (Citation2009). Assumption 4.1, together with Assumption 3.1, ensures the sparse Riesz condition (C. Zhang & Huang, Citation2008) needed to derive some key inequalities in the proof of Theorem 4.1. Assumption 4.2 is popular in the literature on ultra-high dimensional inference (Fan & Lv, Citation2008; H. Wang, Citation2009); it allows the dimension of the covariates to diverge to infinity at an exponential rate (Fan & Lv, Citation2008). Assumption 4.3 implies that all responses are associated with the same covariates (Turlach et al., Citation2005). It warrants the row-wise selection of $B$ by UFR, in contrast to the element-wise selection by NFR, which enables UFR to reach the sure screening property in fewer steps than NFR.

Define $K=2\tau_{\max}\nu C_B^{2}\tau_{\min}^{-2}\nu_B^{-4}$, where the factors are defined in Assumptions 3.1, 3.2 and 4.2. Applying Theorem 1 of H. Wang (Citation2009), we readily get $P\big(S\subseteq S_{\mathrm{NFR}}^{(qK\nu n^{2\xi_0+4\xi_{\min}})}\big)\to1$, i.e., NFR selects all relevant predictors with high probability after $qK\nu n^{2\xi_0+4\xi_{\min}}$ steps in the multivariate regression setting. The following theorem shows that UFR can do the job in one $q$th as many steps.

Theorem 4.1

Under Assumptions 3.1–4.3, $P\big(S\subseteq S_{\mathrm{UFR}}^{(K\nu n^{2\xi_0+4\xi_{\min}})}\big)\to1$ as $n\to\infty$.

We adopt the BIC criterion to select the best subset of variables from a solution path (Liang et al., Citation2012; H. Wang, Citation2009). Let (4) $\mathrm{BIC}(M)=\log\big[n^{-1}\operatorname{tr}\{Y^\top(I_n-H_{(M)})Y\}\big]+n^{-1}|M|(\log n+2\log p)$, where $H_{(M)}=X_{(M)}\{X_{(M)}^\top X_{(M)}\}^{-1}X_{(M)}^\top$. We then choose the subset of variables on the solution path that minimizes $\mathrm{BIC}(M)$. The selection consistency was shown by H. Wang (Citation2009) and Sofer et al. (Citation2014).
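The sketch below shows how criterion (4) could be evaluated along a solution path; bic and select_by_bic are illustrative names of ours, not the authors' implementation.

```python
import numpy as np

def bic(X, Y, model):
    """Sketch of criterion (4):
    BIC(M) = log[ tr{Y'(I - H_M)Y} / n ] + |M| (log n + 2 log p) / n."""
    n, p = X.shape
    model = list(model)
    if len(model) == 0:
        rss = np.trace(Y.T @ Y)
    else:
        XM = X[:, model]
        coef, *_ = np.linalg.lstsq(XM, Y, rcond=None)
        resid = Y - XM @ coef
        rss = np.trace(resid.T @ resid)
    return np.log(rss / n) + len(model) * (np.log(n) + 2 * np.log(p)) / n

def select_by_bic(X, Y, path):
    """Return the model on the solution path with the smallest BIC."""
    return min(path, key=lambda model: bic(X, Y, model))
```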

Note that the UFR method is consistent if $p=O(n^{\alpha})$ for some $\alpha>0$, while the MCC method works with $\log(p)=O(n^{\alpha})$. In this sense, the MCC method can handle higher-dimensional variable screening than the UFR method. Secondly, the UFR method is computationally more expensive than the MCC method, as the former involves $n-1$ additional rounds of matrix inversions compared with the latter. On the other hand, when $p$ and $n$ are of the same order, both MCC and UFR perform well in terms of coverage probability, and UFR performs better in yielding a parsimonious model with high specificity in terms of model size and correct fit (defined later), as seen in the simulations.

5. Simulation

We conduct numerical studies to investigate the finite sample performance of the proposed methods, i.e., MCC and UFR, in comparison with the naive correlation coefficient (NCC) method and the naive forward regression (NFR).

5.1. Models

Consider five models for generating the $p$-dimensional covariates $x$ in Table 1, which are adopted from Examples 1 and 2 of Fan and Lv (Citation2008), Example 1 of Tibshirani (Citation1996), and Examples 4 and 5 of H. Wang (Citation2009), respectively.

Table 1. Five models.

For Models 1 to 3, $x$ follows a multivariate normal distribution with zero mean vector and covariance matrix $\Sigma_x$ of identity, autoregressive and compound symmetry structure, respectively. In Model 4, $x$ is generated by $x_r=(z_r+w_r)/2$ for $r=1,\ldots,p_0$ and $x_r=(z_r+\sum_{s=1}^{p_0}z_s)/2$ for $r=p_0+1,\ldots,p$, where the $z_r$ and $w_r$ are independent standard normal variables (H. Wang, Citation2009). Note that Model 4 is a challenging case, as the correlation coefficients between the relevant predictors and the response variables are much smaller than those between the irrelevant predictors and the response variables. The details of $\Sigma_x$ are provided in Table 1. In Model 5, we generate the components of both $x$ and $\epsilon$ independently as $e-1$, where $e$ is an exponential random variable with parameter one. This model is used to examine the robustness of the proposed methods against departure from the normality assumption. Consider the number of predictors $p$ to be 1000, 5000, and 10,000, respectively, which are all much larger than the sample sizes considered in the five models. Recall that $p_0$ is the number of relevant predictors. Denote the first $p_0$ rows of $B$ by $B_0$. We generate independent entries of $B_0$ from the distributions given in the last column of Table 1, where $N(4\log(n)/n,1)$ is a normal random variable with mean $4\log(n)/n$ and variance 1, $\Gamma(2,1)$ denotes a gamma random variable with shape parameter 2 and scale parameter 1, and $\exp(9)$ is an exponential random variable with parameter 9. They are all independent of $x$. Set the remaining entries (the last $p-p_0$ rows) of $B$ to zero.
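As a concrete illustration, data under Model 1 (identity covariance for $x$ and normal entries for $B_0$) could be generated as in the following sketch; the function name and the default error scale are placeholders of ours, since $\sigma^2$ is in fact chosen from the target signal-to-noise ratio discussed next.

```python
import numpy as np

def generate_model1(n, p, q, p0, sigma=1.0, seed=0):
    """Sketch of data generation for Model 1: x ~ N(0, I_p), the first p0
    rows of B drawn from N(4*log(n)/n, 1), remaining rows zero, normal errors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                       # Sigma_x = I_p
    B = np.zeros((p, q))
    B[:p0, :] = rng.normal(4 * np.log(n) / n, 1.0, size=(p0, q))
    E = sigma * rng.standard_normal((n, q))
    Y = X @ B + E                                         # y_i = B' x_i + eps_i
    return X, Y, B
```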

For the multivariate response case, the signal-to-noise ratio is given by $R^2=\frac{\operatorname{tr}\{\operatorname{var}(B^\top x)\}}{\operatorname{tr}\{\operatorname{var}(y)\}}=\frac{\operatorname{tr}(B^\top\Sigma_x B)}{q\sigma^2+\operatorname{tr}(B^\top\Sigma_x B)}$. We choose the values of $\sigma^2$ such that the signal-to-noise ratios are 30%, 60%, and 90%, respectively.
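To make the choice of $\sigma^2$ explicit, solving $R^2=T/(q\sigma^2+T)$ with $T=\operatorname{tr}(B^\top\Sigma_x B)$ gives $\sigma^2=T(1-R^2)/(qR^2)$. The sketch below is our own rendering of this algebra, under the assumption that the error covariance is $\sigma^2 I_q$.

```python
import numpy as np

def noise_variance_for_r2(B, Sigma_x, r2):
    """Solve R^2 = T / (q * sigma^2 + T), T = tr(B' Sigma_x B),
    for sigma^2 given a target signal-to-noise ratio r2."""
    q = B.shape[1]
    T = np.trace(B.T @ Sigma_x @ B)
    return T * (1.0 - r2) / (q * r2)
```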

Throughout, set the number of replications N to be 1000.

5.2. Evaluation criteria

For MCC screening, we use Fan and Lv's (Citation2008) hard threshold method to retain the relevant predictors. For both NFR and UFR, we use the BIC criterion (4) to determine the relevant predictors.

We adopt five measures, described in Table 2, to evaluate the finite sample performance of the proposed methods: the model size (MS) is the number of selected predictors, the coverage probability (CP) measures how likely all the relevant predictors are to be identified, the percentage of correctly fitted models (CF) measures the capability of identifying the true model exactly, the correct zero (CZ) characterizes the capability of producing a sparse solution, and the incorrect zero (IZ) characterizes a method's under-fitting. Ideally, we wish a method to have MS close to $p_0$, CP, CF and CZ all close to 100%, and IZ close to zero.

Table 2. Measures for the finite sample performance of variable selection.

For $b=1,\ldots,N$, let $\widehat B^{(b)}$ denote the estimate of $B$ in the $b$th replication. The corresponding selected model is denoted by $\widehat S^{(b)}=\{k:\hat\beta_{(k)}^{(b)}\neq 0,\ k=1,\ldots,p\}$. The empirical MS is computed as $\mathrm{MS}=N^{-1}\sum_{b=1}^{N}|\widehat S^{(b)}|$, and the empirical values of the other measures are computed similarly.
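For concreteness, the empirical measures could be computed from the replicated selected sets as in the following sketch; this reflects our reading of the measures in Table 2, with illustrative function and variable names.

```python
import numpy as np

def selection_measures(selected_sets, true_set, p):
    """Average model size (MS), coverage probability (CP), correct fit (CF),
    correct zero (CZ) and incorrect zero (IZ) over replications, following
    our reading of the measures in Table 2."""
    S = set(true_set)
    ms = np.mean([len(set(s)) for s in selected_sets])
    cp = np.mean([S.issubset(s) for s in selected_sets])    # all relevant covered
    cf = np.mean([set(s) == S for s in selected_sets])      # exactly the true model
    cz = np.mean([1 - len(set(s) - S) / (p - len(S)) for s in selected_sets])  # irrelevant left out
    iz = np.mean([len(S - set(s)) / len(S) for s in selected_sets])            # relevant missed
    return {"MS": ms, "CP": cp, "CF": cf, "CZ": cz, "IZ": iz}
```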

5.3. Results

Tables 3–7 report the finite sample performance of the competing methods in terms of the measures given in Table 2 under various numbers of covariates $p$ and signal strengths $R^2$. We summarize the findings as follows. (i) The MCC method is uniformly superior to the NCC method, with larger coverage probability (CP) and better estimation of sparsity (larger CZ and smaller IZ), as expected. (ii) As we adopted the fixed threshold procedure for MCC and NCC, these two methods produce conservatively large coverage of predictors at the cost of a large model size. For the same reason, their percentage of incorrect zeros is larger than that of the two regression-based methods (UFR and NFR), and the resulting percentages of correctly fitted models for MCC and NCC are zero. (iii) When comparing UFR with NFR, UFR demonstrates its superiority over NFR uniformly in all five measures across all five models (including Model 5 with the non-normal distribution). This corroborates the advantage of UFR over NFR in utilizing the correlation within the responses. When comparing UFR with PWL, both methods perform comparably when the signal strength is as small as 30%; when the signal strength is as large as 60% or 90%, UFR outperforms PWL in all five measures in general. (iv) The UFR method performs worse than the MCC method in cases of ultra-high dimensional covariates, especially under low signal strength, as pointed out earlier. For instance, in Model 1 of Table 3, the coverage probability of UFR drops by 83%, while that of MCC drops by 33%, when the dimension of predictors $p$ increases from 1000 to 10,000 at the signal strength of 30%. (v) As for the impact of the signal strength, the percentage of incorrect zeros is higher under weak signal strength than under strong signal strength, which is consistent with the findings for the univariate case in H. Wang (Citation2009) and Y. Li et al. (Citation2017). As the signal strength increases (e.g., from 30% to 90%), the coverage probability (CP) and the percentage of correct fit (CF) increase significantly (e.g., from 61.9% to 98.3% and from 28.8% to 58.8%, respectively, with p = 5000) and the percentage of incorrect zeros (IZ) drops quickly (e.g., from 53.7% to 2.35% with p = 5000) for both NFR and UFR. (vi) To examine the impact of the sample size, Table 8 reports the performance of the proposed methods under Model 1 with the number of covariates $p$ fixed at 5000 and the sample size $n$ varying over 100, 200, and 400. The measures of model size (MS), coverage probability (CP), correct fit (CF) and incorrect zero (IZ) are sensitive to the sample size, and the improvement with larger samples is substantial. For instance, when the sample size increases from n = 100 to n = 200 with signal strength $R^2=60\%$, the CP increases from 52.2% to 80.4% on average and the percentage of incorrect zeros drops from 47.8% to 19.7% on average.

Table 3. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 1 in Table 1 with $(n,q,p_0)=(200,4,8)$.

Table 4. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 2 in Table 1 with $(n,q,p_0)=(75,5,3)$.

Table 5. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 3 in Table 1 with $(n,q,p_0)=(200,3,3)$.

Table 6. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 4 in Table 1 with $(n,q,p_0)=(300,2,5)$.

Table 7. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 5 in Table 1 with $(n,q,p_0)=(200,6,10)$.

Table 8. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various sample sizes ($n$) and signal-to-noise ratios ($R^2$) for Model 1 in Table 1 with $(p,q,p_0)=(5000,4,8)$.

In conclusion, the MCC method performs better when the dimension of the covariates is ultra-high ($\log(p)=O(n^{\alpha})$) relative to the sample size, and the UFR method outperforms the MCC method when the dimension of the covariates is of polynomial order ($p=O(n^{\alpha})$).

6. Real data application

We apply the proposed methods to a real data set on bone mineral density (BMD) (Reppe et al., Citation2010). The data were collected from 84 post-menopausal Caucasian women aged from 50 to 86. For each subject, there are two responses, namely the body mass index and the total hip z-score (a measure of bone strength in the hip), and 8649 gene expression levels from trans-iliacal bone biopsies serve as covariates. It is known that low bone mineral density is usually related to fragile bones, osteoporosis and a progressive reduction of bone strength, which leads to increased susceptibility to bone fractures (Cooper, Citation1997; Reppe et al., Citation2010). The goal of the study is to identify the genes that are related to BMD.

Table 9 reports the genes identified by the five competing methods. The MCC method identified 19 genes, which include all of the 13 genes identified by NFR except gene TNK2. The PWL method identified 12 genes, all of which were also identified by NFR except PAIP1. The UFR method found 10 significant genes, all of which are contained in the set identified by NFR.

Table 9. Selected genes for the BMD data.

To examine the quality of the variable selection of these methods, we compare the prediction mean square errors $E\|y-\widehat B_{(\widehat S)}^\top x_{(\widehat S)}\|^2$ obtained by the competing methods. To this end, we randomly split the data into a training set of 60 samples and a testing set of the remaining 24 samples. The average prediction mean square errors over 100 random splits for MCC, NCC, NFR and UFR are 273.7, 293.5, 271.0 and 241.6, respectively. Clearly, the UFR method is the winner. All eight genes (ACSL3, NIPSNAP3B, DLEU2, C1ORF61, DKK1, SOST, ABCA8, and AFFX-M27830-M-at) identified by Reppe et al. (Citation2010) were selected by the four competing methods. The UFR method discovered two more genes, RNF216 and PLIN5, and achieved the smallest prediction mean square error. Similar to the findings in the simulations, both MCC and NCC selected more genes than NFR and UFR, but with larger prediction errors.
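A minimal sketch of the random-split evaluation described above follows; the split sizes mirror the text (60 training, 24 testing samples), while select_fn stands in for any of the competing selection methods, and the least squares refit on the selected genes is our own choice.

```python
import numpy as np

def split_prediction_mse(X, Y, select_fn, n_train=60, n_splits=100, seed=0):
    """Average prediction mean square error over random training/testing splits.
    `select_fn(X, Y)` should return the indices of the selected covariates;
    prediction uses least squares on the selected covariates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        sel = list(select_fn(X[tr], Y[tr]))
        coef, *_ = np.linalg.lstsq(X[tr][:, sel], Y[tr], rcond=None)
        pred = X[te][:, sel] @ coef
        errors.append(np.mean(np.sum((Y[te] - pred) ** 2, axis=1)))
    return float(np.mean(errors))
```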

7. Conclusion

We propose two methods for variable screening in high-dimensional multivariate linear regression. The MCC method has the advantage of computational ease and provides fast variable screening that yields an accurate subset with dimension below the sample size. The proposed UFR method has the feature of discovering all relevant predictors consistently at nearly the same computational cost as the univariate forward regression. The performance of UFR is sensitive to the dimensionality and the signal strength. Although our theory assumes a Gaussian distribution for the random errors, the numerical study shows the robustness of the proposed methods against non-normality. It is of interest to investigate the problem under more general, non-homogeneous sparsity assumptions and under nonlinear models.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Anderson, T. W. (2003). An introduction to multivariate statistical analysis (3rd ed.). Wiley.
  • Bickel, P. J., & Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227. https://doi.org/10.1214/009053607000000758
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384. https://doi.org/10.1080/00401706.1995.10484371
  • Breiman, L., & Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1), 3–54. https://doi.org/10.1111/rssb.1997.59.issue-1
  • Cai, T., Li, H., Liu, W., & Xie, J. (2013). Covariate–adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100(1), 139–156. https://doi.org/10.1093/biomet/ass058
  • Cai, T., Liu, W., & Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607. https://doi.org/10.1198/jasa.2011.tm10155
  • Cai, T., & Lv, J. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2365–2369. https://doi.org/10.1214/009053607000000442
  • Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351. https://doi.org/10.1214/009053606000001523
  • Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545. https://doi.org/10.1080/01621459.2012.734178
  • Cooper, C. (1997). The crippling consequences of fractures and their impact on quality of life. American Journal of Medicine, 103(2), 12–19. https://doi.org/10.1016/S0002-9343(97)90022-X
  • Deshpande, S., Rockova, V., & George, E. (2019). Simultaneous variable and covariance selection with the multivariate Spike- and Slab lasso. Journal of Computational and Graphical Statistics, 28(4), 921–931. https://doi.org/10.1080/10618600.2019.1593179
  • Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499. https://doi.org/10.1214/009053604000000067
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B, 70(5), 849–911. https://doi.org/10.1111/rssb.2008.70.issue-5
  • Fan, J., & Peng, H. (2004). Non-concave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32(3), 928–961. https://doi.org/10.1214/009053604000000256
  • Fang, K.-T., Kotz, S., & Ng, K. W. (2018). Symmetric multivariate and related distributions. Chapman and Hall/CRC.
  • Ferte, C., Trister, A. D., Erich, H., & Bot, B. (2013). Impact of bioinformatic procedures in the development and translation of high-throughput molecular classifiers in oncology. Clinical Cancer Research, 19(16), 4315–4325. https://doi.org/10.1158/1078-0432.CCR-12-3937
  • Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135. https://doi.org/10.1080/00401706.1993.10485033
  • Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3), 397–416. https://doi.org/10.1080/10618600.1998.10474784
  • He, K., Lian, H., Ma, S., & Huang, J. Z. (2018). Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. Journal of the American Statistical Association, 113(522), 746–754. https://doi.org/10.1080/01621459.2017.1285774
  • Jia, B., Xu, S., Xiao, G., & Lambda, V. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics, 73(4), 1221–1230. https://doi.org/10.1111/biom.v73.4
  • Kim, S., Sohn, K., & Xing, E. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics (Oxford, England), 25(12), 204–212. https://doi.org/10.1093/bioinformatics/btp218
  • Kong, X., Liu, Z., Yao, Y., & Zhou, W. (2017). Sure screening by ranking the canonical correlations. Test, 26(1), 46–70. https://doi.org/10.1007/s11749-016-0497-z
  • Kong, Y., Zheng, Z., & Lv, J. (2016). The constrained Dantzig selector with enhanced consistency. Journal of Machine Learning Research, 17(123), 1–22.
  • Lee, W., & Liu, Y. (2012). Simultaneous multiple response regression and inverse covariate matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111, 241–255. https://doi.org/10.1016/j.jmva.2012.03.013
  • Li, B., Chun, H., & Zhao, H. (2012). Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association, 107(497), 152–167. https://doi.org/10.1080/01621459.2011.644498
  • Li, C., & Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics (Oxford, England), 24(9), 1175–1182. https://doi.org/10.1093/bioinformatics/btn081
  • Li, G., Peng, H., Zhang, J., & Zhu, L. (2012). Robust rank correlation based screening. Annals of Statistics, 40(3), 1846–1877. https://doi.org/10.1214/12-AOS1024
  • Li, Y., Li, G., Lian, H., & Tong, T. (2017). Profile forward regression screening for ultra-high dimensional semiparametric varying coefficient partially linear models. Journal of Multivariate Analysis, 155, 133–150. https://doi.org/10.1016/j.jmva.2016.12.006
  • Li, Y., Nan, B., & Zhu, J. (2015). Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2), 354–363. https://doi.org/10.1111/biom.v71.2
  • Liang, H., Wang, H., & Tsai, C.-L. (2012). Profiled forward regression for ultrahigh dimensional variable screening in semiparametric partially linear model. Statistica Sinica, 22(2), 531–554. https://doi.org/10.5705/ss.2010.134
  • Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39(1), 1–47. https://doi.org/10.1214/09-AOS776
  • Pecanka, J., van der Vaart, A. W., & Marianne, J. (2019). Modeling association between multivariate correlated outcomes and high-dimensional sparse covariates: The adaptive SVS method. Journal of Applied Statistics, 46(5), 893–913. https://doi.org/10.1080/02664763.2018.1523377
  • Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., & Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Annals of Applied Statistics, 4(1), 53–77. https://doi.org/10.1214/09-AOAS271
  • Ravikumar, P., Wainwright, M., & Lafferty, J. (2010). High-dimensional Ising model selection using l1 regularized logistic regression. Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691
  • Ren, J., Du, Y., Li, S., Ma, S., Jiang, Y., & Wu, C. (2019). Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genetic Epidemiology, 43(3), 276–291. https://doi.org/10.1002/gepi.2018.43.issue-3
  • Ren, J., He, T., Li, Y., Liu, S., Du, Y., Jiang, Y., & Wu, C. (2017). Network-based regularization for high dimensional SNP data in the case-control study of type 2 diabetes. BMC Genetics, 18(1), 44. https://doi.org/10.1186/s12863-017-0495-5
  • Reppe, S., Refvem, H., Gautvik, V. T., Olstad, O. K., Høvring, P. I., Reinholt, F. P., Holden, M., Frigessi, A., Jemtland, R., & Gautvik, K. M. (2010). Eight genes are highly associated with BMD variation in postmenopausal Caucasian women. Bone, 46(3), 604–612. https://doi.org/10.1016/j.bone.2009.11.007
  • Rothman, A. J., Levina, E., & Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4), 947–962. https://doi.org/10.1198/jcgs.2010.09188
  • Saulis, L., & Statulevicius, V. (1991). Limit theorems for large deviations (Vol. 73). Springer Science & Business Media.
  • Setdji, C. M., & Cook, R. D. (2004). K-means inverse regression. Technometrics, 46(4), 421–429. https://doi.org/10.1198/004017004000000437
  • Smith, M., & Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. Journal of the American Statistical Association, 102(478), 417–431. https://doi.org/10.1198/016214506000001031
  • Sofer, T., Dicker, L., & Lin, X. (2014). Variable selection for high dimensional multivariate outcomes. Statistica Sinica, 24(4), 1633–1654. https://doi.org/10.5705/ss.2013.019
  • Song, Y., Schreier, P. J., Ramirez, D., & Hasija, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing, 128, 449–458. https://doi.org/10.1016/j.sigpro.2016.05.020
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
  • Turlach, B., Venables, W., & Wright, S. (2005). Simultaneous variable selection. Technometrics, 47(3), 349–363. https://doi.org/10.1198/004017005000000139
  • Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516
  • Wang, J., Zhang, Z., & Ye, J. (2019). Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. Journal of Machine Learning Research, 20(163), 1–42.
  • Yang, Y., & Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129–1141. https://doi.org/10.1007/s11222-014-9498-5
  • Yi, N. (2010). Statistical analysis of genetic interactions. Genetics Research, 92(5–6), 443–459. https://doi.org/10.1017/S0016672310000595
  • Yin, J., & Li, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics, 5(4), 2630–2650. https://doi.org/10.1214/11-AOAS494
  • Zhang, C., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4), 1567–1594. https://doi.org/10.1214/07-AOS520
  • Zhang, H. H., & Lu, W. (2007). Adaptive lasso for Cox's proportional hazards model. Biometrika, 94(3), 691–703. https://doi.org/10.1093/biomet/asm037
  • Zhang, N., Yu, Z., & Wu, Q. (2019). Overlapping sliced inverse regression for dimension reduction. Analysis and Applications, 17(5), 715–736. https://doi.org/10.1142/S0219530519400013
  • Zhao, W., Lian, H., & Ma, S. (2017). Robust reduced–rank modeling via rank regression. Journal of Statistical Planning and Inference, 180, 1–12. https://doi.org/10.1016/j.jspi.2016.08.009
  • Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320. https://doi.org/10.1111/rssb.2005.67.issue-2

Appendix

A.1. Proof of Theorem 3.1

Let $C_K=\tau+\eta^{-1}K$ and $\vartheta=3\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}$. When $\rho_k^2\ge\tau$ for all $k\in S$, it suffices to show that under Assumptions 3.1–3.3,
(A1) $P\Big(\max_{1\le k\le p}|\hat\rho_k^2-\rho_k^2|\ge\vartheta\Big)=P\Big(\max_{1\le k\le p}\Big|\frac1n\sum_{i=1}^{n}\hat\gamma_k^\top y_i x_{ik}-E\gamma_k^\top yx_k\Big|\ge\vartheta\Big)<3p^{-\tau}q^{-\tau}.$
Observe that
(A2) $|\hat\rho_k^2-\rho_k^2|=\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}+\gamma_k^\top\Big(\frac1n\sum_{i=1}^{n}y_i x_{ik}-Eyx_k\Big)\Big|\le\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|+\Big|\gamma_k^\top\Big(\frac1n\sum_{i=1}^{n}y_i x_{ik}-Eyx_k\Big)\Big|\equiv A_1+A_2.$
First, we compute the rate of convergence for $A_2$. For $k=1,\ldots,p$, let $\omega_k=n^{-1}\sum_{i=1}^{n}\gamma_k^\top(y_i x_{ik}-Eyx_k)$, so that $A_2=|\omega_k|$. Let $t_1=\eta\sqrt{n\log(pq)}$. Applying the inequalities $P(|U|\ge V)\le e^{-t_1V}Ee^{t_1|U|}$ for any $V>0$ and $|e^{s}-1-s|\le s^2e^{\max(s,0)}$ for any $s\in\mathbb{R}$, we obtain
(A3) $P\big(A_2\ge\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\big)\le e^{-C_K\log(pq)}E\exp(t_1|\omega_k|)\le\exp\{-C_K\log(pq)+t_1^2E\omega_k^2\exp(t_1|\omega_k|)\}\le\exp\{-C_K\log(pq)+\eta^{-1}K\log(pq)\}=p^{-\tau}q^{-\tau}.$
Second, we compute the rate of convergence for $A_1$. Notice that
(A4) $\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\le\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|+\Big|\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|.$
Parallel to (A3), we have
(A5) $P\Big(\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\ge\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\Big)\le e^{-C_K\log(pq)}E\exp\Big(t_1\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\Big)\le\exp\Big\{-C_K\log(pq)+t_1^2E\Big(\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big)^2\exp\Big(t_1\Big|\hat\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\Big)\Big\}\le\exp\{-C_K\log(pq)+\eta^{-1}K\log(pq)\}=p^{-\tau}q^{-\tau}.$
Similarly to (A5),
(A6) $P\Big(\Big|\gamma_k^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\ge\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\Big)\le p^{-\tau}q^{-\tau}.$
Combining (A4), (A5) and (A6),
(A7) $P\Big(A_1=\Big|(\hat\gamma_k-\gamma_k)^\top\frac1n\sum_{i=1}^{n}y_i x_{ik}\Big|\ge2\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\Big)\le2p^{-\tau}q^{-\tau}.$
Finally, combining (A3) and (A7), we get $P\big(A_1+A_2\ge3\eta^{-1}C_K\sqrt{n^{-1}\log(pq)}\big)\le3p^{-\tau}q^{-\tau}$, which, coupled with (A2), implies (A1). This completes the proof.

A.2. A lemma used to prove Theorem 4.1

Lemma A.1

Suppose Assumptions 3.1, 4.1 and 4.2 hold. Let $\Sigma_{x(M)}$ denote the submatrix of $\Sigma_x$ corresponding to $M$, and let $\widehat\Sigma_{x(M)}$ and $\widehat\Sigma_x$ denote the corresponding sample estimators. Suppose $m=O(n^{2\xi_0+4\xi_{\min}})$, where $\xi_0$ and $\xi_{\min}$ are defined in Assumptions 4.2 and 3.2. Then, with probability tending to one, we have
(A8) $\tau_{\min}\le\min_{|M|\le m}\lambda_{\min}(\widehat\Sigma_{x(M)})\le\max_{|M|\le m}\lambda_{\max}(\widehat\Sigma_{x(M)})\le\tau_{\max}.$

Proof.

The proof is similar to that for Lemma 1 of H. Wang (Citation2009). Here, we relax the normality assumption of x to the elliptically contoured distribution.

First, for $i=1,\ldots,n$ and $j,k=1,\ldots,p$, let $U_i=(z_{ij}+z_{ik})/\sqrt{2(1+\rho_{jk})}$ and $V_i=(z_{ij}-z_{ik})/\sqrt{2(1-\rho_{jk})}$. By Assumption 4.1 and the additive property of elliptically contoured distributions (Fang et al., Citation2018), $(z_{ij},z_{ik})\sim EC_2(0,0,1,1,\rho_{jk},g)$, $U_i\overset{\text{i.i.d.}}{\sim}EC(0,1,g)$ and $V_i\overset{\text{i.i.d.}}{\sim}EC(0,1,g)$.

Second, observe that
$\sum_{i=1}^{n}(z_{ij}z_{ik}-\rho_{jk})=\frac14\Big[\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}-\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big].$
Then, following Lemma A.3 of Bickel and Levina (Citation2008), we have
(A9) $P(|\hat\sigma_{jk}-\sigma_{jk}|\ge v)=P\Big(\Big|\sum_{i=1}^{n}(x_{ij}x_{ik}-\sigma_{jk})\Big|\ge nv\Big)=P\Big(\Big|\sum_{i=1}^{n}(z_{ij}z_{ik}-\rho_{jk})\Big|\ge nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big)=P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}-\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big|\ge4nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]\le P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}\Big|+\Big|\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big|\ge4nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]\le P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}+z_{ik})^2-2(1+\rho_{jk})\}\Big|\ge2nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]+P\Big[\Big|\sum_{i=1}^{n}\{(z_{ij}-z_{ik})^2-2(1-\rho_{jk})\}\Big|\ge2nv(\sigma_{jj}\sigma_{kk})^{-1/2}\Big]=P\Big\{\Big|\sum_{i=1}^{n}(U_i^2-1)\Big|\ge nv(1+\rho_{jk})^{-1}(\sigma_{jj}\sigma_{kk})^{-1/2}\Big\}+P\Big\{\Big|\sum_{i=1}^{n}(V_i^2-1)\Big|\ge nv(1-\rho_{jk})^{-1}(\sigma_{jj}\sigma_{kk})^{-1/2}\Big\}.$
Further, for $i=1,\ldots,n$, let $W_i=U_i^2-1$ and $B_n^2=\sum_{i=1}^{n}\operatorname{var}(W_i)$. Observe that by Jensen's inequality there exist positive constants $c_1,\ldots,c_n$, $C_3$ and $C_4$ such that
$\big|\ln\{E\exp(\zeta W_i)\}\big|\le\frac{\zeta^2c_i^2}{2}\quad\text{for }|\zeta|<C_3\qquad\text{and}\qquad\varlimsup_{n\to\infty}\frac1{B_n^2}\sum_{i=1}^{n}c_i^2\le C_4,$
satisfying condition (P) of Saulis and Statulevicius (Citation1991). The same result holds when $W_i=V_i^2-1$. By Theorem 3.2 of Saulis and Statulevicius (Citation1991), the first and second terms of (A9) are bounded, respectively, by
$2\exp\Big\{-\frac{2nv^2}{(1+\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1+\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\}\quad\text{and}\quad2\exp\Big\{-\frac{2nv^2}{(1-\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1-\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\}.$
Therefore,
$P(|\hat\sigma_{jk}-\sigma_{jk}|\ge v)\le4\max\Big[\exp\Big\{-\frac{2nv^2}{(1+\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1+\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\},\ \exp\Big\{-\frac{2nv^2}{(1-\rho_{jk})^{2}\sigma_{jj}\sigma_{kk}+2v(1-\rho_{jk})(\sigma_{jj}\sigma_{kk})^{1/2}}\Big\}\Big].$

This, together with $\lambda_{\max}(\Sigma_x)<2^{-1}\tau_{\max}$, implies that
(A10) $P(|\hat\sigma_{jk}-\sigma_{jk}|\ge v)\le C_1\exp(-C_2nv^2)\quad\text{for }|v|\le\delta,$
where the positive constants $C_1$, $C_2$ and $\delta$ depend on $\tau_{\max}$ alone (Bickel & Levina, Citation2008, Lemma A.3).

The rest of the proof proceeds exactly as that of Lemma 1 of H. Wang (Citation2009).

A.3. Proof of Theorem 4.1

Our proof follows similar arguments as in the proof of Theorem 1 of H. Wang (Citation2009).

Assume that no relevant predictor has been discovered in the first $\ell$ iterations, i.e., $S\cap S^{(\ell)}=\emptyset$. We evaluate the probability that at least one relevant predictor is identified in the $(\ell+1)$th iteration or, equivalently, its complementary probability that the predictor selected in the $(\ell+1)$th iteration is still an irrelevant one.

Let $X_{(S)}=(x_{1(S)},\ldots,x_{n(S)})^\top$ denote the submatrix of $X$ corresponding to $S$, and let $B_{(S)}$ denote the coefficient matrix under the true model.

Denote
$H_{(S^{(\ell)})}=X_{(S^{(\ell)})}\{X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})}\}^{-1}X_{(S^{(\ell)})}^\top,\quad \tilde H_k^{(\ell)}=X_{(M_k^{(\ell)})}\big(X_{(M_k^{(\ell)})}^\top X_{(M_k^{(\ell)})}\big)^{-1}X_{(M_k^{(\ell)})}^\top,\quad x_k^{(\ell)}=(I_n-H_{(S^{(\ell)})})x_k,\quad H_k^{(\ell)}=\frac{x_k^{(\ell)}x_k^{(\ell)\top}}{\|x_k^{(\ell)}\|^2}.$
Observe that
(A11) $\Omega^{(\ell)}=\mathrm{RSS}(S^{(\ell)})-\mathrm{RSS}(S^{(\ell+1)})=\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})Y\}-\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell+1)})})Y\}=\operatorname{tr}\{Y^\top(H_{(S^{(\ell+1)})}-H_{(S^{(\ell)})})Y\}=\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})H_{a_{\ell+1}}^{(\ell)}H_{a_{\ell+1}}^{(\ell)}(I_n-H_{(S^{(\ell)})})Y\}.$
Assume $a_{\ell+1}\notin S$. We have
(A12) $\Omega^{(\ell)}\ge\max_{k\in S}\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})Y\}\ge\operatorname{tr}\{Y^\top(I_n-H_{(S^{(\ell)})})H_{\hat k}^{(\ell)}H_{\hat k}^{(\ell)}(I_n-H_{(S^{(\ell)})})Y\},$
where $\hat k=\arg\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})X_{(S)}B_{(S)}\}$. Further, observe that the last term of (A12) is no less than
(A13) $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top(I_n-H_{(S^{(\ell)})})H_{\hat k}^{(\ell)}H_{\hat k}^{(\ell)}(I_n-H_{(S^{(\ell)})})X_{(S)}B_{(S)}\}-\operatorname{tr}\{\epsilon^\top(I_n-H_{(S^{(\ell)})})H_{\hat k}^{(\ell)}H_{\hat k}^{(\ell)}(I_n-H_{(S^{(\ell)})})\epsilon\}\ge\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})X_{(S)}B_{(S)}\}-\max_{k\in S}\operatorname{tr}\{\epsilon^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})\epsilon\},$

where $\epsilon=(\epsilon_1,\ldots,\epsilon_n)^\top\in\mathbb{R}^{n\times q}$.

In what follows, we study the two terms in (EquationA13) separately.

Step 1: The first term of (A13). Define $Q_{(S^{(\ell)})}=I_n-H_{(S^{(\ell)})}$, so that $x_k^{(\ell)}=Q_{(S^{(\ell)})}x_k$. Then the first term in (A13) can be expressed as
(A14) $\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}H_k^{(\ell)}H_k^{(\ell)}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\max_{k\in S}\operatorname{tr}\Big\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}\frac{x_k^{(\ell)}x_k^{(\ell)\top}x_k^{(\ell)}x_k^{(\ell)\top}}{\|x_k^{(\ell)}\|^4}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\Big\}=\max_{k\in S}\operatorname{tr}\Big\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}\frac{x_k^{(\ell)}x_k^{(\ell)\top}}{\|x_k^{(\ell)}\|^2}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\Big\}.$
Define $k^{*}=\arg\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}$. Thus, the right-hand side of (A14) is no less than
(A15) $\|x_{k^{*}}^{(\ell)}\|^{-2}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_{k^{*}}^{(\ell)}x_{k^{*}}^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge\Big(\max_{k\in S}\|x_k^{(\ell)}\|^2\Big)^{-1}\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge\Big(\max_{k\in S}\|x_k\|^2\Big)^{-1}\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\},$
where the last inequality follows from the fact that $\|x_k\|\ge\|x_k^{(\ell)}\|$.

On the other hand, observe that
(A16) $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\sum_{j=1}^{q}\Big(\sum_{k\in S}\beta_{kj}x_k^\top Q_{(S^{(\ell)})}X_{(S)}\beta_j\Big)\le\sum_{j=1}^{q}\Big(\sum_{k\in S}\beta_{kj}^2\Big)^{1/2}\Big\{\sum_{k\in S}\big(x_k^\top Q_{(S^{(\ell)})}X_{(S)}\beta_j\big)^2\Big\}^{1/2}\le q\max_j\|\beta_j\|\Big[\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}x_k^{(\ell)}x_k^{(\ell)\top}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\Big]^{1/2}\sqrt{p_0}.$
Applying (A16) to (A15) and (A14), using the fact that $\max_{k\in S}\|x_k\|^2/n\le\tau_{\max}$ with probability tending to one, and by Assumptions 3.1 and 4.1 and Lemma 1 of H. Wang (Citation2009), we obtain
(A17) $\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}H_k^{(\ell)}H_k^{(\ell)}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge\frac{\big[\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\big]^2}{q^2n\tau_{\max}p_0\max_j\|\beta_j\|^2}.$
Define $\zeta_{(S^{(\ell)})}=(X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})})^{-1}X_{(S^{(\ell)})}^\top X_{(S)}B_{(S)}$. We have $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}=\operatorname{tr}\big(B_{(S)}^\top X_{(S)}^\top X_{(S)}B_{(S)}-\zeta_{(S^{(\ell)})}^\top X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})}\zeta_{(S^{(\ell)})}\big)$. Recall the assumption at the beginning of the proof that $S\cap S^{(\ell)}=\emptyset$. Then, by Assumptions 3.1, 3.2 and 4.1 and Lemma 1 of H. Wang (Citation2009), we get
(A18) $\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge qn\tau_{\min}\beta_{\min}^2$
with probability tending to one. Applying (A18) to (A17) and using Assumptions 3.2, 4.1 and 4.2, we obtain
(A19) $\max_{k\in S}\operatorname{tr}\{B_{(S)}^\top X_{(S)}^\top Q_{(S^{(\ell)})}H_k^{(\ell)}H_k^{(\ell)}Q_{(S^{(\ell)})}X_{(S)}B_{(S)}\}\ge n\tau_{\max}^{-1}p_0^{-1}\big(\max_j\|\beta_j\|\big)^{-2}\tau_{\min}^{2}\beta_{\min}^{4}\ge\tau_{\max}^{-1}\nu^{-1}C_B^{-2}\tau_{\min}^{2}\nu_B^{4}n^{1-\xi_0-4\xi_{\min}}.$
Step 2: The second term of (A13). Recall that $x_k^{(\ell)}=x_k-H_{(S^{(\ell)})}x_k=x_k-X_{(S^{(\ell)})}\theta_{k(S^{(\ell)})}$, where $\theta_{k(S^{(\ell)})}=(X_{(S^{(\ell)})}^\top X_{(S^{(\ell)})})^{-1}X_{(S^{(\ell)})}^\top x_k$. Then, by Assumptions 3.1 and 3.2 and Lemma 1 of H. Wang (Citation2009), we have $\|x_k^{(\ell)}\|^2\ge n\tau_{\min}$.

Moreover, since $x_k^{(\ell)}=(I_n-H_{(S^{(\ell)})})x_k$, we have
(A20) $\operatorname{tr}\{\epsilon^\top(I_n-H_{(S^{(\ell)})})H_k^{(\ell)}H_k^{(\ell)}(I_n-H_{(S^{(\ell)})})\epsilon\}=\|x_k^{(\ell)}\|^{-2}\operatorname{tr}\big(\epsilon^\top x_kx_k^\top\epsilon-\epsilon^\top H_{(S^{(\ell)})}x_kx_k^\top H_{(S^{(\ell)})}\epsilon\big)\le\tau_{\min}^{-1}n^{-1}\operatorname{tr}\big(\epsilon^\top x_kx_k^\top\epsilon-\epsilon^\top H_{(S^{(\ell)})}x_kx_k^\top H_{(S^{(\ell)})}\epsilon\big)=\tau_{\min}^{-1}n^{-1}\operatorname{tr}\big(\epsilon^\top Q_{(S^{(\ell)})}x_kx_k^\top Q_{(S^{(\ell)})}\epsilon\big)\le\tau_{\min}^{-1}n^{-1}\max_{k\in S}\max_{|M|\le m}\operatorname{tr}\big(\epsilon^\top Q_{(M)}x_kx_k^\top Q_{(M)}\epsilon\big)=\tau_{\min}^{-1}n^{-1}\max_{k\in S}\max_{|M|\le m}\sum_{j=1}^{q}\big(\epsilon_{(j)}^\top Q_{(M)}x_kx_k^\top Q_{(M)}\epsilon_{(j)}\big),$
where $m=K\nu n^{2\xi_0+4\xi_{\min}}$.

Notice that $x_k^\top Q_{(M)}\epsilon_{(j)}$ is a normal random variable with mean zero and variance $\|Q_{(M)}x_k\|^2\le\|x_k\|^2$. Thus, the right-hand side of (A20) is no greater than
(A21) $q\tau_{\min}^{-1}n^{-1}\max_{k\in S}\|x_k\|^2\max_{k\in S}\max_{|M|\le m}\chi_1^2,$
where $\chi_1^2$ stands for a chi-square random variable with one degree of freedom. By Assumptions 3.1 and 4.1 and Lemma 1 of H. Wang (Citation2009), we get that $n^{-1}\max_{k\in S}\|x_k\|^2\le\tau_{\max}$ with probability tending to one.

On the other hand, the total number of combinations of $k\in S$ and $|M|\le m$ is no more than $p^{m+2}$. Then, by Assumption 4.2, we get
$\max_{k\in S}\max_{|M|\le m}\chi_1^2\le2(m+2)\log(p)\le3K\nu n^{2\xi_0+4\xi_{\min}}\times\nu n^{\xi}=3K\nu^2n^{\xi+2\xi_0+4\xi_{\min}}$
with probability tending to one. Therefore, (A21) is bounded by
(A22) $q\tau_{\min}^{-1}\tau_{\max}3K\nu^2n^{\xi+2\xi_0+4\xi_{\min}-1}.$
Combining (A13), (A19) and (A22), we have
(A23) $n^{-1}\Omega^{(\ell)}\ge\tau_{\max}^{-1}\nu^{-1}C_B^{-2}\tau_{\min}^{2}\nu_B^{4}n^{-\xi_0-4\xi_{\min}}-q\tau_{\min}^{-1}\tau_{\max}3K\nu^2n^{\xi+2\xi_0+4\xi_{\min}-1}=\tau_{\max}^{-1}\nu^{-1}C_B^{-2}\tau_{\min}^{2}\nu_B^{4}n^{-\xi_0-4\xi_{\min}}\times\big(1-3Kq\tau_{\max}^{2}\nu^{3}C_B^{2}\tau_{\min}^{-3}\nu_B^{-4}n^{\xi+3\xi_0+8\xi_{\min}-1}\big)$
uniformly for every $\ell\le Kn^{\xi_0+4\xi_{\min}}$. Recall $K=2\tau_{\max}\nu C_B^{2}\tau_{\min}^{-2}\nu_B^{-4}$ defined in Section 4. Then, by Assumptions 4.2 and 4.3, we have
(A24) $n^{-1}\|Y\|_F^2\ge n^{-1}\sum_{\ell=1}^{Kn^{\xi_0+4\xi_{\min}}}\Omega^{(\ell)}\ge2\big(1-3Kq\tau_{\max}^{2}\nu^{3}C_B^{2}\tau_{\min}^{-3}\nu_B^{-4}n^{\xi+3\xi_0+8\xi_{\min}-1}\big)\to_{p}2,$
where $\|\cdot\|_F$ is the Frobenius norm.

Without loss of generality, we can assume $\operatorname{var}(y_{i1})+\cdots+\operatorname{var}(y_{iq})=1$, so that $n^{-1}\|Y\|_F^2\to_{p}1$. (Otherwise, we can standardize by letting $y_{ik}=y_{ik}/\sqrt{\operatorname{var}(y_{i1})+\cdots+\operatorname{var}(y_{iq})}$ for $i=1,\ldots,n$ and $k=1,\ldots,q$.) Thus, it is impossible to have $S\cap S^{(\ell)}=\emptyset$ for every $1\le\ell\le Kn^{\xi_0+4\xi_{\min}}$, which implies that at least one relevant variable will be discovered within $Kn^{\xi_0+4\xi_{\min}}$ steps. This completes the proof.