Abstract
We propose two variable selection methods for multivariate linear regression with high-dimensional covariates. The first method uses a multiple correlation coefficient to quickly reduce the dimension of the relevant predictors to a moderate or low level. The second method extends the univariate forward regression of Wang [(2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516] in a unified way such that variable selection and model estimation can be carried out simultaneously. We establish the sure screening property for both methods. Simulation and real data applications are presented to show the finite sample performance of the proposed methods in comparison with naive methods.
1. Introduction
High-dimensional multivariate regression has been widely applied in bioinformatics, chemometrics, and medical image analysis, where many of the response variables are highly correlated (Cai et al., Citation2013; Ferte et al., Citation2013; Jia et al., Citation2017; Peng et al., Citation2010; Smith & Fahrmeir, Citation2007). For instance, in genetic studies we are interested in the association between correlated phenotypes (involved in biological pathways) and genotypes, as genetic effects and their possible interactions have been recognized as an important component of the genetic architecture of complex phenotypes (Yi, Citation2010). For this kind of problem, the number of covariates or explanatory variables is much larger than the number of observations or samples. Traditional methods of subset selection and stepwise procedures become infeasible when confronted with high dimensionality (Breiman, Citation1995).
Statistical methods and theories have been developed to address this problem through various approaches such as network-based regularization methods (C. Li & Li, Citation2008; Ren et al., Citation2019, Citation2017), graphical models (B. Li, Chun et al., Citation2012; Yin & Li, Citation2011), correlation-based screening (B. Li, Chun et al., Citation2012; Song et al., Citation2016) and the group lasso (Y. Li et al., Citation2015; J. Wang et al., Citation2019; Yang & Zou, Citation2015).
Many variable selection methods have been proposed for regression models with a univariate response, including bridge regression (Frank & Friedman, Citation1993; Fu, Citation1998), the LASSO (Tibshirani, Citation1996), the SCAD (Fan & Li, Citation2001), LARS (Efron et al., Citation2004), the elastic net (Zou & Hastie, Citation2005), the adaptive LASSO (H. H. Zhang & Lu, Citation2007; Zou, Citation2006), and the Dantzig selector (Candes & Tao, Citation2007; Y. Kong et al., Citation2016), among others. In addition, variable screening procedures have been developed to reduce the dimensionality from an ultrahigh dimension to a lower dimension smaller than the sample size (Fan & Lv, Citation2008; X. Kong et al., Citation2017; G. Li, Peng et al., Citation2012; H. Wang, Citation2009; Zhu et al., Citation2011).
For variable selection under multivariate regression models, one simple approach is to apply a variable selection method to the univariate regression of each response separately. Such an approach may produce sub-optimal results since it does not utilize the joint information among the responses (Breiman & Friedman, Citation1997; Kim et al., Citation2009). To improve the estimation, various attempts have been made. One approach is to use dimension reduction techniques such as reduced rank regression (Chen & Huang, Citation2012; He et al., Citation2018; Zhao et al., Citation2017) and sliced inverse regression (Setdji & Cook, Citation2004; N. Zhang et al., Citation2019). Another approach is to use a block-structured regularization method to select a subset of covariates that serves as predictors for all outcome variables (Obozinski et al., Citation2011; Peng et al., Citation2010; Turlach et al., Citation2005). The latter approach assumes that a covariate affects either all or none of the responses; this assumption may be too strong when each response variable is affected by a different set of predictors. Rothman et al. (Citation2010) proposed a penalized framework to estimate the multivariate regression coefficients and the covariance matrix simultaneously. Lee and Liu (Citation2012) further improved Rothman et al.'s (Citation2010) work by using a weighted regularization. Cai et al. (Citation2013) proposed a method that first estimates the regression coefficients in a column-wise fashion with the Dantzig selector and then estimates the precision matrix by solving a constrained minimization problem.
In the high-dimensional setting, most of the aforementioned multivariate regression methods use regularization to estimate the regression coefficient matrix (Obozinski et al., Citation2011; Peng et al., Citation2010; Turlach et al., Citation2005). However, a well-chosen penalty requires an efficient exploration of the correlation structure of the responses, and it has been reported that simultaneously estimating the covariance and selecting variables via joint optimization can be numerically unstable in high-dimensional cases (Deshpande et al., Citation2019; Pecanka et al., Citation2019; Ren et al., Citation2019).
In this study, we propose two methods in parallel for variable screening and variable selection, namely the multiple correlation coefficient (MCC) screening (Section 3) and the unified forward regression (UFR) (Section 4). The first method is for dimension reduction: it filters out covariates that have weak correlation with the response variables and thereby reduces the feature space to a moderate or low dimension that covers the set of relevant predictors with high probability. The second method is for variable selection: it uses an extended forward regression (FR) (H. Wang, Citation2009) to identify all relevant predictors consistently under mild conditions. MCC screens and retains all relevant predictors, whereas UFR performs variable selection and model estimation simultaneously. We illustrate the finite sample performance of the proposed methods in comparison with naive methods by simulation (Section 5) and a real data application (Section 6). We conclude the paper in Section 7 and defer the technical proofs to the Appendix.
2. Notation and assumptions
Let $Y = (Y_1, \ldots, Y_q)^\top$ denote the q-dimensional response vector of interest. Let $X = (X_1, \ldots, X_p)^\top$ denote the p-dimensional vector of covariates or predictors. Denote the covariance matrices of $Y$ and $X$ by $\Sigma_Y$ and $\Sigma_X$, respectively. Without loss of generality, assume that $E(X_k) = 0$ and $\mathrm{Var}(X_k) = 1$ for $k = 1, \ldots, p$, and that $E(Y_j) = 0$ for $j = 1, \ldots, q$. In practice, these can be achieved by standardization and centralization.
Consider the multivariate linear regression model
$$ Y = B^\top X + \varepsilon, \qquad (1) $$
where B is a $p \times q$ matrix of coefficients and $\varepsilon$ is a q-dimensional random error vector which is independent of $X$. For $j = 1, \ldots, q$ and $k = 1, \ldots, p$, denote $B_j$ as the jth column vector of B and $b_k$ as the kth row vector of B. If $b_k \neq 0$, $X_k$ is referred to as a relevant predictor.
Let $F = \{1, \ldots, p\}$ denote the full model of predictors. Let $S = \{k \in F: b_k \neq 0\}$ denote the true model. Denote the complement of S by $S^c$. Denote the cardinalities of F and S as $|F| = p$ and $|S|$, respectively. Throughout, let $\|\cdot\|$ denote the Euclidean norm of a vector.
Let $\{(X_i, Y_i),\ i = 1, \ldots, n\}$ denote independent and identically distributed samples of $(X, Y)$. Denote $\mathbb{X} = (X_1, \ldots, X_n)^\top$ and $\mathbb{Y} = (Y_1, \ldots, Y_n)^\top$. For $j = 1, \ldots, q$, let $\mathbb{Y}_j$ denote the jth column of $\mathbb{Y}$.
Assume that the covariate vector is high dimensional with p being much larger than the sample size n (in the sense of Cai & Lv, Citation2007). Assume that the response vector is associated with only a small portion of the predictors, i.e., the number of relevant predictors $|S|$ is small (Fan & Lv, Citation2008). This sparsity principle is frequently adopted and deemed useful in analysis.
3. Multiple correlation coefficient
We first propose to use a multiple correlation coefficient (MCC) to identify S. It is known that the multiple correlation coefficient between $X_k$ and $Y$ is defined as the maximum correlation between $X_k$ and linear combinations of $Y$, and its square can be further expressed as
$$ \rho_k^2 = \sigma_k^\top \Sigma_Y^{-1} \sigma_k, \qquad (2) $$
where $\sigma_k = \mathrm{Cov}(X_k, Y)$ (Anderson, Citation2003, Section 12.2).
Given the standardized samples, we estimate $\rho_k^2$ by
$$ \hat{\rho}_k^2 = \hat{\sigma}_k^\top \hat{\Sigma}_Y^{-1} \hat{\sigma}_k, \qquad (3) $$
where $\hat{\sigma}_k = n^{-1}\sum_{i=1}^{n} X_{ik} Y_i$ and $\hat{\Sigma}_Y = n^{-1}\sum_{i=1}^{n} Y_i Y_i^\top$. Note that the computation of $\hat{\rho}_k^2$ is simple and fast through matrix algebra and does not involve any iteration. Then, we estimate S by $\hat{S}_\tau = \{k \in F: \hat{\rho}_k^2 \geq \tau\}$, where τ is the threshold which determines the size of the set of estimated predictors. Here we adopt the threshold of Fan and Lv (Citation2008) by choosing $\tau = \hat{\rho}_{(d)}^2$, where $\hat{\rho}_{(1)}^2 \geq \cdots \geq \hat{\rho}_{(p)}^2$ are the order statistics and $d = \lceil n/\log n \rceil$ ($\lceil\cdot\rceil$ is the ceiling function), so that the d predictors with the largest values of $\hat{\rho}_k^2$ are retained. The naive correlation coefficient (NCC) method of Fan and Lv (Citation2008) estimates S by $\bigcup_{j=1}^{q}\{k \in F: |\hat{r}_{kj}| \geq \tau_j\}$, where $\hat{r}_{kj}$ is the sample correlation coefficient between $X_k$ and $Y_j$, and $\tau_j$ is determined in the same way as in MCC with respect to the jth response.
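To make the screening step concrete, the sketch below computes the estimated squared multiple correlation coefficient for every predictor and keeps the top-ranked ones. It is a minimal illustration following our reading of (2)–(3); the function name, the default cutoff d = ⌈n/log n⌉ and all implementation details are ours, not the paper's.

```python
import numpy as np

def mcc_screen(X, Y, d=None):
    """Screen predictors by the squared multiple correlation coefficient (MCC).

    X : (n, p) matrix of covariates, Y : (n, q) matrix of responses.
    Returns the indices of the d top-ranked predictors.
    """
    n, p = X.shape
    # Standardize covariates and centre responses, as assumed in Section 2.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = Y - Y.mean(axis=0)
    if d is None:
        d = int(np.ceil(n / np.log(n)))      # hard-threshold size (an assumption)
    Sigma_Y = Y.T @ Y / n                    # q x q sample covariance of the responses
    C = X.T @ Y / n                          # p x q sample cross-covariances sigma_k
    # rho2[k] = C[k] Sigma_Y^{-1} C[k]^T, computed for all k at once.
    rho2 = np.einsum('kj,kj->k', C, np.linalg.solve(Sigma_Y, C.T).T)
    return np.argsort(rho2)[::-1][:d]        # keep the d largest squared MCC values
```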
We now show that the MCC-based screening procedure has the sure screening property (i.e., the probability of selecting all true relevant predictors tends to one) and reduces the dimensionality of predictors below the sample size.
We state some assumptions first.
Assumption 3.1
Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues of a positive definite matrix A, respectively. Assume that there exist two positive constants $c_1 \leq c_2$ such that $c_1 \leq \lambda_{\min}(\Sigma_X)$ and $\lambda_{\max}(\Sigma_X) \leq c_2$.
Assumption 3.2
Assume that (i) for ,
for some positive constant
and that (ii) for
,
for some positive constants
and
.
Assumption 3.3
Assume that there exist positive constants , K such that (i)
, and (ii)
for
and all
.
Assumption 3.1 requires the covariance matrix of X to be well behaved. Assumption 3.2 requires that the smallest nonzero regression coefficient not converge to zero too fast; otherwise, it cannot be consistently identified (see Fan & Peng, Citation2004 for more discussion). Assumption 3.3 ensures the exponential convergence rate of arbitrary-order moments (Cai et al., Citation2011), which is superior to the polynomial-type counterpart (Ravikumar et al., Citation2010).
Theorem 3.1
Under Assumptions 3.1–3.3, if the threshold τ is properly chosen, then $P(S \subseteq \hat{S}_\tau) \to 1$ as $n \to \infty$.
Theorem 3.1 reveals that for a properly chosen threshold τ, the probability that MCC detects all relevant predictors tends to one.
4. Unified forward regression
In this section, we propose a unified forward regression (UFR) for variable selection. It extends Wang's (Citation2009) forward regression method to the multivariate response case.
Let M denote a generic subset of F with cardinality $|M|$. Denote $\mathbb{X}_M$ as the submatrix of $\mathbb{X}$ whose columns correspond to M, and write $H_M = \mathbb{X}_M(\mathbb{X}_M^\top \mathbb{X}_M)^{-1}\mathbb{X}_M^\top$ for the associated projection matrix. We first describe a naive forward regression (NFR) method that combines the selected variables obtained by repeatedly applying Wang's (Citation2009) forward regression method to the univariate regression of each response. The procedure is summarized as follows. Initially, set $M_j^{(0)} = \emptyset$ for $j = 1, \ldots, q$. Perform forward regression with respect to the jth response by iterating the following two steps for $\ell = 0, 1, \ldots, n-1$.
1. For every $k \in F \setminus M_j^{(\ell)}$, let $M_j^{(\ell, k)} = M_j^{(\ell)} \cup \{k\}$. Compute the sum of squared residuals $\mathrm{RSS}_j^{(\ell, k)} = \mathbb{Y}_j^\top (I_n - H_{M_j^{(\ell, k)}}) \mathbb{Y}_j$, where $I_n$ is the n-dimensional identity matrix. Let $a_j^{(\ell)} = \arg\min_{k \in F \setminus M_j^{(\ell)}} \mathrm{RSS}_j^{(\ell, k)}$.
2. Update $M_j^{(\ell+1)} = M_j^{(\ell)} \cup \{a_j^{(\ell)}\}$.
The solution path of NFR is obtained by $\hat{S}_{\mathrm{NFR}}^{(\ell)} = \bigcup_{j=1}^{q} M_j^{(\ell)}$, $\ell = 1, \ldots, n$.
Next, we propose the unified forward regression (UFR) to select predictors by applying a modified forward regression algorithm that makes use of all response variables simultaneously. The procedure is modified from the previous one as follows. Initially, set $M^{(0)} = \emptyset$. Perform a modified forward regression by iterating the following two steps for $\ell = 0, 1, \ldots, n-1$.
1. For every $k \in F \setminus M^{(\ell)}$, let $M^{(\ell, k)} = M^{(\ell)} \cup \{k\}$. Compute the sum of squared residuals $\mathrm{RSS}^{(\ell, k)} = \mathrm{tr}\{\mathbb{Y}^\top (I_n - H_{M^{(\ell, k)}}) \mathbb{Y}\}$, where $\mathrm{tr}(\cdot)$ denotes the trace operator. Let $a^{(\ell)} = \arg\min_{k \in F \setminus M^{(\ell)}} \mathrm{RSS}^{(\ell, k)}$.
2. Update $M^{(\ell+1)} = M^{(\ell)} \cup \{a^{(\ell)}\}$.
The solution path of UFR is obtained by $\hat{S}_{\mathrm{UFR}}^{(\ell)} = M^{(\ell)}$, $\ell = 1, \ldots, n$. Notice that both NFR and UFR terminate automatically after n iterations. It is seen that the UFR algorithm makes use of all response variables simultaneously through the trace operator, and its computational cost is roughly one qth of that of NFR.
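The following sketch illustrates the UFR update as described above: at each step it adds the predictor that most reduces the trace of the residual sum-of-squares matrix. It is a naive implementation that refits by least squares for every candidate, shown for clarity only; the function and variable names are illustrative.

```python
import numpy as np

def ufr_path(X, Y, max_steps=None):
    """Unified forward regression: greedy selection minimizing tr{Y'(I - H_M)Y}.

    X : (n, p) covariates, Y : (n, q) responses. Returns the ordered solution path.
    """
    n, p = X.shape
    max_steps = min(n, p) if max_steps is None else max_steps
    selected, remaining = [], list(range(p))
    for _ in range(max_steps):
        best_k, best_rss = None, np.inf
        for k in remaining:
            Xm = X[:, selected + [k]]
            # Residual of Y after projecting onto the column space of Xm.
            coef, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
            resid = Y - Xm @ coef
            rss = np.trace(resid.T @ resid)   # the trace couples all q responses
            if rss < best_rss:
                best_k, best_rss = k, rss
        selected.append(best_k)
        remaining.remove(best_k)
    return selected
```

NFR corresponds to running the same loop separately on each column of Y (replacing the trace by that column's residual sum of squares) and taking the union of the per-response paths.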
We show that the proposed UFR method also possesses the sure screening property. To facilitate the development of the theory, we need a few more assumptions.
Assumption 4.1
Assume that (i) follows elliptically contoured distribution, whose density admits the form
with
and
, denoted by
, and that (ii) the distribution of
is normal.
Assumption 4.2
There exist positive constants ξ, and ν such that (i)
, (ii)
, and (iii)
.
Assumption 4.3
The row vectors of B, i.e., $b_k$, $k = 1, \ldots, p$, have the same 'all-or-nothing' structure, i.e., the entries of each $b_k$ are either all zero or all nonzero.
Usually, the normality assumption of the covariates is imposed to facilitate theory development (Fan & Lv, Citation2008; H. Wang, Citation2009). In Assumption 4.1, we relax it to the elliptically contoured distribution and show in the Appendix that it suffices to obtain Lemma 1 of H. Wang (Citation2009). Assumption 4.1, together with Assumption 3.1, ensures the sparse Riesz condition (C. Zhang & Huang, Citation2008) used to derive some key inequalities in the proof of Theorem 4.1. Assumption 4.2 is commonly assumed in the literature on ultra-high dimensional inference (Fan & Lv, Citation2008; H. Wang, Citation2009); it implies that the dimension of the covariates may diverge to infinity at an exponential rate (Fan & Lv, Citation2008). Assumption 4.3 implies that all responses are associated with the same covariates (Turlach et al., Citation2005). It warrants the row-wise selection of B by UFR, in contrast to the element-wise selection by NFR, which enables UFR to achieve the sure screening property in fewer steps than NFR.
Define , where the factors are defined in Assumptions 3.1, 3.2 and 4.2. Applying Theorem 1 of H. Wang (Citation2009), we can readily get , i.e., NFR selects all relevant predictors with high probability after steps in the multivariate regression setting. The following theorem shows that UFR can do the same job in roughly one qth as many steps.
Theorem 4.1
Under Assumptions 3.1–4.3, as
.
We adopt the BIC criterion to select the best subset of variables from the solution path (Liang et al., Citation2012; H. Wang, Citation2009). Let the BIC of a candidate model M be defined by
(4)
We then choose the subset of variables from the solution path which minimizes (4). The selection consistency was shown by H. Wang (Citation2009) and Sofer et al. (Citation2014).
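As a rough illustration of how a BIC-type criterion is applied along the solution path, the sketch below scores each nested model and returns the minimizer. Since Equation (4) is not reproduced in this extract, the criterion coded here is only a plausible stand-in in the spirit of H. Wang (2009), not the paper's exact formula.

```python
import numpy as np

def bic_select(X, Y, path):
    """Pick the model on the solution path minimizing a BIC-type criterion.

    `path` is the ordered list of selected predictor indices (e.g. from ufr_path).
    The criterion below is a placeholder for Equation (4) of the paper.
    """
    n, q = Y.shape
    best_m, best_bic = 0, np.inf
    for m in range(1, len(path) + 1):
        Xm = X[:, path[:m]]
        coef, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
        resid = Y - Xm @ coef
        sigma2 = np.trace(resid.T @ resid) / (n * q)   # average residual variance
        bic = np.log(sigma2) + m * (np.log(n) + 2 * np.log(X.shape[1])) / n
        if bic < best_bic:
            best_m, best_bic = m, bic
    return path[:best_m]
```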
Note that the UFR method is consistent when the dimension p grows at a polynomial rate of the sample size, while the MCC method also works in the ultra-high dimensional case. In this sense, the MCC method can handle higher dimensional variable screening than the UFR method. Second, the UFR method is computationally more expensive than the MCC method, as the former involves n−1 more matrix inversions than the latter. On the other hand, when p and n are of the same order, both MCC and UFR perform well in terms of coverage probability, and UFR performs better in yielding a parsimonious model with high specificity in terms of model size and correct fit (defined later), as seen in the simulation.
5. Simulation
We conduct numerical studies to investigate the finite sample performance of the proposed methods, i.e., MCC and UFR, in comparison with the naive correlation coefficient (NCC) method and the naive forward regression (NFR).
5.1. Models
Consider five models for generating the p-dimensional covariates, listed in Table 1; they are adopted from Examples 1 and 2 of Fan and Lv (Citation2008), Example 1 of Tibshirani (Citation1996), and Examples 4 and 5 of H. Wang (Citation2009), respectively.
Table 1. Five models.
For models 1 to 3, the covariate vector follows a multivariate normal distribution with zero mean vector and covariance matrix of identity, autoregressive and compound-symmetry structure, respectively. In model 4, the relevant and irrelevant predictors are generated from different combinations of independent standard normal variables (H. Wang, Citation2009). Note that model 4 is a challenging case because the correlation coefficients between the relevant predictors and the response variables are much smaller than those between the irrelevant predictors and the response variables. The details are provided in Table 1. In model 5, we generate the independent components of both the covariates and the errors as e−1, where e is an exponential random variable with parameter one. This model is used to examine the robustness of the proposed methods against departure from the normality assumption. We consider the number of predictors p to be 1000, 5000, and 10,000, which are all much larger than the sample sizes considered in the five models. Recall that $|S|$ is the number of relevant predictors. Denote the first $|S|$ rows of B by $B_S$. We generate independent entries of $B_S$ from the distributions given in the last column of Table 1, where the normal distribution has variance 1, the gamma distribution has shape parameter 2 and scale parameter 1, and the exponential distribution has parameter 9. All of these are generated independently of the covariates. The remaining entries (the last $p - |S|$ rows) of B are set to zero.
For the multivariate response case, the signal-to-noise ratio is determined jointly by the coefficient matrix B and the error variance. We chose the error variances such that the signal-to-noise ratio takes three levels, ranging from weak to strong.
Throughout, set the number of replications N to be 1000.
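For reference, a minimal sketch of one replication under Model 1 is given below, assuming independent standard normal covariates (identity covariance) and Gaussian errors; the response dimension, the number of relevant predictors and the coefficient distribution are placeholders rather than the exact settings of Table 1.

```python
import numpy as np

def simulate_model1(n=100, p=1000, q=3, s=5, sigma=1.0, rng=None):
    """Generate one replication of a sparse multivariate regression data set.

    Model 1 uses independent standard normal covariates (identity covariance).
    The sizes q, s and the normal coefficients are illustrative placeholders.
    """
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))                 # covariates, identity covariance
    B = np.zeros((p, q))
    B[:s, :] = rng.normal(loc=2.0, scale=1.0, size=(s, q))   # first s rows nonzero
    E = sigma * rng.standard_normal((n, q))         # Gaussian errors
    Y = X @ B + E
    return X, Y, B
```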
5.2. Evaluation criteria
For MCC screening, we use Fan and Lv's (Citation2008) hard threshold method to retain the relevant predictors. For both NFR and UFR, we use the BIC criterion (4) to determine the relevant predictors.
We adopt five measures as described in Table 2 to evaluate the finite sample performance of the proposed methods, where the model size (MS) is the number of selected predictors, the coverage probability (CP) measures how likely all the relevant predictors are identified, the percentage of correctly fitted models (CF) measures the capability of identifying the true model exactly, the correct zero (CZ) characterizes the capability of producing a sparse solution, and the incorrect zero (IZ) characterizes the method's under-fitting. Ideally, we wish a method to have MS close to $|S|$, CP, CF and CZ all close to 100%, and IZ close to zero.
Table 2. Measures for the finite sample performance of variable selection.
For $b = 1, \ldots, N$, let $\hat{B}^{(b)}$ denote the estimate of B in the bth replication, and denote the corresponding selected model by $\hat{S}^{(b)}$. The empirical MS is computed as $N^{-1}\sum_{b=1}^{N} |\hat{S}^{(b)}|$, and the empirical values of the other measures are computed similarly.
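For clarity, the sketch below computes the five measures from the true support and the supports selected across replications, following our verbal reading of Table 2; the exact definitions used in the paper may differ in minor details.

```python
import numpy as np

def selection_measures(true_set, selected_sets, p):
    """Compute MS, CP, CF, CZ and IZ over N replications.

    true_set : indices of the relevant predictors; selected_sets : list of index sets.
    """
    true_set = set(true_set)
    ms, cp, cf, cz, iz = [], [], [], [], []
    for sel in map(set, selected_sets):
        ms.append(len(sel))                                        # model size
        cp.append(true_set <= sel)                                 # all relevant covered?
        cf.append(sel == true_set)                                 # exactly the true model?
        irrelevant = p - len(true_set)
        cz.append((irrelevant - len(sel - true_set)) / irrelevant) # correct zeros
        iz.append(len(true_set - sel) / len(true_set))             # incorrect zeros
    return {name: float(np.mean(vals)) for name, vals in
            zip(['MS', 'CP', 'CF', 'CZ', 'IZ'], [ms, cp, cf, cz, iz])}
```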
5.3. Results
Tables 3–7 report the finite sample performance of the four competing methods in terms of the measures given in Table 2 under various numbers of covariates p and signal strengths. We summarize the findings as follows. (i) The MCC method is uniformly superior to the NCC method, with larger coverage probability (CP) and better estimation of sparsity (larger CZ and smaller IZ), as expected. (ii) As we adopted the fixed threshold procedure for MCC and NCC, these two methods produce conservatively large coverage of predictors at the cost of a large model size. For the same reason, their percentage of incorrect zeros is larger than that of the two regression-based methods (UFR and NFR), so the resulting percentages of correctly fitted models for MCC and NCC are zero. (iii) When comparing UFR with NFR, UFR demonstrates its superiority over NFR uniformly in all five measures across all five models (including Model 5 with the non-normal distribution). This corroborates the advantage of UFR in utilizing the correlation among the responses. When comparing UFR with PWL, the two methods perform comparably at the weakest signal strength, while at the larger signal strengths UFR outperforms PWL in all five measures in general. (iv) The UFR method performs worse than the MCC method in cases of ultra-high dimensional covariates, especially under lower signal strength, as pointed out earlier. For instance, in Model 1 (Table 3), the coverage probability of UFR drops much more than that of MCC when the dimension of predictors p increases from 1000 to 10,000 under low signal strength. (v) As for the impact of the signal strength, the percentage of incorrect zeros is higher under weak signal strength than under strong signal strength, which is consistent with the findings for the univariate case in H. Wang (Citation2009) and Y. Li et al. (Citation2017). However, as the signal strength increases, the coverage probability (CP) and the percentage of correct fit (CF) increase significantly and the percentage of incorrect zeros (IZ) drops quickly for both NFR and UFR (see, e.g., the results with p = 5000). (vi) To examine the impact of the sample size, Table 8 reports the performance of the proposed methods under Model 1 with the number of covariates p fixed at 5000 and the sample size n varying over 100, 200, and 400. The measures of model size (MS), coverage probability (CP), percentage of correct fit (CF) and percentage of incorrect zeros (IZ) are sensitive to the sample size, and the improvement with larger samples is substantial. For instance, when the sample size increases from n = 100 to n = 200 at the weakest signal strength, the CP increases markedly and the percentage of incorrect zeros drops markedly on average.
Table 3. Five measures of variable selection performance (defined in Table 2) obtained by the four competing methods under various numbers of covariates (p) and signal-to-noise ratios for Model 1 in Table 1.
Table 4. Five measures of variable selection performance (defined in Table 2) obtained by the four competing methods under various numbers of covariates (p) and signal-to-noise ratios for Model 2 in Table 1.
Table 5. Five measures of variable selection performance (defined in Table 2) obtained by the four competing methods under various numbers of covariates (p) and signal-to-noise ratios for Model 3 in Table 1.
Table 6. Five measures of variable selection performance (defined in Table 2) obtained by the four competing methods under various numbers of covariates (p) and signal-to-noise ratios for Model 4 in Table 1.
Table 7. Five measures of variable selection performance (defined in Table 2) obtained by the four competing methods under various numbers of covariates (p) and signal-to-noise ratios for Model 5 in Table 1.
Table 8. Five measures of variable selection performance (defined in Table 2) obtained by the four competing methods under various sample sizes (n) and signal-to-noise ratios for Model 1 in Table 1.
In conclusion, the MCC method performs better when the dimension of covariates is ultra-high relative to the sample size, and the UFR method outperforms the MCC method when the dimension of covariates is of polynomial order in the sample size.
6. Real data application
We apply the proposed methods to a real data set on bone mineral density (BMD) (Reppe et al., Citation2010). The data were collected from 84 post-menopausal Caucasian women aged from 50 to 86. For each subject, there are two responses, namely the body mass index and the total hip z-score (a measure of bone strength in the hip), and 8649 gene expression levels measured in trans-iliacal bone biopsies serve as covariates. Low bone mineral density is usually related to fragile bones, osteoporosis and a progressive reduction of bone strength, which leads to increased susceptibility to bone fractures (Cooper, Citation1997; Reppe et al., Citation2010). The goal of the study is to identify the genes that are related to BMD.
Table 9 reports the genes identified by the five competing methods. The MCC method identified 19 genes, which include all of the 13 genes identified by NFR except TNK2. The PWL method identified 12 genes, all of which were also identified by NFR except PAIP1. The UFR method found 10 significant genes, all of which are contained in the set identified by NFR.
Table 9. Selected genes for the BMD data.
To examine the quality of variable selection of these methods, we compare the prediction mean square errors obtained by the competing methods. To this end, we randomly split the data into a training set of 60 samples and a testing set of the remaining 24 samples, and repeat the split 100 times. The average prediction mean square errors over the 100 replications for MCC, NCC, NFR and UFR are 273.7, 293.5, 271.0 and 241.6, respectively. Clearly, the UFR method is the winner. All eight genes (ACSL3, NIPSNAP3B, DLEU2, C1ORF61, DKK1, SOST, ABCA8, and AFFX-M27830-M-at) identified by Reppe et al. (Citation2010) were selected by the four competing methods. The UFR method discovered two more genes, RNF216 and PLIN5, while achieving the smallest prediction mean square error. Similar to the findings in simulation, both MCC and NCC selected more genes than NFR and UFR but with larger prediction errors.
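The out-of-sample comparison can be reproduced along the following lines: repeatedly split the data into 60 training and 24 testing samples, refit on the training part using a method's selected genes, and average the prediction mean square error on the held-out part. Refitting by ordinary least squares is our assumption; the refitting step is not specified in this extract.

```python
import numpy as np

def prediction_mse(X, Y, selected, n_train=60, n_rep=100, rng=None):
    """Average held-out prediction MSE for a given set of selected covariates."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    errors = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        # Refit by least squares on the training split using only the selected genes.
        coef, *_ = np.linalg.lstsq(X[tr][:, selected], Y[tr], rcond=None)
        pred = X[te][:, selected] @ coef
        errors.append(np.mean((Y[te] - pred) ** 2))
    return float(np.mean(errors))
```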
7. Conclusion
We propose two methods for variable screening and selection in high-dimensional multivariate linear regression. The MCC method is computationally simple and provides fast variable screening to obtain an accurate subset of predictors with a dimension below the sample size. The proposed UFR method discovers all relevant predictors consistently at nearly the same computational cost as the univariate forward regression. The performance of UFR is sensitive to the dimensionality and the signal strength. Our theory assumes a Gaussian distribution for the response variables, but the numerical study shows that the proposed methods are robust against non-normality. It is of interest to investigate the problem under more general, non-homogeneously sparse assumptions and under nonlinear models.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Anderson, T. W. (2003). An introduction to multivariate statistical analysis (3rd ed.). Wiley.
- Bickel, P. J., & Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227. https://doi.org/10.1214/009053607000000758
- Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384. https://doi.org/10.1080/00401706.1995.10484371
- Breiman, L., & Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1), 3–54. https://doi.org/10.1111/rssb.1997.59.issue-1
- Cai, T., Li, H., Liu, W., & Xie, J. (2013). Covariate–adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100(1), 139–156. https://doi.org/10.1093/biomet/ass058
- Cai, T., Liu, W., & Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607. https://doi.org/10.1198/jasa.2011.tm10155
- Cai, T., & Lv, J. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2365–2369. https://doi.org/10.1214/009053607000000442
- Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351. https://doi.org/10.1214/009053606000001523
- Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545. https://doi.org/10.1080/01621459.2012.734178
- Cooper, C. (1997). The crippling consequences of fractures and their impact on quality of life. American Journal of Medicine, 103(2), 12–19. https://doi.org/10.1016/S0002-9343(97)90022-X
- Deshpande, S., Rockova, V., & George, E. (2019). Simultaneous variable and covariance selection with the multivariate Spike- and Slab lasso. Journal of Computational and Graphical Statistics, 28(4), 921–931. https://doi.org/10.1080/10618600.2019.1593179
- Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499. https://doi.org/10.1214/009053604000000067
- Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
- Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B, 70(5), 849–911. https://doi.org/10.1111/rssb.2008.70.issue-5
- Fan, J., & Peng, H. (2004). Non-concave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32(3), 928–961. https://doi.org/10.1214/009053604000000256
- Fang, K.-T., Kotz, S., & Ng, K. W. (2018). Symmetric multivariate and related distributions. Chapman and Hall/CRC.
- Ferte, C., Trister, A. D., Erich, H., & Bot, B. (2013). Impact of bioinformatic procedures in the development and translation of high-throughput molecular classifiers in oncology. Clinical Cancer Research, 19(16), 4315–4325. https://doi.org/10.1158/1078-0432.CCR-12-3937
- Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135. https://doi.org/10.1080/00401706.1993.10485033
- Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3), 397–416. https://doi.org/10.1080/10618600.1998.10474784
- He, K., Lian, H., Ma, S., & Huang, J. Z. (2018). Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. Journal of the American Statistical Association, 113(522), 746–754. https://doi.org/10.1080/01621459.2017.1285774
- Jia, B., Xu, S., Xiao, G., & Lambda, V. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics, 73(4), 1221–1230. https://doi.org/10.1111/biom.v73.4
- Kim, S., Sohn, K., & Xing, E. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics (Oxford, England), 25(12), 204–212. https://doi.org/10.1093/bioinformatics/btp218
- Kong, X., Liu, Z., Yao, Y., & Zhou, W. (2017). Sure screening by ranking the canonical correlations. Test, 26(1), 46–70. https://doi.org/10.1007/s11749-016-0497-z
- Kong, Y., Zheng, Z., & Lv, J. (2016). The constrained Dantzing selector with enhanced consistency. Journal of Machine Learning Research, 17(123), 1–22.
- Lee, W., & Liu, Y. (2012). Simultaneous multiple response regression and inverse covariate matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111, 241–255. https://doi.org/10.1016/j.jmva.2012.03.013
- Li, B., Chun, H., & Zhao, H. (2012). Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association, 107(497), 152–167. https://doi.org/10.1080/01621459.2011.644498
- Li, C., & Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics (Oxford, England), 24(9), 1175–1182. https://doi.org/10.1093/bioinformatics/btn081
- Li, G., Peng, H., Zhang, J., & Zhu, L. (2012). Robust rank correlation based screening. Annals of Statistics, 40(3), 1846–1877. https://doi.org/10.1214/12-AOS1024
- Li, Y., Li, G., Lian, H., & Tong, T. (2017). Profile forward regression screening for ultra-high dimensional semiparametric varying coefficient partially linear models. Journal of Multivariate Analysis, 155, 133–150. https://doi.org/10.1016/j.jmva.2016.12.006
- Li, Y., Nan, B., & Zhu, J. (2015). Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2), 354–363. https://doi.org/10.1111/biom.v71.2
- Liang, H., Wang, H., & Tsai, C.-L. (2012). Profiled forward regression for ultrahigh dimensional variable screening in semiparametric partially linear model. Statistica Sinica, 22(2), 531–554. https://doi.org/10.5705/ss.2010.134
- Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39(1), 1–47. https://doi.org/10.1214/09-AOS776
- Pecanka, J., van der Vaart, A. W., & Marianne, J. (2019). Modeling association between multivariate correlated and high-dimensional sparse covariates: The adaptive SVS method. Journal of Applied Statistics, 46(5), 893–913. https://doi.org/10.1080/02664763.2018.1523377
- Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., & Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Annals of Applied Statistics, 4(1), 53–77. https://doi.org/10.1214/09-AOAS271
- Ravikumar, P., Wainwright, M., & Lafferty, J. (2010). High-dimensional Ising model selection using l1 regularized logistic regression. Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691
- Ren, J., Du, Y., Li, S., Ma, S., Jiang, Y., & Wu, C. (2019). Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genetic Epidemiology, 43(3), 276–291. https://doi.org/10.1002/gepi.2018.43.issue-3
- Ren, J., He, T., Li, Y., Liu, S., Du, Y., Jiang, Y., & Wu, C. (2017). Network-based regularization for high dimensional SNP data in the case-control study of type 2 diabetes. BMC Genetics, 18(1), 44. https://doi.org/10.1186/s12863-017-0495-5
- Reppe, S., Refvem, H., Gautvik, V. T., Olstad, O. K., Høvring, P. I., Reinholt, F. P., Holden, M., Frigessi, A., Jemtland, R., & Gautvik, K. M. (2010). Eight genes are highly associated with BMD variation in postmenopausal Caucasian women. Bone, 46(3), 604–612. https://doi.org/10.1016/j.bone.2009.11.007
- Rothman, A. J., Levina, E., & Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4), 947–962. https://doi.org/10.1198/jcgs.2010.09188
- Saulis, L., & Statulevicius, V. (1991). Limit theorems for large deviations (Vol. 73). Springer Science & Business Media.
- Setdji, C. M., & Cook, R. D. (2004). K-means inverse regression. Technometrics, 46(4), 421–429. https://doi.org/10.1198/004017004000000437
- Smith, M., & Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. Journal of the American Statistical Association, 102(478), 417–431. https://doi.org/10.1198/016214506000001031
- Sofer, T., Dicker, L., & Lin, X. (2014). Variable selection for high dimensional multivariate outcomes. Statistica Sinica, 24(4), 1633–1654. https://doi.org/10.5705/ss.2013.019
- Song, Y., Schreier, P. J., Ramirez, D., & Hasija, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing, 128, 449–458. https://doi.org/10.1016/j.sigpro.2016.05.020
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
- Turlach, B., Venables, W., & Wright, S. (2005). Simultaneous variable selection. Technometrics, 47(3), 349–363. https://doi.org/10.1198/004017005000000139
- Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516
- Wang, J., Zhang, Z., & Ye, J. (2019). Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. Journal of Machine Learning Research, 20(163), 1–42.
- Yang, Y., & Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129–1141. https://doi.org/10.1007/s11222-014-9498-5
- Yi, N. (2010). Statistical analysis of genetic interactions. Genetics Research, 92(5–6), 443–459. https://doi.org/10.1017/S0016672310000595
- Yin, J., & Li, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics, 5(4), 2630–2650. https://doi.org/10.1214/11-AOAS494
- Zhang, C., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4), 1567–1594. https://doi.org/10.1214/07-AOS520
- Zhang, H. H., & Lu, W. (2007). Adaptive lasso for Cox's proportional hazards model. Biometrika, 94(3), 691–703. https://doi.org/10.1093/biomet/asm037
- Zhang, N., Yu, Z., & Wu, Q. (2019). Overlapping sliced inverse regression for dimension reduction. Analysis and Applications, 17(5), 715–736. https://doi.org/10.1142/S0219530519400013
- Zhao, W., Lian, H., & Ma, S. (2017). Robust reduced–rank modeling via rank regression. Journal of Statistical Planning and Inference, 180, 1–12. https://doi.org/10.1016/j.jspi.2016.08.009
- Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563
- Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
- Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320. https://doi.org/10.1111/rssb.2005.67.issue-2
Appendix
A.1. Proof of Theorem 3.1
Let and . When for all , it suffices to show that under Assumptions 3.1–3.3
(A1)
Observe that
(A2)
First, we compute the rate of convergence for . For , let . (So .) Let . Applying the inequalities for any V > 0 and for any , we obtain
(A3)
Second, we compute the rate of convergence for . Notice that
(A4)
Parallel to (A3), we have
(A5)
Similar to (A5),
(A6)
Combining (A4), (A5) and (A6),
(A7)
At last, combining (A3) and (A7), we get
which, coupled with (A2), implies (A1). This completes the proof.
A.2. A lemma used to prove Theorem 4.1
Lemma A.1
Suppose Assumptions 3.1, 4.1 and 4.2 hold. Let denote the submatrix of after M. Let and denote the corresponding estimators, respectively. Suppose , where and are defined in Assumption 4.2. Then, with probability tending to one, we have
(A8)
Proof.
The proof is similar to that for Lemma 1 of H. Wang (Citation2009). Here, we relax the normality assumption of to the elliptically contoured distribution.
First, for , , let and . By Assumption 4.1 and the additive property of elliptically contoured distributions (Fang et al., Citation2018), , and .
Second, observe that
Then, following Lemma A.3 of Bickel and Levina (Citation2008), we have
(A9)
Further, for , let and . Observe that by Jensen's inequality there exist positive constants , and such that condition (P) of Saulis and Statulevicius (Citation1991) is satisfied. The same result holds when . By Theorem 3.2 of Saulis and Statulevicius (Citation1991), the first and second terms of (A9) are bounded, respectively, by
and
Therefore,
This, together with , implies that
(A10)
where the positive constants , and δ all depend on alone (Bickel & Levina, Citation2008, Lemma A.3).
The rest of the proof is exactly the same as that for Lemma 1 of H. Wang (Citation2009).
A.3. Proof of Theorem 4.1
Our proof follows similar arguments as in the proof of Theorem 1 of H. Wang (Citation2009).
Assume that no relevant predictor has been discovered in the first ℓ iterations, i.e., . We evaluate the probability that at least one relevant predictor will be identified in the (ℓ+1)th iteration, or equivalently the complementary probability that the predictor selected in the (ℓ+1)th iteration is still an irrelevant one.
Let denote the subset of X corresponding to S. Let denote the coefficient matrix under the true model. Denote
Observe that
(A11)
Assume . We have
(A12)
where
Further, observe that the last inequality of (A12) is no less than
(A13)
where .
In what follows, we study the two terms in (A13) separately.
Step 1: The first term of (A13). Define . And denote . Then, the first term in (A13) can be expressed as
(A14)
Define
Thus, the RHS of (A14) is no less than
(A15)
where the last inequality follows from the fact that .
On the other hand, observe that
(A16)
Applying (A16) to (A15) and (A14), using the fact that with probability tending to one, and by Assumptions 3.1, 4.1 and Lemma 1 of H. Wang (Citation2009), we obtain
(A17)
Define . We have
Recall the assumption at the beginning of the proof that . Then, by Assumptions 3.1, 3.2 and 4.1, and Lemma 1 of H. Wang (Citation2009), we get
(A18)
with probability tending to one. Applying (A18) to (A17) and using Assumptions 3.2, 4.1 and 4.2, we obtain
(A19)
Step 2: The second term of (A13). Recall
where
Then, by Assumptions 3.1 and 3.2 and Lemma 1 of H. Wang (Citation2009), we have .
Moreover, since , we have
(A20)
where .
Notice that is a normal random variable with mean zero and variance given by . Thus, the RHS of (A20) is no greater than
(A21)
where stands for a chi-square random variable with one degree of freedom. By Assumptions 3.1 and 4.1, and Lemma 1 of H. Wang (Citation2009), we get that with probability tending to one.
On the other hand, the total number of combinations for and is no more than . Then, by Assumption 4.2, we get with probability tending to one. Therefore, (A21) is bounded by
(A22)
Combining (A13), (A19) and (A22), we have
(A23)
uniformly for every . Recall defined in Section 4. Then, by Assumptions 4.2 and 4.3, we have
(A24)
where is the Frobenius norm.
Without loss of generality, we can assume , and we have . (Otherwise, we can standardize it by letting for and .) Thus, it is impossible to have for every , which implies that at least one relevant variable will be discovered within steps. This completes the proof.