Abstract
High-dimensional data analysis has been a challenging issue in statistics. Sufficient dimension reduction aims to reduce the dimension of the predictors by replacing the original predictors with a minimal set of their linear combinations, without loss of information. However, the estimated linear combinations generally involve all of the original variables, which makes interpretation difficult. To circumvent this difficulty, sparse sufficient dimension reduction methods have been proposed to conduct model-free variable selection or screening within the framework of sufficient dimension reduction. In this paper, we review the current literature on sparse sufficient dimension reduction and offer some further investigations.
1. Introduction
The rapid development of data collection technology in areas such as biology, financial econometrics and signal processing has posed a great challenge for traditional multivariate analysis. High-dimensional data analysis has become ubiquitous and increasingly important. Dimension reduction, and in particular sufficient dimension reduction for regression, offers an appealing avenue to tackle high-dimensional problems. It is often desirable to replace the original high-dimensional data with a few linear combinations of the predictors, whose number is usually much smaller than the original dimension. Although sufficient dimension reduction is an effective way to extract relevant information from high-dimensional data sets while grasping the important features or patterns in the data, the linear combinations usually involve all of the original predictors, which makes interpretation difficult. This limitation can be overcome via variable selection, where a subset of relevant predictor variables is selected. Removing the excess variables not only reduces noise in the estimation and alleviates the collinearity issue, but also helps lower the computational cost caused by high-dimensional data.
Within the model-based paradigm, many variable selection methods have been developed. The most popular approaches are built on the linear model or the generalised linear model, such as the nonnegative garrotte (Breiman, Citation1995), the least absolute shrinkage and selection operator (Lasso, hereafter) (Tibshirani, Citation1996), the smoothly clipped absolute deviation (SCAD, hereafter) (Fan & Li, Citation2001), the adaptive Lasso (Zou, Citation2006), the group Lasso (Yuan & Lin, Citation2006), the Dantzig selector (Candes & Tao, Citation2007) and the minimax concave penalty (MCP, hereafter) (Zhang, Citation2010).
These model-based variable selection methods assume that the underlying true model is known up to a finite-dimensional parameter, or that the imposed working model is usefully similar to the true model. However, the true model might take a complex form and is usually unknown. If the underlying modelling assumption is violated, these variable selection methods might fail. Hence, model-free variable selection methods, which do not require full knowledge of the underlying true model, are called for. It has been shown that the general framework of sufficient dimension reduction (SDR) is useful for variable selection (Bondell & Li, Citation2009), since no pre-specified model between the response and the predictors is required. Model-free variable selection can thus be achieved through the framework of SDR (Cook, Citation1998; Li, Citation1991, Citation2000).
Let $\mathbf{x} = (x_1, \ldots, x_p)^\top$ be the predictor and $Y$ be the scalar response. The goal of variable selection is to seek the smallest subset of the predictors $\mathbf{x}_{\mathcal{A}}$, with partition $\mathbf{x} = (\mathbf{x}_{\mathcal{A}}^\top, \mathbf{x}_{\mathcal{A}^c}^\top)^\top$, such that

(1) $Y \perp\!\!\!\perp \mathbf{x}_{\mathcal{A}^c} \mid \mathbf{x}_{\mathcal{A}}.$

Here $\mathcal{A}$ denotes a subset of indices of $\{1, \ldots, p\}$ corresponding to the relevant predictor set $\mathbf{x}_{\mathcal{A}}$, and $\mathcal{A}^c$ is the complement of $\mathcal{A}$, i.e., $\mathcal{A} \cup \mathcal{A}^c = \{1, \ldots, p\}$ and $\mathcal{A} \cap \mathcal{A}^c = \emptyset$. Condition (1) implies that $\mathbf{x}_{\mathcal{A}}$ contains all the active predictors in terms of predicting $Y$. The existence and uniqueness of $\mathcal{A}$ were discussed in detail in Yin and Hilafu (Citation2015). Ideally, we want to find the smallest index set $\mathcal{A}$ satisfying (1), in which case no inactive predictors are included in $\mathbf{x}_{\mathcal{A}}$.
Model-free variable selection is closely related to sufficient dimension reduction, which aims to find $\boldsymbol{\beta} \in \mathbb{R}^{p \times d}$ with $d \le p$, such that

(2) $Y \perp\!\!\!\perp \mathbf{x} \mid \boldsymbol{\beta}^\top\mathbf{x},$

that is, $Y$ is independent of $\mathbf{x}$ conditioning on $\boldsymbol{\beta}^\top\mathbf{x}$. The column space of such a $\boldsymbol{\beta}$, $\operatorname{span}(\boldsymbol{\beta})$, is called a dimension reduction space. Under mild assumptions, such as those given in Cook (Citation1996) and Yin et al. (Citation2008), the intersection of all such spaces is itself a dimension reduction space. In this case, we call the intersection the central subspace for the regression of $Y$ on $\mathbf{x}$, and denote it by $\mathcal{S}_{Y|\mathbf{x}}$. Its dimension, $d = \dim(\mathcal{S}_{Y|\mathbf{x}})$, is usually much smaller than p, the dimension of the original predictor. Following the partition of $\mathbf{x}$, we can partition $\boldsymbol{\beta}$ accordingly as $\boldsymbol{\beta} = (\boldsymbol{\beta}_{\mathcal{A}}^\top, \boldsymbol{\beta}_{\mathcal{A}^c}^\top)^\top$, where $\boldsymbol{\beta}_{\mathcal{A}}$ has $|\mathcal{A}|$ rows and $|\mathcal{A}|$ is the cardinality of $\mathcal{A}$. Hence, (1) is equivalent to $\boldsymbol{\beta}_{\mathcal{A}^c} = \mathbf{0}$.
Many methods have been proposed in the literature for estimating a basis of $\mathcal{S}_{Y|\mathbf{x}}$, including sliced inverse regression (SIR, hereafter) (Li, Citation1991), sliced average variance estimation (SAVE, hereafter) (Cook & Weisberg, Citation1991), principal Hessian directions (PHD, hereafter) (Li, Citation1992), minimum average variance estimation (MAVE, hereafter) (Xia et al., Citation2002), directional regression (DR, hereafter) (Li & Wang, Citation2007), principal fitted components (PFC, hereafter) (Cook & Forzani, Citation2008), the semiparametric approach (Ma & Zhu, Citation2012), etc. Several methods have also been suggested for simultaneously selecting the contributing predictors. These include shrinkage SIR (Ni et al., Citation2005), sparse SIR (Li, Citation2007; Li & Nachtsheim, Citation2006), sparse SAVE and sparse PHD (Li, Citation2007), constrained canonical correlation (Zhou & He, Citation2008), the general shrinkage strategy for inverse regression estimation (Bondell & Li, Citation2009), the regularised SIR estimator with the SCAD penalty (Wu & Li, Citation2011), coordinate-independent sparse estimation (CISE, hereafter) (Chen et al., Citation2010), conditional covariance minimisation (Chen et al., Citation2017), etc.
Although the aforementioned methods can select the significant predictors without assuming an underlying parametric model, they are not designed for problems in which the number of predictor variables is larger than the number of observations. Such large p small n problems are increasingly common with rapid technological advances in data collection and have attracted a lot of research interest. We hereby give a very brief review of model-free variable selection via the sufficient dimension reduction approach under the p>n setting. Li and Yin (Citation2008) proposed sparse ridge SIR, which combines SIR with both $L_1$- and $L_2$-regularisation to achieve dimension reduction and variable selection simultaneously, even when p>n. Yu et al. (Citation2013) suggested combining SIR with the Dantzig selector (Candes & Tao, Citation2007) to recover the central subspace in general semiparametric models. A non-asymptotic error bound for the resulting estimator is derived, and the bound appears to be optimal. Moreover, they proposed another regularised version of SIR with the adaptive Dantzig selector. The resulting estimators defined from variable selection are asymptotically normal even when the predictor dimension diverges to infinity. It is worth mentioning that the structural dimension is fixed in Yu et al. (Citation2013). Yu, Dong, Zhu (Citation2016) proposed trace pursuit for model-free variable selection under the sufficient dimension reduction paradigm. Two distinct algorithms are proposed: stepwise trace pursuit (STP, hereafter) and forward trace pursuit (FTP, hereafter). Stepwise trace pursuit achieves selection consistency with fixed p and is applicable in the setting with p>n. Furthermore, forward trace pursuit can serve as an initial screening step to speed up the computation in the case of ultrahigh dimensionality. Li and Dong (Citation2020) extended the trace pursuit method to matrix-valued predictors based on Yu, Dong, Zhu (Citation2016). To test the importance of rows, columns and submatrices of the predictor matrix in terms of predicting the response, three types of hypotheses are formulated under a unified framework. The asymptotic properties of the test statistics under the null hypothesis are established, and a permutation testing algorithm is introduced to approximate the distribution of the test statistics. Tan et al. (Citation2018) developed a convex formulation for fitting sparse SIR in high dimensions. They solved the resulting convex optimisation problem via a linearised alternating direction method of multipliers algorithm and established an upper bound on the subspace distance between the estimated and the true subspaces. Unlike Yu et al. (Citation2013), Lin et al. (Citation2019) allowed the ratio p/n to go to infinity. By constructing artificial response variables made up from the top eigenvectors of the estimated conditional covariance matrix, Lin et al. (Citation2019) introduced a simple Lasso regression method to obtain an estimator of the sufficient dimension reduction space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when p is of order $o(n^2c^2)$, where c is the generalised signal-to-noise ratio; Lasso-SIR is only the first step of Tan et al. (Citation2020). Moreover, Tan et al. (Citation2020) discovered a possible trade-off between statistical guarantee and computational performance for sparse SIR and proposed an adaptive estimation scheme for sparse SIR which is computationally tractable and rate optimal under a condition weaker than that of Lin et al. (Citation2019).
There is considerable literature on applying sufficient dimension reduction for model-free selection, but the study of screening consistency in the ultrahigh-dimensional setting is still lacking. To fill this gap, Zhu et al. (Citation2011) proposed a variable screening procedure under a unified model framework, which contains a wide variety of commonly used parametric and semiparametric models. The new method does not require imposing a specific model structure on the regression function and is thus particularly appealing for ultrahigh-dimensional regressions. They also showed that the proposed method achieves screening consistency even with the number of predictors growing at an exponential rate of the sample size. Yu, Dong, Shao (Citation2016) proposed an approach called marginal SIR for model-free variable selection. Furthermore, marginal SIR with the Dantzig selector exploits the sparsity structure in the marginal utility and achieves the desirable selection consistency property. Lin et al. (Citation2017) first introduced a large class of models depending on the smallest non-zero eigenvalue of the kernel matrix of SIR and then derived the minimax rate for estimating the central space; theirs is the first paper to study the minimax estimation of sparse SIR. However, they only considered the projection loss (Li & Wang, Citation2007). More importantly, their theoretical study is based on the assumption that the covariance matrix is diagonal. As far as we know, most of the work mentioned above focuses on SIR with consistency in variable selection. Qian et al. (Citation2019) provided a simultaneous analysis for PFC and SAVE. Furthermore, their approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with n. To deliver the most essential messages, in the following sections we focus our discussion on the papers mentioned above.
2. Review of sufficient dimension reduction
Sufficient dimension reduction aims to find the column space of $\boldsymbol{\beta}$ with the smallest dimension d. In other words, sufficient dimension reduction is posed as a problem of estimating a space, instead of the classic statistical problem of estimating parameters. As mentioned in the introduction, there are many approaches in the sufficient dimension reduction literature for estimating the column space $\mathcal{S}_{Y|\mathbf{x}}$: sliced inverse regression (SIR; Li, Citation1991), sliced average variance estimation (SAVE; Cook & Weisberg, Citation1991), minimum average variance estimation (MAVE; Xia et al., Citation2002), the kth moment estimation (Yin & Cook, Citation2002, Citation2003), inverse regression (Cook & Ni, Citation2005), directional regression (DR; Li & Wang, Citation2007), sliced regression (SR; Wang & Xia, Citation2008), likelihood acquired directions (LAD; Cook & Forzani, Citation2009), semiparametric approaches (Ma & Zhu, Citation2012, Citation2013a, Citation2013b, Citation2014), etc. We mainly review three inverse regression-based methods (SIR, SAVE and DR) for estimating $\mathcal{S}_{Y|\mathbf{x}}$ in preparation for our subsequent investigation.
Inverse regression methods constitute the oldest class of dimension reduction methods and are still under active development. The main idea of inverse regression is to reverse the relation between the response and the predictors (Li, Citation1991). Instead of considering distributions or expectations of functions of Y conditional on $\mathbf{x}$, which suffers from the curse of dimensionality when $\mathbf{x}$ is high dimensional, these inverse regression-based methods consider expectations of functions of $\mathbf{x}$ conditional on Y, which is a low-dimensional problem because Y is univariate. The inverse regression-based methods are often based on some additional assumptions on the predictors that link the low-dimensional problem and the original high-dimensional problem. These additional assumptions are given as follows.
(W1) Linearity condition: $E(\mathbf{x} \mid \boldsymbol{\beta}^\top\mathbf{x})$ is a linear function of $\boldsymbol{\beta}^\top\mathbf{x}$;

(W2) Constant variance condition: $\operatorname{Cov}(\mathbf{x} \mid \boldsymbol{\beta}^\top\mathbf{x})$ is a non-random matrix;

where $\boldsymbol{\beta}$ is a basis of the central subspace $\mathcal{S}_{Y|\mathbf{x}}$. As is well known, SIR only requires that condition (W1) holds, whereas SAVE and DR need both conditions.

When the linearity condition and the constant variance condition are satisfied, the inverse regression methods formulate the problem of estimating $\mathcal{S}_{Y|\mathbf{x}}$ as an eigen-decomposition problem. Let $M$ be the kernel matrix of a specific inverse regression based dimension reduction method, and write $Z = \Sigma^{-1/2}\{\mathbf{x} - E(\mathbf{x})\}$ for the standardised predictor, where $\Sigma = \operatorname{Cov}(\mathbf{x})$. For the sufficient dimension reduction methods that aim to estimate $\mathcal{S}_{Y|\mathbf{x}}$, the kernel matrices corresponding to the three most well-known inverse regression methods are summarised as below:

$M_{\mathrm{SIR}} = \operatorname{Cov}\{E(\mathbf{x} \mid Y)\}$, $M_{\mathrm{SAVE}} = E\{\mathbf{I}_p - \operatorname{Cov}(Z \mid Y)\}^2$, and $M_{\mathrm{DR}} = 2E\{E^2(ZZ^\top \mid Y)\} + 2E^2\{E(Z \mid Y)E(Z^\top \mid Y)\} + 2E\{E(Z^\top \mid Y)E(Z \mid Y)\}E\{E(Z \mid Y)E(Z^\top \mid Y)\} - 2\mathbf{I}_p$.

Assuming $M$ is known, the procedure amounts to a generalised eigenvalue decomposition of the kernel matrix $M$, that is,

$M \boldsymbol{\eta}_i = \lambda_i \Sigma \boldsymbol{\eta}_i, \quad i = 1, \ldots, p,$

where, for kernels defined on the standardised scale, Σ is replaced by $\mathbf{I}_p$ and the resulting directions are transformed back through $\Sigma^{-1/2}$, and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the eigenvalues. The eigenvectors corresponding to the nonzero eigenvalues $\lambda_1, \ldots, \lambda_d$ form a basis of $\mathcal{S}_{Y|\mathbf{x}}$. Thus the sufficient dimension reduction directions $\mathbf{B} = (\boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_d)$ can also be identified through the following optimisation problem (Tan et al., Citation2020):

(3) $\max_{\mathbf{B} \in \mathbb{R}^{p \times d}} \operatorname{tr}(\mathbf{B}^\top M \mathbf{B})$ subject to $\mathbf{B}^\top \Sigma \mathbf{B} = \mathbf{I}_d$.
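To make the slice-based kernel estimate and the generalised eigenvalue formulation above concrete, the following minimal Python sketch estimates the SIR directions. The simulated single-index model, the slice count and the function name `sir_directions` are illustrative choices of ours, not constructs taken from the papers reviewed here.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, n_slices=10, d=1):
    """Estimate SIR directions: M = Cov{E(x|Y)}, then solve M v = lam * Sigma v."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    x_bar = X.mean(axis=0)
    M = np.zeros((p, p))
    # slice the sample by the order of y and accumulate weighted slice means
    for idx in np.array_split(np.argsort(y), n_slices):
        diff = X[idx].mean(axis=0) - x_bar
        M += (len(idx) / n) * np.outer(diff, diff)
    # generalised eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = eigh(M, Sigma)
    return eigvecs[:, -d:], eigvals[-d:]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = np.tanh(X[:, 0] + X[:, 1]) + 0.1 * rng.standard_normal(500)
B, _ = sir_directions(X, y)
# the estimated direction should be roughly proportional to (1, 1, 0, ..., 0)
print(np.round(B[:, 0] / np.abs(B[:, 0]).max(), 2))
```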
3. The current literature of variable selection via sufficient dimension reduction
3.1. Oracle property under the setting p<n
In the general framework of condition (1), the shrinkage SIR method was developed in Ni et al. (Citation2005) by applying the Lasso approach to SIR. When a subset of predictors is irrelevant, the corresponding rows of the estimated basis are shrunk to 0, and variable selection is achieved as a consequence. Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_p)^\top$, with $\alpha_k$ rescaling the kth row of the estimated directions, be the shrinkage vector. Then, based on the expression (3), the estimation of the shrinkage vector can be written as minimising over $\boldsymbol{\alpha}$ a quadratic discrepancy function $F_n(\boldsymbol{\alpha})$, obtained from (3) by rescaling the rows of the estimated directions by $\boldsymbol{\alpha}$, subject to an $L_1$ constraint (Ni et al., Citation2005):

(4) $\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} F_n(\boldsymbol{\alpha})$ subject to $\sum_{k=1}^p |\alpha_k| \le \tau$,

where $\hat{M}$, the estimator of the kernel matrix $M$, enters $F_n$ through the estimated directions. To investigate the asymptotic behaviour, we consider the Lagrangian formulation of the constrained optimisation problem. Specifically, the optimisation problem in expression (4) can be reformulated as minimising

$G_n(\boldsymbol{\alpha}) = F_n(\boldsymbol{\alpha}) + \lambda \sum_{k=1}^p |\alpha_k|$

for some non-negative penalty constant λ. The central dimension reduction subspace $\mathcal{S}_{Y|\mathbf{x}}$ is then estimated by $\operatorname{span}\{\operatorname{diag}(\hat{\boldsymbol{\alpha}})\hat{\boldsymbol{\eta}}_1, \ldots, \operatorname{diag}(\hat{\boldsymbol{\alpha}})\hat{\boldsymbol{\eta}}_d\}$. Li (Citation2007) extended the shrinkage SIR method to the SAVE and PHD methods, where the central dimension reduction subspace is estimated in the same way as in Ni et al. (Citation2005), with $(\hat{\boldsymbol{\eta}}_1, \ldots, \hat{\boldsymbol{\eta}}_d)$ corresponding to the estimated central dimension reduction directions of the SAVE and PHD methods, respectively. Bondell and Li (Citation2009) proposed a general shrinkage estimation strategy for the entire inverse regression estimation family that is capable of simultaneous sufficient dimension reduction and variable selection. They considered the adaptive Lasso penalty

$\lambda \sum_{k=1}^p \hat{w}_k |\alpha_k|,$

where $\hat{\mathbf{w}} = (\hat{w}_1, \ldots, \hat{w}_p)^\top$ is a known weight vector, typically built from the reciprocals of a consistent initial estimate. They also demonstrated that the proposed class of shrinkage estimators has the desirable oracle property of consistency in variable selection while retaining root-n estimation consistency.
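As a rough illustration of how such a shrinkage vector can be computed, the sketch below linearises a slice-based discrepancy in α and solves the resulting problem with an off-the-shelf Lasso solver. The objective is a schematic stand-in for the exact criteria of Ni et al. (Citation2005) and Bondell and Li (Citation2009), and `eta_hat` is any SIR basis estimate, such as the one returned by `sir_directions` above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def shrinkage_vector(X, y, eta_hat, n_slices=10, lam=0.01):
    """Estimate alpha so that alpha_j = 0 removes predictor j from all directions."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    x_bar = X.mean(axis=0)
    rows, rhs = [], []
    for idx in np.array_split(np.argsort(y), n_slices):
        t_h = X[idx].mean(axis=0) - x_bar              # centred slice mean
        # project the slice mean onto the fitted SIR directions
        gamma_h, *_ = np.linalg.lstsq(Sigma @ eta_hat, t_h, rcond=None)
        m_h = eta_hat @ gamma_h                        # fitted combination
        # t_h ~ Sigma diag(alpha) m_h = Sigma diag(m_h) alpha: linear in alpha
        rows.append(Sigma @ np.diag(m_h))
        rhs.append(t_h)
    A, b = np.vstack(rows), np.concatenate(rhs)
    return Lasso(alpha=lam, fit_intercept=False).fit(A, b).coef_
```

The sparse basis estimate is then `np.diag(alpha_hat) @ eta_hat`; an adaptive-Lasso variant is obtained by rescaling the columns of `A` with the weights.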
However, most existing sparse dimension reduction methods mentioned above are conducted stepwise, estimating a sparse solution for a basis matrix of the central subspace column by column. Instead, Chen et al. (Citation2010) proposed a unified one-step approach to reduce the number of variables appearing in the estimate of $\mathcal{S}_{Y|\mathbf{x}}$. Their approach, which depends operationally on Grassmann manifold optimisation, can achieve dimension reduction and variable selection simultaneously. Additionally, the proposed method has the oracle property: under mild conditions, the proposed estimator performs asymptotically as well as if the true irrelevant predictors were known. More importantly, Chen et al. (Citation2010) can be viewed as an extension of Bondell and Li (Citation2009), combining SIR, SAVE and DR with the adaptive Lasso for variable selection. Zhou and He (Citation2008) proposed a constrained canonical correlation procedure ($C^3$) based on imposing the $L_1$-norm constraint on the effective dimension reduction estimates in CANCOR, followed by a simple variable selection method. Using the B-spline basis functions generated for the response variable, the CANCOR method (Fung et al., Citation2002) is asymptotically equivalent to SIR. Suppose that the range of $Y$ is a bounded interval $[a, b]$; given $k_n$ interior knots in $[a, b]$ and the spline order m, we generate $k_n + m$ B-spline basis functions. Under the linearity condition, CANCOR estimates a set of effective dimension reduction directions by estimating the canonical variates between the B-spline basis functions and $\mathbf{x}$. Since the generated $k_n + m$ B-spline basis functions add to 1, we use in CANCOR the first $k_n + m - 1$ basis functions $\pi(Y) = \{\pi_1(Y), \ldots, \pi_{k_n+m-1}(Y)\}^\top$. Let $\mathbb{X}$ and $\Pi$ be the data matrices containing the predictor values and the B-spline basis function values. Then the CANCOR method estimates the canonical correlations between the columns of $\mathbb{X}$ and the columns of Π. The dimensionality of the central dimension reduction subspace is selected by performing sequential tests on the number of non-zero canonical correlations, $H_0: \rho_{s+1} = 0$ versus $H_1: \rho_{s+1} > 0$ for $s = 0, 1, \ldots$, where $\rho_1 \ge \rho_2 \ge \cdots$ are the asymptotic canonical correlations between $\pi(Y)$ and $\mathbf{x}$ in decreasing order. The dimensionality estimate for d is the smallest s such that $H_0: \rho_{s+1} = 0$ is not rejected. The CANCOR method actually solves an optimisation problem that sequentially finds the directions with maximum correlation between linear combinations of $\mathbf{x}$ and some functions of $Y$. Their procedure is attractive because they demonstrated that it also has the oracle property.
Sparse sufficient dimension reduction methods mentioned above focus on the case where p is fixed. For regressions with diverging p, estimation and variable selection methods have also been developed in the framework of sufficient dimension reduction: Zhu et al. (Citation2006) studied the asymptotic properties of SIR as p diverges, but their result is for SIR only, and variable selection is not studied at all. Zhu and Zhu (Citation2009a) investigated weighted partial least squares with a diverging p, but again variable selection is not derived. Zhu and Zhu (Citation2009b) investigated variable selection with a diverging number of predictors through inverse regression, but focused on single-index models only. By contrast, Wu and Li (Citation2011) established asymptotic properties for a family of inverse regression estimators that includes SIR, studied simultaneous dimension reduction and variable selection with a particular emphasis on the latter and encompassed more general forms, while the number of predictors p is allowed to diverge as the sample size n approaches infinity. Wu and Li (Citation2011) adopted the SCAD-type penalty first introduced by Fan and Li (Citation2001) and combined it with the sufficient dimension reduction estimator; that is, they minimise the inverse regression objective plus the row-wise penalty $\sum_{i=1}^p p_{\lambda_i}(\|\mathbf{b}_i\|)$, with $\mathbf{b}_i$ the ith row of the basis matrix. The penalties $p_{\lambda_i}(\cdot)$ are not necessarily the same for all i. Wu and Li (Citation2011) also showed that the penalised estimator selects all truly contributing predictors and excludes all irrelevant ones with probability approaching one.
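For reference, the SCAD penalty of Fan and Li (Citation2001) that enters the objective above can be evaluated elementwise as follows; this is the standard penalty function itself, not Wu and Li's full penalised estimator.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Elementwise SCAD penalty; a = 3.7 is the value recommended by Fan & Li (2001)."""
    t = np.abs(np.asarray(theta, dtype=float))
    out = np.where(t <= lam, lam * t, 0.0)                            # linear part
    mid = (t > lam) & (t <= a * lam)                                  # quadratic part
    out = np.where(mid, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)), out)
    return np.where(t > a * lam, lam**2 * (a + 1) / 2, out)           # constant tail
```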
Based on work in kernel dimension reduction, Chen et al. (Citation2017) proposed a method that performs feature selection via a constrained optimisation problem; the corresponding SDR methods are those of Fukumizu et al. (Citation2009) and Fukumizu and Leng (Citation2014). Many previous kernel approaches are filter methods based on the Hilbert–Schmidt Independence Criterion (HSIC, Gretton et al., Citation2005). Chen et al. (Citation2017) proposed instead to use the trace of the conditional covariance operator as a criterion for feature selection. Let $\mathcal{H}$ denote an RKHS supported on the feature space. Then the trace of the conditional covariance operator, $\operatorname{Tr}[\Sigma_{YY \cdot X_{\mathcal{T}}}]$, can be interpreted as a dependence measure, as long as the RKHS $\mathcal{H}$ is large enough. The problem of supervised feature selection then reduces to minimising the trace of the conditional covariance operator over subsets of features with controlled cardinality: $\min_{\mathcal{T}: |\mathcal{T}| \le m} \operatorname{Tr}[\Sigma_{YY \cdot X_{\mathcal{T}}}]$. They also showed that the empirical estimate of the criterion is consistent as the sample size increases. It is worth noting that kernel feature selection methods have the advantage of capturing nonlinear relationships between the features and the labels.
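A minimal sketch of this criterion is given below: it scores a candidate feature subset by the trace of a regularised empirical conditional covariance operator built from centred Gram matrices. The Gaussian kernel, the bandwidth and the regularisation level `eps` are illustrative choices; the estimator in Chen et al. (Citation2017) differs in its details.

```python
import numpy as np

def gaussian_gram(Z, bandwidth=1.0):
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def centred(G):
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def cond_cov_trace(X, y, subset, eps=1e-3, bandwidth=1.0):
    """Smaller values indicate that X[:, subset] explains more of y."""
    n = X.shape[0]
    Gx = centred(gaussian_gram(X[:, sorted(subset)], bandwidth))
    Gy = centred(gaussian_gram(y.reshape(-1, 1), bandwidth))
    # eps * trace{Gy (Gx + n*eps*I)^{-1}}: a regularised empirical surrogate
    # for the trace of the conditional covariance operator
    return eps * np.trace(np.linalg.solve(Gx + n * eps * np.eye(n), Gy))
```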
Theorem 3.1

Assume the regularity conditions of Bondell and Li (Citation2009) hold, and suppose that the penalty parameter satisfies $\lambda_n \to 0$ and $n^{1/2}\lambda_n \to \infty$ as $n \to \infty$. Then the shrinkage estimator satisfies

(a) consistency in variable selection, $P(\hat{\mathcal{A}} = \mathcal{A}) \to 1$; and

(b) asymptotic normality, $n^{1/2}\{\operatorname{vec}(\hat{\boldsymbol{\beta}}_{\mathcal{A}}) - \operatorname{vec}(\boldsymbol{\beta}_{\mathcal{A}})\} \to N(\mathbf{0}, \boldsymbol{\Sigma}_{\mathcal{A}})$ in distribution, for some covariance matrix $\boldsymbol{\Sigma}_{\mathcal{A}}$.
Remark 3.1

Theorem 3.1, part (a), indicates that the sparse sufficient dimension reduction estimator can select the contributing predictors consistently, i.e., for all $k \in \mathcal{A}$ we have $P(\hat{\alpha}_k \neq 0) \to 1$, and for all $k \in \mathcal{A}^c$ we have $P(\hat{\alpha}_k = 0) \to 1$. Theorem 3.1, part (b), further shows that the estimator of the directions that correspond to the contributing predictors is root-n consistent. The oracle property shown in Theorem 3.1 is established in Bondell and Li (Citation2009), Chen et al. (Citation2010), Wu and Li (Citation2011) and Zhou and He (Citation2008). Most of the methods mentioned above cannot achieve the desired property with p>n; however, Wu and Li (Citation2011) showed that their proposed method retains selection consistency when p diverges as the sample size n goes to infinity. We now turn to the oracle property with p>n.
3.2. Oracle property under the setting p>n
Large-p-small-n problems appear frequently in fields such as biology, economics and finance. While the above variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and high-frequency finance further push the dimensionality of data to an even larger scale, where p may grow exponentially with n. Such ultrahigh-dimensional data present simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability. It is difficult to apply the aforementioned variable selection methods directly to such ultrahigh-dimensional statistical learning problems because of the computational complexity inherent in those methods. To reduce the predictor dimension in semiparametric regressions, Yu et al. (Citation2013) proposed an $L_1$-minimisation of SIR with the Dantzig selector (Candes & Tao, Citation2007), which is defined as

(5) minimising the $L_1$ norm of the candidate directions subject to a sup-norm constraint, with level λ, on the sample SIR estimating equations built from the sample covariance matrix and the estimated kernel matrix.

Furthermore, they established a non-asymptotic error bound for the resulting estimator when the structural dimension is fixed. Yu et al. (Citation2013) also extended the regularisation concept to SIR with an adaptive Dantzig selector, which is defined by

(6) the same $L_1$-minimisation with entrywise weights given by a known weight matrix, where each weight is a specified positive value that should vary inversely with the magnitude of the corresponding initial estimate.

Yu et al. (Citation2013) proposed a two-step estimation procedure to select the contributing predictors. In the first step, they screen out informative predictors based on (5); this is called Dantzig-selector-based SIR. In the second step, they enhance the sparsity and the estimation efficiency with (6), based on the predictors selected in the first step; this is called iterative adaptive Dantzig-selector-based SIR. The procedure ensures that all contributing predictors are selected with high probability and that the resulting estimator is asymptotically normal even when the predictor dimension diverges to infinity.
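The constrained $L_1$ programme in (5) is a linear programme once the sup-norm constraint is written out. The sketch below solves a generic Dantzig-selector step of this type for a single direction; `m_hat` stands for an estimated inverse-regression vector, and the exact constraint matrices in Yu et al. (Citation2013) differ in detail.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(Sigma_hat, m_hat, lam):
    """min ||beta||_1  subject to  ||Sigma_hat @ beta - m_hat||_inf <= lam."""
    p = Sigma_hat.shape[0]
    # write beta = u - v with u, v >= 0 and minimise 1'u + 1'v
    c = np.ones(2 * p)
    D = np.hstack([Sigma_hat, -Sigma_hat])
    A_ub = np.vstack([D, -D])                       # encodes the two-sided constraint
    b_ub = np.concatenate([m_hat + lam, lam - m_hat])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]
```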
However, there is a gap between the optimisation problem and the theoretical results: there is no guarantee that the estimator obtained from solving the proposed biconvex optimisation problem is the global minimum. Most existing work in the high-dimensional sufficient dimension reduction literature involves nonconvex optimisation problems. Moreover, such methods seek to estimate a set of reduced predictors that are not identifiable by definition, rather than the central subspace. Yin and Hilafu (Citation2015) proposed a sequential approach for estimating high-dimensional SIR. Both proposals are stepwise procedures that do not correspond to solving a convex optimisation problem. Moreover, as discussed in Yin and Hilafu (Citation2015), theoretical properties for their proposed estimators are hard to establish because of the sequential procedure used to obtain the estimators. In the high-dimensional setting, Lin et al. (Citation2018) proposed a screening approach to perform variable selection and established an error bound for the estimators that allows the ratio p/n to go to infinity. The selected variables are then used to fit classic SIR. Furthermore, the resulting algorithm is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when p is of order $o(n^2c^2)$, where c is the generalised signal-to-noise ratio. Tan et al. (Citation2018) proposed a convex formulation for sparse SIR in the high-dimensional setting by adapting techniques from sparse canonical correlation analysis. Their proposal estimates the central subspace directly and performs variable selection simultaneously. Moreover, the proposed method can be adapted to sufficient dimension reduction methods that can be formulated as generalised eigenvalue problems.
As mentioned in the introduction, most of the literature focuses on SIR with consistency in variable selection. Qian et al. (Citation2019) proposed methods under a unified minimum discrepancy framework with regularisation. Consistency results in both central subspace estimation and variable selection are established simultaneously for several well-known SDR methods, including SIR, PFC and SAVE. More importantly, their approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with n. Unlike many high-dimensional SDR methods, their method does not necessarily require a sparsity condition on the predictor covariance matrix or the maximum eigenvalue of the predictor covariance matrix to be upper bounded. Furthermore, they developed a new algorithm that can efficiently solve a general class of high-dimensional sparse minimum discrepancy problems.
Many SDR methods can be rewritten as a minimisation problem using an objective function of the form

(7) $F_n(\mathbf{B}, \mathbf{C}) = \{\operatorname{vec}(\hat{M}) - \operatorname{vec}(\hat{\Sigma}\mathbf{B}\mathbf{C})\}^\top \hat{V} \{\operatorname{vec}(\hat{M}) - \operatorname{vec}(\hat{\Sigma}\mathbf{B}\mathbf{C})\},$

where $\hat{M}$, $\hat{\Sigma}$ and $\hat{V}$ are sample estimates of the population matrices $M$, $\Sigma$ and $V$. Here, $M$ is a $p \times q$ kernel matrix associated with a particular SDR method, with $q \ge d$; $V$ is some $pq \times pq$ positive definite matrix; and $\mathbf{B} \in \mathbb{R}^{p \times d}$ and $\mathbf{C} \in \mathbb{R}^{d \times q}$ represent parameters to be estimated by minimisation of $F_n$. The general form of (7) is an adaptation of the minimum discrepancy approach proposed by Cook and Ni (Citation2005). To identify the correct sparsity structure of $\mathbf{B}$ under $p>n$ scenarios, Qian et al. (Citation2019) adopted a coordinate-independent regularisation approach and imposed the row-wise penalty $\lambda \sum_{i=1}^p \|\mathbf{b}^{(i)}\|_2$ with tuning parameter λ on (7), under the alternative constraint $\mathbf{B}^\top \hat{\Sigma} \mathbf{B} = \mathbf{I}_d$, giving the objective function

$G_n(\mathbf{B}, \mathbf{C}) = F_n(\mathbf{B}, \mathbf{C}) + \lambda \sum_{i=1}^p \|\mathbf{b}^{(i)}\|_2,$

where $\mathbf{b}^{(i)}$ denotes the ith row of $\mathbf{B}$. Given its minimiser $(\hat{\mathbf{B}}, \hat{\mathbf{C}})$, they simultaneously estimated $\mathcal{S}_{Y|\mathbf{x}}$ by $\operatorname{span}(\hat{\mathbf{B}})$ and estimated the active set $\mathcal{A}$ by $\{i: \|\hat{\mathbf{b}}^{(i)}\|_2 \neq 0\}$.
Tan et al. (Citation2020) considered four loss functions for measuring the accuracy of an estimated basis: the general loss, the projection loss, the prediction loss and the correlation loss; see Tan et al. (Citation2020) for their precise definitions. To further address the computational issue, they proposed a refined sparse SIR estimator. The refined sparse SIR estimator is also rate-optimal yet computationally intractable; however, its computationally feasible counterpart, based on the adaptive estimation procedure, is proven to be nearly rate-optimal. Compared to Lasso-SIR (Lin et al., Citation2019), which was shown to be rate optimal only when p = o(n), their sparse SIR approach is rate optimal even when p is much larger than n. Therefore, their proposed sparse SIR estimator enjoys a much wider range of applications. The reason Lasso-SIR fails to work when p is much larger than n is that it requires the estimation of the eigenvalues and eigenvectors of the $p \times p$ non-sparse SIR kernel matrix. It is well known that the sample eigenvalues and eigenvectors are not even consistent when p/n has a nonzero limit as $n \to \infty$. In summary, the minimax lower bound obtained, the two rate-optimal yet computationally infeasible estimators, the two corresponding computationally tractable counterparts, and the theoretical upper bounds of the four estimators under the four loss functions together provide a thorough understanding of sparse SIR. It is also worth noting that Lin et al. (Citation2019) is just the first step towards Tan et al. (Citation2020). Bondell and Li (Citation2009) demonstrated that $\mathcal{S}_{Y|\mathbf{x}} = \operatorname{span}(\Sigma^{-1}M)$; the sparse representation of SIR then relies on s, the number of truly relevant predictors, i.e., the cardinality of the support of $\mathbf{B}$, where the support denotes the set of indices of the nonzero rows of $\mathbf{B}$. Assuming s is small relative to p, sparse SIR is further defined through seeking $\mathbf{B}$ such that

(8) $\hat{\mathbf{B}} = \arg\max_{\mathbf{B} \in \mathbb{R}^{p \times d}} \operatorname{tr}(\mathbf{B}^\top M \mathbf{B})$ subject to $\mathbf{B}^\top \Sigma \mathbf{B} = \mathbf{I}_d$ and $\mathbf{B}$ having at most s nonzero rows.

The above formulation of sparse SIR is of a similar fashion to that of sparse CCA (Gao et al., Citation2015). To obtain theoretical results, the following conditions are required.
Conditions A1–A6 of Yu et al. (Citation2013) comprise (A1) a linearity condition on the conditional mean of the predictors, (A2)–(A5) moment conditions together with bounds on the nonzero eigenvalues of the kernel matrix and the existence of certain positive constants, and (A6) bounds on the restricted isometry and restricted orthogonality constants. See Yu et al. (Citation2013) for their precise statements.
Theorem 3.2

Suppose that Conditions A1–A6 are satisfied and the tuning parameter of the Dantzig selector is suitably chosen. Then the squared estimation error of the regularised SIR estimator is bounded, with probability greater than $1 - O(p^{-\tau})$ for some $\tau > 0$, by a quantity determined by the sparsity level and the restricted isometry and restricted orthogonality constants defined in Condition A6.
Remark 3.2

Theorem 3.2 suggests that a sparse solution can be obtained at a small price, as the squared estimation error of the regularised estimator is optimal up to a logarithmic factor in p. The consistency properties of Lin et al. (Citation2018), Tan et al. (Citation2018) and Tan et al. (Citation2020) are similar to Theorem 3.2, but the threshold for the dimension p can be different.
4. The current literature of variable screening
Although there is a vast literature on applying sufficient dimension reduction for model-free selection, results on screening consistency in the ultrahigh-dimensional setting are scant. Many scholars have therefore concentrated on investigating methods that achieve screening consistency.
4.1. Marginal utility
Yu, Dong, Shao (Citation2016) proposed an approach called marginal SIR for model-free variable selection. Since the central subspace $\mathcal{S}_{Y|\mathbf{x}}$ contains all the regression information between $Y$ and $\mathbf{x}$, Yu, Dong, Shao (Citation2016) considered the diagonal elements of a positive semi-definite matrix $\boldsymbol{\Lambda}$ whose column space coincides with $\mathcal{S}_{Y|\mathbf{x}}$ as marginal utilities for the corresponding predictors. Specifically, let $\mathbf{e}_k$ be the standard unit vector in $\mathbb{R}^p$ with 1 as the kth element and 0 otherwise. They considered the following utility for $x_k$:

(9) $u_k = \mathbf{e}_k^\top \boldsymbol{\Lambda} \mathbf{e}_k.$

Yu, Dong, Shao (Citation2016) refer to $u_k$ as the population-level marginal SIR utility. To apply the Dantzig selector for the estimation of the marginal SIR utility $u_k$, they expressed the columns of $\boldsymbol{\Lambda}$ as solutions of estimating equations involving the covariance and SIR kernel matrices, estimated each column with a Dantzig-selector programme, and plugged the resulting estimate $\hat{\boldsymbol{\Lambda}}$ into (9). For a given threshold γ, the active set is estimated by including the predictors whose estimated utility $\hat{u}_k$ exceeds γ. Yu, Dong, Shao (Citation2016) give an example of a linear regression model with correlated predictors in which one active predictor is marginally independent of the response. Consider five utilities for screening in this example: the marginal absolute Pearson correlation from Fan and Lv (Citation2008), the marginal squared distance correlation utility from Li et al. (Citation2012), the marginal fused Kolmogorov filter utility as defined in (5.3) of Yu, Dong, Shao (Citation2016), the marginal independence SIR utility as defined in (5.1) of Yu, Dong, Shao (Citation2016) and the marginal SIR utility as defined in (9). The first four independence screening methods fail to recover that active predictor; only marginal SIR achieves the desired result.
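The sketch below illustrates slice-based marginal screening in the spirit of the marginal independence SIR utility named above: each predictor is scored by the variability of its within-slice means. The exact utilities in Yu, Dong, Shao (Citation2016) differ in their standardisation and in how the Dantzig selector enters.

```python
import numpy as np

def marginal_sir_screen(X, y, n_slices=10, keep=20):
    """Rank predictors by a slice-based marginal utility and keep the top ones."""
    n, p = X.shape
    x_bar = X.mean(axis=0)
    util = np.zeros(p)
    for idx in np.array_split(np.argsort(y), n_slices):
        diff = X[idx].mean(axis=0) - x_bar
        util += (len(idx) / n) * diff ** 2     # estimates Var{E(x_k | slice)}
    return np.argsort(util)[::-1][:keep]       # indices of the screened predictors
```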
4.2. Trace pursuit
Yu, Dong, Zhu (Citation2016) proposed trace pursuit as a novel approach for model-free variable selection. They first extended classical stepwise regression in linear models and proposed the STP algorithm for model-free variable selection. Furthermore, they proposed the FTP algorithm: after finding a solution path by adding one predictor into the model at a time, a modified Bayesian information criterion (BIC, hereafter) selects a model that is guaranteed to include all important predictors. Finally, the two-stage trace pursuit algorithm uses FTP for initial variable screening.
For a working index set $\mathcal{F}$ and an index $j \notin \mathcal{F}$, we want to test

(10) $H_0: Y \perp\!\!\!\perp x_j \mid \mathbf{x}_{\mathcal{F}}$ versus $H_1: Y \not\perp\!\!\!\perp x_j \mid \mathbf{x}_{\mathcal{F}}$.

For any index set $\mathcal{F}$, denote by $\Sigma_{\mathcal{F}}$ and $M_{\mathcal{F}}$ the covariance and kernel matrices of the subvector $\mathbf{x}_{\mathcal{F}}$. Taking SIR as an example, denote $T_{\mathcal{F}} = \operatorname{tr}(\Sigma_{\mathcal{F}}^{-1} M_{\mathcal{F}})$. Recall that $\mathcal{A}$ denotes the active index set satisfying (1), and $\{1, \ldots, p\}$ denotes the full index set. It is worth noting that, if assumption (W1) holds true, then for any index set $\mathcal{F}$ such that $\mathcal{A} \subseteq \mathcal{F}$, $T_{\mathcal{F}} = T_{\{1, \ldots, p\}}$. This suggests that $T_{\mathcal{F}}$ can be used to capture the strength of the relationship between $Y$ and $\mathbf{x}_{\mathcal{F}}$. Denote $\mathcal{F}_{+j}$ as the index set of j together with all the indices in $\mathcal{F}$. Given that $\mathbf{x}_{\mathcal{F}}$ is already in the model, the trace difference $T_{\mathcal{F}_{+j}} - T_{\mathcal{F}}$ can be used to test the contribution of the additional variable $x_j$ to $Y$. The idea of using the trace difference is similar to the extra sums of squares test in the classical multiple linear regression setting. The following subset linear conditional mean (LCM) assumption is required in Yu, Dong, Zhu (Citation2016):

(11) for any index set $\mathcal{F}$ with $\mathcal{A} \subseteq \mathcal{F}$, $E(\mathbf{x}_{\mathcal{F}} \mid \boldsymbol{\beta}_{\mathcal{F}}^\top \mathbf{x}_{\mathcal{F}})$ is a linear function of $\boldsymbol{\beta}_{\mathcal{F}}^\top \mathbf{x}_{\mathcal{F}}$.
Furthermore, they also provided the STP algorithm:

(a) Initialisation. Set the initial working set to be $\mathcal{F} = \emptyset$.

(b) Forward addition. Find the index $j^* = \arg\max_{j \notin \mathcal{F}} \hat{T}_{\mathcal{F}_{+j}} - \hat{T}_{\mathcal{F}}$. If the corresponding test statistic exceeds its critical value, update $\mathcal{F}$ to be $\mathcal{F}_{+j^*}$.

(c) Backward deletion. Find the index $j^* = \arg\min_{j \in \mathcal{F}} \hat{T}_{\mathcal{F}} - \hat{T}_{\mathcal{F}_{-j}}$. If the corresponding test statistic falls below its critical value, update $\mathcal{F}$ to be $\mathcal{F}_{-j^*}$.

Repeat steps (b) and (c) until no predictors can be added or deleted. A sketch of this procedure is given below.
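The sketch below implements this procedure with the SIR kernel. The fixed thresholds `tau_add` and `tau_del` stand in for the data-driven critical values used in Yu, Dong, Zhu (Citation2016), whose calibration is omitted here.

```python
import numpy as np

def sir_kernel(X, y, n_slices=10):
    """Slice-based estimate of M = Cov{E(x|Y)}."""
    n, p = X.shape
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        d = X[idx].mean(axis=0) - X.mean(axis=0)
        M += (len(idx) / n) * np.outer(d, d)
    return M

def trace_stat(Sigma, M, F):
    """T_F = tr(Sigma_F^{-1} M_F) for an index set F."""
    F = sorted(F)
    return np.trace(np.linalg.solve(Sigma[np.ix_(F, F)], M[np.ix_(F, F)]))

def stepwise_trace_pursuit(X, y, tau_add, tau_del, n_slices=10):
    p = X.shape[1]
    Sigma, M = np.cov(X, rowvar=False), sir_kernel(X, y, n_slices)
    F = set()
    for _ in range(2 * p):                                 # cap the iterations
        changed = False
        base = trace_stat(Sigma, M, F) if F else 0.0
        gains = {j: trace_stat(Sigma, M, F | {j}) - base
                 for j in range(p) if j not in F}
        if gains:                                          # (b) forward addition
            j_add = max(gains, key=gains.get)
            if gains[j_add] > tau_add:
                F.add(j_add)
                changed = True
        if len(F) > 1:                                     # (c) backward deletion
            full = trace_stat(Sigma, M, F)
            losses = {j: full - trace_stat(Sigma, M, F - {j}) for j in F}
            j_del = min(losses, key=losses.get)
            if losses[j_del] < tau_del:
                F.discard(j_del)
                changed = True
        if not changed:
            break
    return sorted(F)
```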
The tests for SAVE and DR can be defined in a parallel fashion if a subset constant conditional variance (CCV) assumption, the analogue of (W2) on subsets containing $\mathcal{A}$, holds together with the subset LCM assumption (11).
Li and Dong (Citation2020) recently extended trace pursuit to matrix-valued predictors. Suppose the response variable $Y$ and the predictor $\mathbf{X} \in \mathbb{R}^{p \times q}$ have the following general relationship:

(12) $Y = g(\mathbf{X}, \epsilon),$

where $g: \mathbb{R}^{p \times q} \times \mathbb{R} \to \mathbb{R}$ is an unknown function and ϵ is independent of $\mathbf{X}$. Assume that $\mathbf{X}$ follows the matrix normal distribution, which is denoted as $\mathbf{X} \sim N_{p \times q}(\boldsymbol{\mu}, \mathbf{U} \otimes \mathbf{V})$ with mean $\boldsymbol{\mu}$. Then, the row covariance matrix is $\mathbf{U}$, and the column covariance matrix is $\mathbf{V}$.

Let $\mathcal{F}_r = \{1, \ldots, p\}$ be the full index set of rows and $\mathbf{x}_j^{(r)}$ be the jth row of $\mathbf{X}$ for $j \in \mathcal{F}_r$. Define the active row set $\mathcal{A}_r$ as the smallest index set such that $Y$ is independent of $\mathbf{X}$ given the rows of $\mathbf{X}$ indexed by $\mathcal{A}_r$. Similarly, let $\mathcal{F}_c = \{1, \ldots, q\}$ be the full index set of columns and $\mathbf{x}_k^{(c)}$ be the kth column of $\mathbf{X}$ for $k \in \mathcal{F}_c$. Define the active column set $\mathcal{A}_c$ analogously. Based on the active row and column predictors, model (12) can be expressed as

$Y = g^*(\mathbf{X}_{\mathcal{A}_r, \mathcal{A}_c}, \epsilon),$

where $g^*: \mathbb{R}^{|\mathcal{A}_r| \times |\mathcal{A}_c|} \times \mathbb{R} \to \mathbb{R}$, with $|\cdot|$ denoting the cardinality of a set, and $\mathbf{X}_{\mathcal{A}_r, \mathcal{A}_c}$ denotes the submatrix of $\mathbf{X}$ that contains the active rows indexed by $\mathcal{A}_r$ and the active columns indexed by $\mathcal{A}_c$. Note that Y depends on $\mathbf{X}$ only through $\mathbf{X}_{\mathcal{A}_r, \mathcal{A}_c}$. Li and Dong (Citation2020) introduced procedures to recover the active row set $\mathcal{A}_r$ in detail. Let $\mathbf{x}_j^{(r)}$, $j = 1, \ldots, p$, be the jth row of $\mathbf{X}$ and $\mathbf{X}_{-j}$ be the matrix that includes all but the jth row of $\mathbf{X}$. To test the importance of $\mathbf{x}_j^{(r)}$, they considered the following row hypotheses:

(13) $H_0: Y \perp\!\!\!\perp \mathbf{x}_j^{(r)} \mid \mathbf{X}_{-j}$ versus $H_1: Y \not\perp\!\!\!\perp \mathbf{x}_j^{(r)} \mid \mathbf{X}_{-j}.$

Under the null hypothesis $H_0$, the response Y depends on $\mathbf{X}$ only through $\mathbf{X}_{-j}$. In the special case of q = 1, $\mathbf{X}$ becomes a p-dimensional vector, and (13) is equivalent to testing the importance of one component of $\mathbf{X}$ given the other p−1 predictors. This special case is known as the marginal coordinate test (Cook, Citation2004). Let Σ and $M$ denote the covariance and kernel matrices associated with the predictor, and let $\Sigma_{-j}$ be the submatrix of Σ that excludes the jth row and the jth column of Σ. Define the following quantity:

$T_j = \operatorname{tr}(\Sigma^{-1} M) - \operatorname{tr}(\Sigma_{-j}^{-1} M_{-j}),$

where $M_{-j}$ excludes the rows and columns corresponding to the jth row of $\mathbf{X}$. This trace difference $T_j$ is the key quantity for testing the importance of the jth row of $\mathbf{X}$, in the same spirit as Yu, Dong, Zhu (Citation2016). Note that $T_j = 0$ under $H_0$.
To develop screening consistency in the ultrahigh-dimensional setting, Zhu et al. (Citation2011) proposed a novel variable screening procedure under a unified model framework, which covers a wide variety of commonly used parametric and semiparametric models. They assumed that $E(\mathbf{x}) = \mathbf{0}$ and $\operatorname{Cov}(\mathbf{x}) = \mathbf{I}_p$ for ease of explanation. It then follows by the law of iterated expectations that $\boldsymbol{\Omega}(y) = E\{\mathbf{x}\, 1(Y < y)\} = E\{E(\mathbf{x} \mid Y) 1(Y < y)\}$. Let $\Omega_k(y)$ be the kth element of $\boldsymbol{\Omega}(y)$, and define

$\omega_k = E\{\Omega_k^2(Y)\}, \quad k = 1, \ldots, p.$

Then $\omega_k$ serves as the population quantity of the proposed marginal utility measure for predictor ranking. Intuitively, one can see that, if $x_k$ and $Y$ are independent, then $x_k$ and the indicator function $1(Y < y)$ change independently; consequently, $\omega_k = 0$. On the other hand, if $x_k$ and $Y$ are related, then $\omega_k$ must be positive. For ease of presentation, they assumed that the sample predictors are all standardised; that is, $n^{-1}\sum_{i=1}^n x_{ik} = 0$ and $n^{-1}\sum_{i=1}^n x_{ik}^2 = 1$ for $k = 1, \ldots, p$. A natural estimator of $\omega_k$ is

$\hat{\omega}_k = \frac{1}{n}\sum_{j=1}^n \Big\{\frac{1}{n}\sum_{i=1}^n x_{ik}\, 1(y_i < y_j)\Big\}^2,$

where $x_{ik}$ denotes the kth element of $\mathbf{x}_i$. The new method does not require imposing a specific model structure on regression functions, and thus is particularly appealing for ultrahigh-dimensional regressions. They showed that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses consistency in ranking, which is both useful in its own right and can lead to consistency in selection. Lin et al. (Citation2017) first introduced a large class of models depending on the smallest non-zero eigenvalue λ of the kernel matrix of SIR, and then derived the minimax rate for estimating the central space over two classes of models; theirs is the first paper to study the minimax estimation of sparse SIR. Furthermore, they showed that the estimator based on the SIR procedure converges at the optimal rate for single index models and multiple index models with fixed structural dimension d, fixed sparsity level s and fixed λ. However, Lin et al. (Citation2017) only considered the projection loss (Li & Wang, Citation2007). More importantly, their theoretical study is based on the assumption that the covariance matrix is diagonal.
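The screening utility of Zhu et al. (Citation2011) above admits a direct vectorised implementation; the sketch below computes $\hat{\omega}_k$ for all predictors at once after standardisation.

```python
import numpy as np

def sirs_utility(X, y):
    """omega_hat_k = (1/n) sum_j {(1/n) sum_i z_ik 1(y_i < y_j)}^2 for each k."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardised predictors
    ind = (y[:, None] < y[None, :]).astype(float)  # ind[i, j] = 1(y_i < y_j)
    inner = Z.T @ ind / n                          # (1/n) sum_i z_ik 1(y_i < y_j)
    return (inner ** 2).mean(axis=1)               # average over j

# rank the predictors by sirs_utility(X, y) and retain the top-ranked ones
```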
Before discussing the consistency property, we need some conditions. Taking Yu, Dong, Shao (Citation2016) as an example, the required conditions include (C1) a coverage condition linking the marginal utilities to the active set, and (C2)–(C5) moment, eigenvalue and signal-strength conditions guaranteeing that the marginal utilities of the active predictors are not too small. For their precise statements, please refer to Yu, Dong, Shao (Citation2016).
Theorem 4.1

Assume the above conditions hold. Then the screening procedure satisfies consistency in variable selection: $P(\hat{\mathcal{A}} = \mathcal{A}) \to 1$ as $n \to \infty$.

Versions of Theorem 4.1 are given in Yu, Dong, Shao (Citation2016), Yu, Dong, Zhu (Citation2016), Lin et al. (Citation2017) and Zhu et al. (Citation2011).
5. Minimax rate
Recently, an impressive range of penalised SIR methods has been proposed to estimate the central subspace in a sparse fashion. However, few of them consider sparse sufficient dimension reduction from a decision-theoretical point of view. To address this issue, Tan et al. (Citation2020) established the minimax rates of convergence for estimating the sparse SIR directions under various loss functions commonly used in the sufficient dimension reduction literature. Lin et al. (Citation2019) introduced a simple Lasso regression method to obtain an estimator of the sufficient dimension reduction space, which is only the first step towards Tan et al. (Citation2020). Moreover, Tan et al. (Citation2020) discovered a possible trade-off between statistical guarantee and computational performance for sparse SIR and proposed an adaptive estimation scheme that is computationally tractable and rate optimal under a condition weaker than that of Lin et al. (Citation2019).
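A schematic version of Lasso-SIR is sketched below: the artificial response is built by averaging the scores of the top SIR eigenvector within slices and rescaling by the eigenvalue, after which a single Lasso regression on the predictors yields a sparse direction estimate. The normalisation and penalty level are illustrative; the exact construction in Lin et al. (Citation2019) differs in its constants.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_sir(X, y, n_slices=10, lasso_penalty=0.05):
    """Sparse direction estimate via an artificial response and one Lasso fit."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        d = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(d, d)
    vals, vecs = np.linalg.eigh(M)                 # ascending eigenvalues
    lam_hat, eta_hat = vals[-1], vecs[:, -1]
    scores = Xc @ eta_hat
    y_tilde = np.empty(n)
    for idx in slices:                             # within-slice averages of scores
        y_tilde[idx] = scores[idx].mean()
    y_tilde /= lam_hat                             # schematic rescaling
    return Lasso(alpha=lasso_penalty, fit_intercept=False).fit(Xc, y_tilde).coef_
```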
As noted in Section 2, the SIR kernel matrix can be estimated from the slice means as

$\hat{M} = \sum_{h=1}^H \frac{n_h}{n} (\bar{\mathbf{x}}_h - \bar{\mathbf{x}})(\bar{\mathbf{x}}_h - \bar{\mathbf{x}})^\top,$

where $\bar{\mathbf{x}}_h$ is the sample mean of the predictors in slice h and $n_h$ is the slice size. Then it is natural to estimate $\mathbf{B}$ by replacing $M$ and $\Sigma$ in (8) with their sample estimators, which yields

(14) $\hat{\mathbf{B}} = \arg\max_{\mathbf{B} \in \mathbb{R}^{p \times d}} \operatorname{tr}(\mathbf{B}^\top \hat{M} \mathbf{B})$ subject to $\mathbf{B}^\top \hat{\Sigma} \mathbf{B} = \mathbf{I}_d$ and $\mathbf{B}$ having at most s nonzero rows.

The solution $\hat{\mathbf{B}}$ of (14) is called the natural sparse SIR estimator. The following theorems establish the lower and upper bounds of the four loss functions for the natural sparse SIR estimator.
Theorem 5.1

Assume that n exceeds a sufficiently large constant multiple of $s \log(ep/s)$. Then there exist positive constants C and c such that every estimator incurs, with probability at least c, a loss of at least C times the minimax rate under each of the four loss functions, uniformly over the class of s-sparse central subspaces; the explicit rate is given in Tan et al. (Citation2020).

Theorem 5.2

Assume that $s \log(ep/s)/n$ is bounded by some sufficiently small constant. Then there exists a positive constant C such that the loss of the natural sparse SIR estimator in (14) is at most C times the minimax rate, with probability tending to one, uniformly over the class of s-sparse central subspaces.
Since SIR can be rewritten in a least-squares formulation, they finally proposed an adaptive estimation scheme for sparse SIR that is computationally tractable and rate optimal. More details about the adaptive sparse SIR estimator can be found in Tan et al. (Citation2020).
6. Further investigation
6.1. Marginal utility
Motivated by Yu, Dong, Shao (Citation2016), we can extend their method to SAVE and DR. The construction parallels marginal SIR: replace the SIR kernel matrix by its SAVE counterpart, define the marginal SAVE utility of predictor k through the corresponding diagonal element as in (9), and estimate it with the Dantzig selector. For a given threshold γ, the active set is estimated by including the predictors whose estimated utility exceeds γ.

Next, we consider marginal DR with the Dantzig selector. The DR kernel matrix can likewise be written in terms of slice-wise first and second inverse moments, so the marginal DR utility of predictor k is again a diagonal element of the corresponding kernel-based matrix, and it is estimated with the Dantzig selector in the same way; for a given threshold γ, the active set is estimated by including the predictors whose estimated utility exceeds γ. Following the proof of Yu, Dong, Shao (Citation2016), we can expect marginal SAVE and marginal DR with the Dantzig selector to achieve selection consistency.
6.2. Minimax rate
Motivated by Tan et al. (Citation2020), we can further investigate the natural sparse SAVE estimator and its upper error bound. Let $\bar{\mathbf{z}}_h$ and $\hat{\Sigma}_h$ be the sample mean and sample covariance of the standardised predictors within slice h; then the SAVE kernel matrix $M_{\mathrm{SAVE}}$ is estimated as

$\hat{M}_{\mathrm{SAVE}} = \sum_{h=1}^H \frac{n_h}{n} (\mathbf{I}_p - \hat{\Sigma}_h)^2.$

Similarly, the DR kernel matrix is estimated by plugging the slice-wise sample moments into the population expression of $M_{\mathrm{DR}}$ given in Section 2. Then it is natural to estimate $\mathbf{B}$ by replacing $M$ and $\Sigma$ in (8) with their sample estimators, which yields

(15) $\hat{\mathbf{B}} = \arg\max_{\mathbf{B} \in \mathbb{R}^{p \times d}} \operatorname{tr}(\mathbf{B}^\top \hat{M} \mathbf{B})$ subject to $\mathbf{B}^\top \hat{\Sigma} \mathbf{B} = \mathbf{I}_d$ and $\mathbf{B}$ having at most s nonzero rows, with $\hat{M}$ taken to be $\hat{M}_{\mathrm{SAVE}}$ or $\hat{M}_{\mathrm{DR}}$.

The solutions $\hat{\mathbf{B}}_{\mathrm{SAVE}}$ and $\hat{\mathbf{B}}_{\mathrm{DR}}$ of (15) are called the natural sparse SAVE and DR estimators. The following theorems establish the lower and upper bounds of the four loss functions for the natural sparse SAVE and DR estimators.
Theorem 6.1

Assume that n exceeds a sufficiently large constant multiple of $s \log(ep/s)$. Then there exist positive constants C and c such that every estimator incurs, with probability at least c, a loss of at least C times the corresponding minimax rate under each of the four loss functions, uniformly over the class of s-sparse central subspaces.

Theorem 6.2

Assume that $s \log(ep/s)/n$ is bounded by some sufficiently small constant. Then there exists a positive constant C such that the losses of $\hat{\mathbf{B}}_{\mathrm{SAVE}}$ and $\hat{\mathbf{B}}_{\mathrm{DR}}$ constructed in (15) are at most C times the corresponding minimax rates, with probability tending to one, uniformly over the class of s-sparse central subspaces.
Following Tan et al. (Citation2020), the estimator in (15) is expected to be rate optimal under the general loss, the projection loss and the prediction loss. Moreover, the natural sparse SAVE estimator $\hat{\mathbf{B}}_{\mathrm{SAVE}}$ and DR estimator $\hat{\mathbf{B}}_{\mathrm{DR}}$ can be regarded as optimal estimators of the SAVE and DR directions. However, the estimation procedure (15) depends on the unknown sparsity parameter s and is computationally infeasible, as it involves an exhaustive search over all subsets of rows subject to the sparsity constraint. Tan et al. (Citation2020) defined a refined sparse SIR estimator based on the fact that SIR can be viewed as transformation-based projection pursuit. Since SAVE and DR cannot be rewritten in a least-squares formulation, we do not define refined sparse SAVE and DR estimators.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Additional information
Notes on contributors
Lu Li
Lu Li is currently a Ph.D. student at the School of Statistics, East China Normal University.
Xuerong Meggie Wen
Dr Xuerong Meggie Wen is currently an associate professor of Statistics at the Department of Mathematics and Statistics, Missouri University of Science and Technology.
Zhou Yu
Dr Zhou Yu is a Professor of Statistics at the School of Statistics, East China Normal University.
References
- Bondell, H. D., & Li, L. (2009). Shrinkage inverse regression estimation for model–free variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(1), 287–299. https://doi.org/10.1111/j.1467-9868.2008.00686.x.
- Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384. https://doi.org/10.1080/00401706.1995.10484371
- Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351. https://doi.org/10.1214/009053606000001523
- Chen, J., Stern, M., Wainwright, M. J., & Jordan, M. I. (2017). Kernel feature selection via conditional covariance minimization. In Advances in Neural Information Processing Systems (pp. 6946–6955).
- Chen, X., Zou, C., & Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6), 3696–3723. https://doi.org/10.1214/10-AOS826
- Cook, R. D. (1996). Graphics for regressions with a binary response. Journal of the American Statistical Association, 91(435), 983–992. https://doi.org/10.1080/01621459.1996.10476968
- Cook, R. D. (1998). Regression graphics. Wiley.
- Cook, R. D. (2004). Testing predictor contributions in sufficient dimension reduction. The Annals of Statistics, 32(3), 1062–1092. https://doi.org/10.1214/009053604000000292
- Cook, R. D., & Forzani, L. (2008). Principal fitted components for dimension reduction in regression. Statistical Science, 23(4), 485–501. https://doi.org/10.1214/08-STS275
- Cook, R. D., & Forzani, L. (2009). Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104(485), 197–208. https://doi.org/10.1198/jasa.2009.0106
- Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association, 100(470), 410–428. https://doi.org/10.1198/016214504000001501
- Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414), 328–332. https://doi.org/10.2307/2290564.
- Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
- Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x
- Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. The Annals of Statistics, 37(4), 1871–1905. https://doi.org/10.1214/08-AOS637
- Fukumizu, K., & Leng, C. (2014). Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109(505), 359–370. https://doi.org/10.1080/01621459.2013.838167
- Fung, W. K., He, X., Liu, L., & Shi, P. (2002). Dimension reduction based on canonical correlation. Statistica Sinica, 12, 1093–1113. https://www.jstor.org/stable/24307017
- Gao, C., Ma, Z., Ren, Z., & Zhou, H. H. (2015). Minimax estimation in sparse canonical correlation analysis. The Annals of Statistics, 43(5), 2168–2197. https://doi.org/10.1214/15-AOS1332
- Gretton, A., Bousquet, O., Smola, A., & Scholkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory (pp. 63–77). Springer.
- Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
- Li, K. C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420), 1025–1039. https://doi.org/10.1080/01621459.1992.10476258
- Li, K. C. (2000). High dimensional data analysis via the SIR/PHD approach. Lecture Note in Progress.
- Li, L. (2007). Sparse sufficient dimension reduction. Biometrika, 94(3), 603–613. https://doi.org/10.1093/biomet/asm044
- Li, Z., & Dong, Y. (2020). Model free variable selection with matrix-valued predictors. Journal of Computational and Graphical Statistics, 27, 1–11. https://doi.org/10.1080/10618600.2020.1806854
- Li, L., & Nachtsheim, C. J. (2006). Sparse sliced inverse regression. Technometrics, 48(4), 503–510. https://doi.org/10.1198/004017006000000129
- Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
- Li, L., & Yin, X. (2008). Sliced inverse regression with regularizations. Biometrics, 64(1), 124–131. https://doi.org/10.1111/j.1541-0420.2007.00836.x
- Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499), 1129–1139. https://doi.org/10.1080/01621459.2012.695654
- Lin, Q., Li, X., Huang, D., & Liu, J. S. (2017). On the optimality of sliced inverse regression in high dimensions. arXiv preprint arXiv:1701.06009.
- Lin, Q., Zhao, Z., & Liu, J. (2019). Sparse sliced inverse regression via Lasso. Journal of the American Statistical Association, 114(528), 1726–1739. https://doi.org/10.1080/01621459.2018.1520115
- Lin, Q., Zhao, Z., & Liu, J. S. (2018). On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2), 580–610. https://doi.org/10.1214/17-AOS1561
- Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
- Ma, Y., & Zhu, L. (2013a). A review on dimension reduction. International Statistical Review, 81(1), 134–150. https://doi.org/10.1111/j.1751-5823.2012.00182.x
- Ma, Y., & Zhu, L. (2013b). Efficient estimation in sufficient dimension reduction. The Annals of Statistics, 41(1), 250–268. https://doi.org/10.1214/12-AOS1072
- Ma, Y., & Zhu, L. (2014). On estimation efficiency of the central mean subspace. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(5), 885–901. https://doi.org/10.1111/rssb.12044
- Ni, L., Cook, R. D., & Tsai, C. L. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92(1), 242–247. https://doi.org/10.1093/biomet/92.1.242
- Qian, W., Ding, S., & Cook, R. D. (2019). Sparse minimum discrepancy approach to sufficient dimension reduction with simultaneous variable selection in ultrahigh dimension. Journal of the American Statistical Association, 114(527), 1277–1290. https://doi.org/10.1080/01621459.2018.1497498
- Tan, K., Shi, L., & Yu, Z. (2020). Sparse SIR: Optimal rates and adaptive estimation. The Annals of Statistics, 48(1), 64–85. https://doi.org/10.1214/18-AOS1791
- Tan, K. M., Wang, Z., Zhang, T., Liu, H., & Cook, R. D. (2018). A convex formulation for high-dimensional sparse sliced inverse regression. Biometrika, 105(4), 769–782. https://doi.org/10.1093/biomet/asy049
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
- Wang, H., & Xia, Y. (2008). Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482), 811–821. https://doi.org/10.1198/016214508000000418
- Wu, Y., & Li, L. (2011). Asymptotic properties of sufficient dimension reduction with a diverging number of predictors. Statistica Sinica, 21(2), 707–730. https://doi.org/10.5705/ss.2011.031a
- Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/1467-9868.03411
- Yin, X., & Cook, R. D. (2002). Dimension reduction for the conditional kth moment in regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2), 159–175. https://doi.org/10.1111/1467-9868.00330
- Yin, X., & Cook, R. D. (2003). Estimating central subspaces via inverse third moments. Biometrika, 90(1), 113–125. https://doi.org/10.1093/biomet/90.1.113
- Yin, X., & Hilafu, H. (2015). Sequential sufficient dimension reduction for large p, small n problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4), 879–892. https://doi.org/10.1111/rssb.12093
- Yin, X., Li, B., & Cook, R. D. (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99(8), 1733–1757. https://doi.org/10.1016/j.jmva.2008.01.006
- Yu, Z., Dong, Y., & Shao, J. (2016). On marginal sliced inverse regression for ultrahigh dimensional model-free feature selection. The Annals of Statistics, 44(6), 2594–2623. https://doi.org/10.1214/15-AOS1424
- Yu, Z., Dong, Y., & Zhu, L. X. (2016). Trace pursuit: A general framework for model-free variable selection. Journal of the American Statistical Association, 111(514), 813–821. https://doi.org/10.1080/01621459.2015.1050494
- Yu, Z., Zhu, L., Peng, H., & Zhu, L. (2013). Dimension reduction and predictor selection in semiparametric models. Biometrika, 100(3), 641–654. https://doi.org/10.1093/biomet/ast005
- Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
- Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942. https://doi.org/10.1214/09-AOS729
- Zhou, J., & He, X. (2008). Dimension reduction based on constrained canonical correlation and variable filtering. The Annals of Statistics, 36(4), 1649–1668. https://doi.org/10.1214/07-AOS529
- Zhu, L. P., Li, L., Li, R., & Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563
- Zhu, L., Miao, B., & Peng, H. (2006). On sliced inverse regression with high-dimensional covariates. Journal of the American Statistical Association, 101(474), 630–643. https://doi.org/10.1198/016214505000001285
- Zhu, L. P., & Zhu, L. X. (2009a). On distribution-weighted partial least squares with diverging number of highly correlated predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 525–548. https://doi.org/10.1111/j.1467-9868.2008.00697.x
- Zhu, L. P., & Zhu, L. X. (2009b). Nonconcave penalized inverse regression in single-index models with high dimensional predictors. Journal of Multivariate Analysis, 100(5), 862–875. https://doi.org/10.1016/j.jmva.2008.09.003
- Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735