Review Article

A selective overview of sparse sufficient dimension reduction

Pages 121-133 | Received 19 Jan 2020, Accepted 23 Sep 2020, Published online: 10 Nov 2020

Abstract

High-dimensional data analysis has been a challenging issue in statistics. Sufficient dimension reduction aims to reduce the dimension of the predictors by replacing the original predictors with a minimal set of their linear combinations without loss of information. However, the estimated linear combinations generally involve all of the variables, which makes them difficult to interpret. To circumvent this difficulty, sparse sufficient dimension reduction methods have been proposed to conduct model-free variable selection or screening within the framework of sufficient dimension reduction. In this paper, we review the current literature on sparse sufficient dimension reduction and present some further investigations.

1. Introduction

The rapid development of data collection technology in areas such as biology, financial econometrics and signal processing has posed a great challenge for traditional multivariate analysis. High-dimensional data analysis has become ubiquitous and increasingly important. Dimension reduction, and in particular sufficient dimension reduction for regression, offers an appealing avenue to tackle high-dimensional problems. It is often desirable to reduce the dimensionality of the problem by replacing the original high-dimensional data with a few linear combinations of the predictors, whose number is usually much smaller than the original dimension. Although sufficient dimension reduction is an effective way to extract relevant information from high-dimensional data sets while retaining the important features or patterns in the data, the linear combinations usually involve all of the original predictors, which makes interpretation difficult. This limitation can be overcome via variable selection, where a subset of relevant predictor variables is selected. Removing the irrelevant variables not only reduces the noise in estimation and alleviates collinearity, but also reduces the computational cost incurred by high-dimensional data.

Variable selection is one of the most important dimension reduction approaches, and many variable selection methods have been developed. The most popular variable selection approaches are developed under the linear model or generalised linear model paradigm, such as the nonnegative garrotte (Breiman, 1995), the least absolute shrinkage and selection operator (Lasso, hereafter) (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD, hereafter) (Fan & Li, 2001), the adaptive Lasso (Zou, 2006), the group Lasso (Yuan & Lin, 2006), the Dantzig selector (Candes & Tao, 2007) and the minimax concave penalty (MCP, hereafter) (Zhang, 2010).

These model-based variable selection methods assume that the underlying true model is known up to a finite-dimensional parameter or that the imposed working model is usefully similar to the true model. However, the true model might take a complex form and is usually unknown. If the underlying modelling assumption is violated, these variable selection methods might fail. Hence, model-free variable selection methods, which do not require full knowledge of the underlying true model, are called for. It has been shown that the general framework of sufficient dimension reduction (SDR, hereafter) is useful for variable selection (Bondell & Li, 2009), since no pre-specified model between the response and the predictors is required. Thus model-free variable selection can be achieved through the framework of SDR (Cook, 1998; Li, 1991, 2000).

Let $X=(X_1,\ldots,X_p)^\top$ be the predictor and $Y$ be the scalar response. The goal of variable selection is to seek the smallest subset of the predictors $X_A$, with partition $X=(X_A^\top,X_{A^c}^\top)^\top$, such that
(1) $Y \perp\!\!\!\perp X_{A^c} \mid X_A$.
Here $A$ denotes a subset of indices of $\{1,\ldots,p\}$ corresponding to the relevant predictor set $X_A$, and $A^c$ is the complement of $A$, i.e., $X_A=\{X_i: i\in A\}$ and $X_{A^c}=\{X_i: i\in A^c\}$. Condition (1) implies that $X_A$ contains all the active predictors in terms of predicting $Y$. The existence and uniqueness of $A$ were discussed in detail in Yin and Hilafu (2015). Ideally, we want to find the smallest index set $A$ satisfying (1), in which case $X_A$ contains no inactive predictors.

Model-free variable selection is closely related to sufficient dimension reduction, which aims to find $\beta\in\mathbb{R}^{p\times d}$ with $d\le p$, such that
(2) $Y \perp\!\!\!\perp X \mid \beta^\top X$,
that is, $Y$ is independent of $X$ conditional on $\beta^\top X$. The column space of such a $\beta$, $\mathrm{Span}(\beta)$, is called a dimension reduction space. Under mild assumptions, such as those given in Cook (1996) and Yin et al. (2008), the intersection of all such spaces is itself a dimension reduction space. In this case, we call the intersection the central subspace for the regression of $Y$ on $X$, and denote it by $\mathcal{S}_{Y|X}$. Its dimension, $d=\dim(\mathcal{S}_{Y|X})$, is usually much smaller than $p$, the dimension of the original predictor. Following the partition of $X$, we can partition $\beta$ accordingly as $\beta=(\beta_A^\top,\beta_{A^c}^\top)^\top$ with $\beta_A\in\mathbb{R}^{|A|\times d}$ and $\beta_{A^c}\in\mathbb{R}^{(p-|A|)\times d}$, where $|A|$ is the cardinality of $A$. Hence, (1) is equivalent to $\beta_{A^c}=0$.

Many methods have been proposed in the literature for estimating the basis of $\mathcal{S}_{Y|X}$, including sliced inverse regression (SIR, hereafter) (Li, 1991), sliced average variance estimation (SAVE, hereafter) (Cook & Weisberg, 1991), principal Hessian directions (PHD, hereafter) (Li, 1992), minimum average variance estimation (MAVE, hereafter) (Xia et al., 2002), directional regression (DR, hereafter) (Li & Wang, 2007), principal fitted components (PFC, hereafter) (Cook & Forzani, 2008), the semiparametric approach (Ma & Zhu, 2012), etc. Several methods have also been suggested for simultaneously selecting the contributing predictors. These include shrinkage SIR (Ni et al., 2005), sparse SIR (Li, 2007; Li & Nachtsheim, 2006), sparse SAVE and sparse PHD (Li, 2007), constrained canonical correlation (Zhou & He, 2008), the general shrinkage strategy for inverse regression estimation (Bondell & Li, 2009), the regularised SIR estimator with the SCAD penalty (Wu & Li, 2011), coordinate-independent sparse estimation (CISE, hereafter) (Chen et al., 2010), conditional covariance minimisation (Chen et al., 2017), etc.

Although the aforementioned methods can select the significant predictors without assuming an underlying parametric model, they are not designed for $p\gg n$ problems, in which the number of predictor variables is larger than the number of observations. The so-called large-$p$-small-$n$ problems are increasingly common with rapid technological advances in data collection and have attracted considerable research interest. We hereby give a very brief review of model-free variable selection via the sufficient dimension reduction approach under the $p\gg n$ setting. Li and Yin (2008) proposed sparse ridge SIR, which combines SIR with both $\ell_1$- and $\ell_2$-regularisation to achieve dimension reduction and variable selection simultaneously, even when $p>n$. Yu et al. (2013) suggested combining SIR with the Dantzig selector (Candes & Tao, 2007) to recover the central subspace in general semiparametric models. A non-asymptotic error bound for the resulting estimator is derived, and the error bound is of order $O_p\{(\log p/n)^{1/2}\}$, which appears to be optimal. Moreover, they proposed another regularised version of SIR with the adaptive Dantzig selector. The resulting estimators defined from variable selection are asymptotically normal even when the predictor dimension diverges to infinity. It is worth mentioning that $|A|$ is fixed in Yu et al. (2013). Yu, Dong and Zhu (2016) proposed trace pursuit for model-free variable selection under the sufficient dimension reduction paradigm. Two distinct algorithms were proposed: stepwise trace pursuit (STP, hereafter) and forward trace pursuit (FTP, hereafter). Stepwise trace pursuit achieves selection consistency with fixed $p$ and is applicable in the setting with $p>n$. Furthermore, forward trace pursuit can serve as an initial screening step to speed up the computation in the case of ultrahigh dimensionality. Li and Dong (2020) extended the trace pursuit method of Yu, Dong and Zhu (2016) to matrix-valued predictors. To test the importance of rows, columns and submatrices of the predictor matrix in terms of predicting the response, three types of hypotheses are formulated under a unified framework. The asymptotic properties of the test statistics under the null hypothesis are established, and a permutation testing algorithm is also introduced to approximate the distribution of the test statistics. Tan et al. (2018) developed a convex formulation for fitting sparse SIR in high dimensions. They solved the resulting convex optimisation problem via a linearised alternating direction method of multipliers algorithm and established an upper bound on the subspace distance between the estimated and the true subspaces. Unlike Yu et al. (2013), Lin et al. (2019) allowed $|A|$ to diverge to infinity. By constructing artificial response variables made up from the top eigenvectors of the estimated conditional covariance matrix, Lin et al. (2019) introduced a simple Lasso regression method to obtain an estimator of the sufficient dimension reduction space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2c^2)$, where $c$ is the generalised signal-to-noise ratio; this is only the first step of Tan et al. (2020). Moreover, Tan et al. (2020) discovered the possible trade-off between statistical guarantee and computational performance for sparse SIR and proposed an adaptive estimation scheme for sparse SIR which is computationally tractable and rate optimal under the condition $\log p=o(n)$, which is weaker than that of Lin et al. (2019).

There is considerable literature on applying sufficient dimension reduction to model-free selection, but the study of screening consistency in the ultrahigh-dimensional setting is still lacking. To fill this gap, Zhu et al. (2011) proposed a variable screening procedure under a unified model framework, which contains a wide variety of commonly used parametric and semiparametric models. The new method does not require imposing a specific model structure on the regression function and thus is particularly appealing for ultrahigh-dimensional regressions. They also showed that the proposed method achieves screening consistency even with the number of predictors growing at an exponential rate of the sample size. Yu, Dong and Shao (2016) proposed an approach called marginal SIR for model-free variable selection. Furthermore, marginal SIR with the Dantzig selector exploits the sparsity structure in the marginal utility and achieves the desirable selection consistency property. Lin et al. (2017) first introduced a large class of models depending on the smallest non-zero eigenvalue of the kernel matrix of SIR and then derived the minimax rate for estimating the central space; theirs is the first paper to study the minimax estimation of sparse SIR. However, they only considered the projection loss (Li & Wang, 2007). More importantly, their theoretical study is based on the assumption that the covariance matrix is diagonal. As far as we know, most of the work mentioned above focuses mainly on SIR with consistency in variable selection. Qian et al. (2019) provided a simultaneous analysis for PFC and SAVE. Furthermore, their approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with $n$. To deliver the most essential messages, in the following sections we focus our discussion on the papers mentioned above.

2. Review of sufficient dimension reduction

Sufficient dimension reduction aims to find the column space of $\beta$ with the smallest dimension $d$. In other words, sufficient dimension reduction is formulated as a problem of estimating a space, instead of the classic statistical problem of estimating parameters. As mentioned in the introduction, there are many approaches in the sufficient dimension reduction literature for estimating the column space of $\beta$: sliced inverse regression (SIR; Li, 1991), sliced average variance estimation (SAVE; Cook & Weisberg, 1991), minimum average variance estimation (MAVE; Xia et al., 2002), the $k$th moment estimation (Yin & Cook, 2002, 2003), inverse regression (Cook & Ni, 2005), directional regression (DR; Li & Wang, 2007), sliced regression (SR; Wang & Xia, 2008), likelihood acquired directions (LAD; Cook & Forzani, 2009), semiparametric approaches (Ma & Zhu, 2012, 2013a, 2013b, 2014), etc. We mainly review three inverse regression-based methods (SIR, SAVE and DR) for estimating $\mathcal{S}_{Y|X}$ for our subsequent investigation.

Inverse regression methods constitute the oldest class of dimension reduction methods and are still under active development. The main idea of inverse regression is to reverse the roles of the response and the predictors (Li, 1991). Instead of considering distributions or expectations of functions of $Y$ conditional on $X$, which suffer from the curse of dimensionality when $X$ is high dimensional, inverse regression-based methods consider expectations of functions of $X$ conditional on $Y$, which is a low-dimensional problem because $Y$ is univariate. The inverse regression-based methods are often based on additional assumptions on the predictors that link the low-dimensional problem to the original high-dimensional problem. These additional assumptions are given as follows.

(W1)

linearity condition: $E(X\mid\beta^\top X)=\Sigma\beta(\beta^\top\Sigma\beta)^{-1}\beta^\top X$;

(W2)

constant variance condition: $\mathrm{cov}(X\mid\beta^\top X)=\Sigma-\Sigma\beta(\beta^\top\Sigma\beta)^{-1}\beta^\top\Sigma$,

where $\Sigma=\mathrm{cov}(X)$. It is well known that SIR only requires condition (W1) to hold, whereas SAVE and DR need both conditions.

When the linearity condition and the constant variance condition are satisfied, the inverse regression methods formulate the problem of estimating $\mathcal{S}_{Y|X}$ as an eigen-decomposition problem. Let $M$ be the kernel matrix of a specific inverse regression based dimension reduction method. For the sufficient dimension reduction methods that aim to estimate $\mathcal{S}_{Y|X}$, the kernel matrices corresponding to the three most well-known inverse regression methods are summarised below:
SIR: $M_{\mathrm{SIR}}=\mathrm{var}\{E(X\mid Y)\}$;
SAVE: $M_{\mathrm{SAVE}}=E\{\Sigma-\mathrm{var}(X\mid Y)\}^2$;
DR: $M_{\mathrm{DR}}=2E^2\{E(X\mid Y)E(X^\top\mid Y)\}+2E\{E(X^\top\mid Y)E(X\mid Y)\}E\{E(X\mid Y)E(X^\top\mid Y)\}+2E\{E^2(XX^\top\mid Y)\}-2\Sigma$.
Assuming $d=\dim(\mathcal{S}_{Y|X})$ is known, the procedure performs a generalised eigenvalue decomposition of the kernel matrix $M$, that is, $M\beta_i=\lambda_i\Sigma\beta_i$, with $\beta_i^\top\Sigma\beta_j=1$ if $i=j$ and $\beta_i^\top\Sigma\beta_j=0$ if $i\ne j$, where $i=1,\ldots,p$ and $\lambda_1\ge\cdots\ge\lambda_d>0=\lambda_{d+1}=\cdots=\lambda_p$ are the eigenvalues. The eigenvectors corresponding to the nonzero eigenvalues, $\beta=(\beta_1,\ldots,\beta_d)$, form a basis of $\mathcal{S}_{Y|X}$. Thus the sufficient dimension reduction directions $\beta$ can also be identified through the following optimisation problem (Tan et al., 2020):
(3) $\hat\beta=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top M B)$ s.t. $B^\top\Sigma B=I_d$.
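To make the eigen-decomposition concrete, the following minimal Python sketch estimates the SIR kernel matrix by slicing $Y$ and then solves the generalised eigenproblem in (3). The equal-size slicing scheme, the number of slices $H$ and the function name are illustrative choices rather than part of any of the cited proposals.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, Y, H=10, d=2):
    """Sliced inverse regression: estimate M = var{E(X|Y)} by slicing Y and
    solve the generalized eigenproblem M b = lambda * Sigma b."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Y), H):   # slices of (roughly) equal size
        p_h = len(idx) / n                          # slice proportion
        m_h = Xc[idx].mean(axis=0)                  # slice mean of the centred predictors
        M += p_h * np.outer(m_h, m_h)               # accumulates var{E(X|Y)}
    # eigh solves M b = lambda Sigma b with eigenvalues in ascending order;
    # the returned eigenvectors already satisfy b' Sigma b = 1
    vals, vecs = eigh(M, Sigma)
    return vecs[:, ::-1][:, :d], vals[::-1][:d]
```

The columns of the returned matrix play the role of $\hat\beta$ in (3); the SAVE or DR kernel matrices can be plugged into the same generalised eigenproblem.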

3. The current literature of variable selection via sufficient dimension reduction

3.1. Oracle property under the setting p<n

In the general framework of condition (1), the shrinkage SIR method was developed in Ni et al. (2005) by applying the Lasso approach to SIR. When a subset of predictors is irrelevant, the corresponding rows of the estimate of $\beta$ are shrunk to zero, and variable selection is thereby achieved. Let $\alpha=(\alpha_1,\ldots,\alpha_p)^\top$, with $\alpha_i\in\mathbb{R}$, $i=1,\ldots,p$, be the shrinkage vector. Then, based on expression (3), the shrinkage vector can be estimated by minimising the following function over $\alpha$ (Ni et al., 2005):
(4) $\hat\alpha=\arg\min_\alpha n\,(\mathrm{Vec}(\hat\beta)-\mathrm{Vec}\{\mathrm{diag}(\alpha)\hat\beta\})^\top\hat M(\mathrm{Vec}(\hat\beta)-\mathrm{Vec}\{\mathrm{diag}(\alpha)\hat\beta\})$, subject to $\sum_{i=1}^p|\alpha_i|\le t$, $t\ge 0$,
where $\hat M$ is the estimator of the kernel matrix $M$. To investigate the asymptotic behaviour, we consider the Lagrangian formulation of the constrained optimisation problem. Specifically, the optimisation problem in expression (4) can be reformulated as $\hat\alpha=\arg\min_\alpha(\|U-W\alpha\|^2+\tau_n\sum_{i=1}^p|\alpha_i|)$ for some non-negative penalty constant $\tau_n$, where $U=n^{1/2}\hat M^{1/2}\mathrm{Vec}(\hat\beta)$ and $W=n^{1/2}\hat M^{1/2}\hat\beta$. Then the central dimension reduction subspace $\mathcal{S}_{Y|X}$ is estimated by $\mathrm{Span}\{\mathrm{diag}(\hat\alpha)\hat\beta\}$. Li (2007) extended the shrinkage SIR method to SAVE and PHD, where the central dimension reduction subspace is estimated in the same way as in Ni et al. (2005), and $\hat\beta$ corresponds to the estimated central dimension reduction directions of SAVE and PHD, respectively. Bondell and Li (2009) proposed a general shrinkage estimation strategy for the entire inverse regression estimation family that is capable of simultaneous sufficient dimension reduction and variable selection. They considered the adaptive Lasso, $\hat\alpha=\arg\min_\alpha\|U-W\alpha\|^2+\tau_n\sum_{i=1}^p w_i|\alpha_i|$, where $w=(w_1,\ldots,w_p)^\top$ is a known weight vector. They also demonstrated that the proposed class of shrinkage estimators has the desirable oracle property of consistency in variable selection while retaining root-$n$ estimation consistency.
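Since the Lagrangian form above is a Lasso problem in $\alpha$ once $U$ and $W$ have been formed, it can be solved with any off-the-shelf Lasso routine. The sketch below assumes $U$ and $W$ are already available and uses scikit-learn's Lasso; the column-rescaling trick for the adaptive weights is our own illustrative device, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def shrink_directions(U, W, tau, weights=None):
    """Solve min_alpha ||U - W alpha||^2 + tau * sum_i w_i |alpha_i|.

    weights=None gives the ordinary Lasso shrinkage of Ni et al. (2005);
    supplying weights gives the adaptive-Lasso variant of Bondell and Li (2009).
    """
    p = W.shape[1]
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    Wt = W / w                                   # column rescaling: a_i = w_i * alpha_i
    m = len(U)
    # sklearn's Lasso minimises (1/(2m))||U - Wt a||^2 + alpha_skl * ||a||_1,
    # so alpha_skl = tau / (2m) reproduces the objective above
    fit = Lasso(alpha=tau / (2 * m), fit_intercept=False).fit(Wt, U)
    return fit.coef_ / w                         # undo the rescaling to recover alpha
```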

However, most existing sparse dimension reduction methods mentioned above are conducted stepwise, estimating a sparse solution for a basis matrix of the central subspace column by column. Instead, Chen et al. (2010) proposed a unified one-step approach to reduce the number of variables appearing in the estimate of $\mathcal{S}_{Y|X}$. Their approach, which relies operationally on Grassmann manifold optimisation, can achieve dimension reduction and variable selection simultaneously. Additionally, the proposed method has the oracle property: under mild conditions, the proposed estimator performs asymptotically as well as if the truly irrelevant predictors were known. More importantly, Chen et al. (2010) extends Bondell and Li (2009), which combined SIR, SAVE and DR with the adaptive Lasso for variable selection. Zhou and He (2008) proposed a constrained canonical correlation procedure ($C^3$) based on imposing the $L_1$-norm constraint on the effective dimension reduction estimates in CANCOR, followed by a simple variable selection method. Using B-spline basis functions generated for the response variable, the CANCOR method (Fung et al., 2002) is asymptotically equivalent to SIR. Suppose that the range of $Y$ is a bounded interval $[a,b]$; given $k_n$ internal knots in $[a,b]$ and the spline order $m$, we generate $m+k_n$ B-spline basis functions. Under the linearity condition, CANCOR estimates a set of effective dimension reduction directions by estimating the canonical variates between the B-spline basis functions and $X$. Since the generated $m+k_n$ B-spline basis functions sum to 1, we use in CANCOR the first $m+k_n-1$ basis functions of $Y$, $\pi(Y)=(\pi_1(Y),\ldots,\pi_{m+k_n-1}(Y))^\top$. Let $\mathbf{X}=(X_1,\ldots,X_n)^\top$ and $\Pi_{n\times(m+k_n-1)}=(\pi(Y_1),\ldots,\pi(Y_n))^\top$ be the data matrices containing the predictor values and the B-spline basis function values. The CANCOR method then estimates the canonical correlations between the columns of $\mathbf{X}$ and the columns of $\Pi$. The dimensionality of the central dimension reduction subspace is selected by performing the following sequential tests on the number of non-zero canonical correlations, $H_0:\gamma_s>\gamma_{s+1}=0$ versus $H_1:\gamma_{s+1}>0$ for $s=0,1,\ldots,p-1$, where the $\gamma_s$ are the asymptotic canonical correlations between $\pi(Y)$ and $X$ in decreasing order. The dimensionality estimate for $d$ is the smallest $s$ such that $H_0$ is not rejected. The CANCOR method actually solves an optimisation problem that sequentially finds the directions $\beta$ with the maximum correlation between $\beta^\top X$ and some functions of $Y$; a sketch of this construction is given below. The procedure of Zhou and He (2008) is attractive because they demonstrated that it also has the oracle property.
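The following sketch illustrates the CANCOR construction: it builds $m+k_n$ B-spline basis functions of $Y$, drops one of them, and computes the canonical correlations between the basis matrix and $X$. The quantile-based knot placement, the default spline order and the whitening-plus-SVD implementation of canonical correlation analysis are our own illustrative choices and are not taken from Fung et al. (2002) or Zhou and He (2008).

```python
import numpy as np
from scipy.interpolate import BSpline

def cancor_directions(X, Y, kn=5, m=3, d=2):
    """Canonical correlations between X and a B-spline basis of Y (CANCOR-style sketch)."""
    n, p = X.shape
    a, b = Y.min(), Y.max()
    interior = np.quantile(Y, np.linspace(0, 1, kn + 2)[1:-1])   # kn internal knots
    t = np.r_[[a] * m, interior, [b] * m]                        # knot vector of length kn + 2m
    nbasis = kn + m                                              # m + kn basis functions
    Pi = np.column_stack([BSpline(t, np.eye(nbasis)[j], m - 1)(Y)
                          for j in range(nbasis)])[:, :-1]       # drop one (they sum to 1)
    Xc, Pc = X - X.mean(axis=0), Pi - Pi.mean(axis=0)
    Sxx, Spp, Sxp = Xc.T @ Xc / n, Pc.T @ Pc / n, Xc.T @ Pc / n
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T                # whitening: Wx' Sxx Wx = I
    Wp = np.linalg.inv(np.linalg.cholesky(Spp)).T
    U, s, _ = np.linalg.svd(Wx.T @ Sxp @ Wp)
    return Wx @ U[:, :d], s[:d]      # estimated directions for X, canonical correlations
```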

The sparse sufficient dimension reduction methods mentioned above focus on the case where $p$ is fixed. For regressions with diverging $p$, estimation and variable selection methods have also been developed within the framework of sufficient dimension reduction. Zhu et al. (2006) studied the asymptotic properties of SIR as $p$ diverges, but their result is for SIR only, and variable selection is not studied at all. Zhu and Zhu (2009a) investigated weighted partial least squares with a diverging $p$, but again variable selection is not derived. Zhu and Zhu (2009b) investigated variable selection with a diverging number of predictors through inverse regression, but focused on single-index models only. By contrast, Wu and Li (2011) established asymptotic properties for a family of inverse regression estimators that includes SIR, studied simultaneous dimension reduction and variable selection with a particular emphasis on the latter and encompassed more general forms, while the number of predictors $p$ is allowed to diverge as the sample size $n$ approaches infinity. Wu and Li (2011) adopted the SCAD-type penalty first introduced by Fan and Li (2001) and combined it with a sufficient dimension reduction estimator, that is, $\hat\alpha=\arg\min_\alpha\|U-W\alpha\|^2+\sum_{i=1}^p p_{\tau_n}(|\alpha_i|)$. The penalties $p_{\tau_n}(\cdot)$ are not necessarily the same for all $i$. Wu and Li (2011) also showed that the penalised estimator selects all truly contributing predictors and excludes all irrelevant ones with probability approaching one.
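For reference, the SCAD penalty of Fan and Li (2001) used by Wu and Li (2011) has the closed form below (with the conventional choice $a=3.7$); this is a direct transcription of the published formula, and the function names are ours.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(|theta|), applied elementwise."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def scad_derivative(theta, lam, a=3.7):
    """p'_lam(|theta|), the quantity used in local linear/quadratic approximations."""
    t = np.abs(np.asarray(theta, dtype=float))
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))
```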

Based on work on kernel dimension reduction, Chen et al. (2017) proposed a method to perform feature selection via a constrained optimisation problem; the corresponding SDR methods can be found in Fukumizu et al. (2009) and Fukumizu and Leng (2014). Many previous kernel approaches are filter methods based on the Hilbert–Schmidt Independence Criterion (HSIC, Gretton et al., 2005). Chen et al. (2017) proposed to use the trace of the conditional covariance operator as a criterion for feature selection. Let $(\mathcal{H}_1,k_1)$ denote an RKHS supported on $\mathcal{X}\subseteq\mathbb{R}^p$. Then the trace of the conditional covariance operator, $\mathrm{Tr}(\Sigma_{YY|X})$, can be interpreted as a dependence measure, as long as $\mathcal{H}_1$ is large enough. The problem of supervised feature selection then reduces to minimising the trace of the conditional covariance operator over subsets of features with controlled cardinality: $\min_{T:|T|=d}Q(T):=\mathrm{Tr}(\Sigma_{YY|X^T})$. They also showed that the empirical estimate of the criterion is consistent as the sample size increases. It is worth noting that kernel feature selection methods have the advantage of capturing nonlinear relationships between the features and the labels.
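A hedged sketch of the trace criterion is given below. It uses Gaussian kernels and one common empirical form of $\mathrm{Tr}(\Sigma_{YY|X^T})$ based on centred Gram matrices; the bandwidths, the ridge parameter and the greedy backward search are illustrative choices and not the exact algorithm of Chen et al. (2017).

```python
import numpy as np

def centred_rbf_gram(A, gamma):
    """Centred Gaussian Gram matrix H K H with K_ij = exp(-gamma ||a_i - a_j||^2)."""
    sq = np.sum(A**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * A @ A.T))
    n = len(A)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cond_cov_trace(X, Y, feats, gamma_x=1.0, gamma_y=1.0, eps=1e-3):
    """Empirical criterion Q(T) ~ eps * tr[G_Y (G_{X_T} + n*eps*I)^{-1}] for feature set T."""
    n = len(Y)
    Gx = centred_rbf_gram(X[:, list(feats)], gamma_x)
    Gy = centred_rbf_gram(np.reshape(Y, (n, -1)), gamma_y)
    return eps * np.trace(np.linalg.solve(Gx + n * eps * np.eye(n), Gy))

def backward_select(X, Y, d):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    increases the criterion the least, until d features remain."""
    feats = list(range(X.shape[1]))
    while len(feats) > d:
        scores = {j: cond_cov_trace(X, Y, [f for f in feats if f != j]) for j in feats}
        feats.remove(min(scores, key=scores.get))
    return feats
```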

Theorem 3.1

Assume $n^{1/2}\{\mathrm{Vec}(\hat\beta)-\mathrm{Vec}(\beta)\}\to N(0,\Gamma)$ for some $\Gamma>0$, and that $\hat M_n^{1/2}=M^{1/2}+o_p(1/\sqrt{n})$. Suppose that $\tau_n\to\infty$ and $\tau_n/\sqrt{n}\to 0$, with $p<n$. Then the shrinkage estimator $\hat\beta$ satisfies

  1. consistency in variable selection, $\Pr(\hat A=A)\to 1$, and

  2. asymptotic normality, $n^{1/2}\{\mathrm{Vec}(\hat\beta_A)-\mathrm{Vec}(\beta_A)\}\to N(0,\Lambda)$ for some $\Lambda>0$.

Remark 3.1

Theorem 3.1, part (1), indicates that the sparse sufficient dimension reduction estimator can select contributing predictors consistently, i.e., for all $i\notin A$ we have $\Pr(\hat\alpha_i\ne 0)\to 0$, and for all $i\in A$ we have $\Pr(\hat\alpha_i\ne 0)\to 1$. Theorem 3.1, part (2), further shows that the estimator of $\beta_A$, which corresponds to the contributing predictors, is root-$n$ consistent. The oracle property in Theorem 3.1 is given in Bondell and Li (2009), Chen et al. (2010), Wu and Li (2011) and Zhou and He (2008). Most of the methods mentioned above cannot achieve the desired property with $p>n$; however, Wu and Li (2011) showed that their proposed method can achieve selection consistency when $p$ diverges as the sample size $n$ goes to infinity. We now turn to the oracle property with $p>n$.

3.2. Oracle property under the setting p ≫ n

Large-$p$-small-$n$ problems appear frequently in fields such as biology, economics and finance. While the variable selection methods above have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and high-frequency finance further push the dimensionality of data to an even larger scale, where $p$ may grow exponentially with $n$. Such ultrahigh-dimensional data present simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability. It is difficult to directly apply the aforementioned variable selection methods to ultrahigh-dimensional statistical learning problems due to their inherent computational complexity. To reduce the predictor dimension in semiparametric regressions, Yu et al. (2013) proposed an $\ell_1$-minimisation of SIR with the Dantzig selector (Candes & Tao, 2007), which is defined as
(5) $\min\|\eta_k\|_1$ such that $\|\hat M\eta_k-\nu_k\hat\Sigma\eta_k\|_\infty\le\zeta_k$, $|\eta_k^\top\hat\Sigma\eta_k-1|\le\zeta_k$, $|\eta_k^\top\hat\Sigma\eta_l|\le\zeta_k$ $(l=1,\ldots,k-1)$,
where $k=1,\ldots,d$, $\nu_k=\eta_k^\top\hat M\eta_k$, $\|\eta\|_1=\sum_{i=1}^p|\eta_i|$ and $\eta_0$ is a $p\times 1$ zero vector. Furthermore, they established a non-asymptotic error bound for the resulting estimator when $|A|$ is fixed. Yu et al. (2013) also extended the regularisation concept to SIR with an adaptive Dantzig selector, which is defined by
(6) $\min\|W_k\eta_k\|_1$ such that $\|W_k^{-1}(\hat M\hat\beta_k^0-\hat\lambda_k^0\hat\Sigma\eta_k)\|_\infty\le\zeta_k$,
where $W_k=\mathrm{diag}(w_{k1},\ldots,w_{kp})$ is a known weight matrix and $w_{kj}$ is a specified positive value, which should vary inversely with the magnitude of $\hat\beta_{kj}^0$. Yu et al. (2013) proposed a two-step estimation procedure to select the contributing predictors. In the first step, they screen out informative predictors based on (5); this is called Dantzig selector based SIR. In the second step, they enhance the sparsity and the estimation efficiency with (6), based on the predictors selected in the first step; this is called iterative adaptive Dantzig selector based SIR. This ensures that all contributing predictors are selected with high probability and that the resulting estimator is asymptotically normal even when the predictor dimension diverges to infinity.
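The computational core of (5) and (6) is a Dantzig-selector-type linear program. The sketch below solves the generic problem $\min\|\eta\|_1$ subject to $\|A\eta-b\|_\infty\le\zeta$ with scipy's linear programming routine; the additional normalisation and orthogonality constraints of (5) are omitted, so this should be read as an illustration of the $\ell_1$/linear-programming machinery rather than as the estimator of Yu et al. (2013).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, b, zeta):
    """Solve min ||eta||_1  s.t.  ||A eta - b||_inf <= zeta  as a linear program."""
    m, p = A.shape
    # write eta = u - v with u, v >= 0 and minimise sum(u) + sum(v)
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([A, -A]),     #  A(u - v) - b <= zeta
                      np.hstack([-A, A])])    # -A(u - v) + b <= zeta
    b_ub = np.concatenate([zeta + b, zeta - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    uv = res.x
    return uv[:p] - uv[p:]
```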

However, there is a gap between the optimisation problem and the theoretical results: there is no guarantee that the estimator obtained from solving the proposed biconvex optimisation problem is the global minimum. Most existing work in the high-dimensional sufficient dimension reduction literature involves nonconvex optimisation problems. Moreover, such methods seek to estimate a set of reduced predictors that are not identifiable by definition, rather than the central subspace. Yin and Hilafu (2015) proposed a sequential approach for estimating high-dimensional SIR. Both of their proposals are stepwise procedures that do not correspond to solving a convex optimisation problem. Moreover, as discussed in Yin and Hilafu (2015), theoretical properties of their estimators are hard to establish due to the sequential procedure used to obtain them. In the high-dimensional setting, Lin et al. (2018) proposed a screening approach to perform variable selection and established an error bound for the estimators, which allows $|A|$ to go to infinity. The selected variables are then used to fit classic SIR. Furthermore, the resulting algorithm is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2c^2)$, where $c$ is the generalised signal-to-noise ratio. Tan et al. (2018) proposed a convex formulation for sparse SIR in the high-dimensional setting by adapting techniques from sparse canonical correlation analysis. Their proposal estimates the central subspace directly and performs variable selection simultaneously. Moreover, the proposed method can be adapted to sufficient dimension reduction methods that can be formulated as generalised eigenvalue problems.

As mentioned in the introduction, most of the literature focuses mainly on SIR with consistency in variable selection. Qian et al. (2019) proposed methods under a unified minimum discrepancy framework with regularisation. Consistency results for both central subspace estimation and variable selection are established simultaneously for several well-known SDR methods, including SIR, PFC and SAVE. More importantly, their approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with $n$. Unlike many high-dimensional SDR methods, their method does not necessarily require a sparsity condition on the predictor covariance matrix or an upper bound on its maximum eigenvalue. Furthermore, they developed a new algorithm that can efficiently solve a general class of high-dimensional sparse minimum discrepancy problems.

Many SDR methods can be rewritten as a minimisation problem with an objective function of the form
(7) $L_{1n}(\Gamma,V)=\mathrm{tr}\{(\gamma_n-\Sigma_n\Gamma V)^\top\Omega_n(\gamma_n-\Sigma_n\Gamma V)\}$,
where $\gamma_n$, $\Sigma_n$ and $\Omega_n$ are sample estimates of the population matrices $\tilde M$, $W$ and $W^{-1}$, respectively. Here, $\tilde M$ is a $p\times l$ kernel matrix associated with a particular SDR method, where $d\le l\le p$, $W$ is some $p\times p$ positive definite matrix, and $\Gamma\in\mathbb{R}^{p\times d}$ and $V\in\mathbb{R}^{d\times l}$ represent parameters to be estimated by minimising $L_{1n}$. The general form of (7) is an adaptation of the minimum discrepancy approach proposed by Cook and Ni (2005). To identify the correct sparsity structure of $\mathcal{S}_{Y|X}$ under $p\gg n$ scenarios, Qian et al. (2019) adopted a coordinate-independent regularisation approach and imposed the penalty $P_V(\Gamma)$ with tuning parameter $\lambda_n$ on (7) under the constraint $VV^\top=I_d$, giving the objective function $L_{2n}(\Gamma,V)=\frac{1}{2}\mathrm{tr}\{(\gamma_n-\Sigma_n\Gamma V)^\top\Omega_n(\gamma_n-\Sigma_n\Gamma V)\}+\lambda_nP_V(\Gamma)$, subject to $VV^\top=I_d$. Given its minimiser $(\hat\Gamma,\hat V)=\arg\min_{\Gamma,V}L_{2n}(\Gamma,V)$, they simultaneously estimate $\mathcal{S}_{Y|X}$ by $\mathrm{Span}(\hat\Gamma)$ and estimate $A_0$ by $\hat A_0=\{1\le j\le p: e_j^\top\hat\Gamma\hat\Gamma^\top e_j>0\}$.

Tan et al. (2020) considered the following four loss functions:

  1. General loss. $L_G(\hat\beta,\beta)=\|\hat\beta\hat\beta^\top-\beta\beta^\top\|$;

  2. Projection loss. $L_P(\hat\beta,\beta)=\|\hat\beta(\hat\beta^\top\hat\beta)^{-1}\hat\beta^\top-\beta(\beta^\top\beta)^{-1}\beta^\top\|_F^2$;

  3. Prediction loss. $L_X(\hat\beta,\beta)=\inf_{W\in\mathbb{R}^{d\times d}}\|\Sigma^{1/2}(\hat\beta-\beta W)\|_F^2$;

  4. Correlation loss. $L_C(\hat\beta,\beta)=1-\frac{1}{d}\mathrm{Tr}\{(\hat\beta^\top\Sigma\hat\beta)^{-1}(\hat\beta^\top\Sigma\beta)(\beta^\top\Sigma\beta)^{-1}(\beta^\top\Sigma\hat\beta)\}$,

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Further, Tan et al. (2020) established the minimax lower bound for sparse SIR under the general loss, the projection loss and the prediction loss. They proposed a natural sparse SIR estimator and proved that its upper error bound under all four loss functions matches the minimax lower bound, which implies that it is a rate-optimal estimator for sparse SIR. However, this optimal estimator is computationally intractable. They therefore developed a computationally feasible counterpart of the natural sparse SIR estimator through convex relaxation. Their theoretical investigation, however, suggests that this computational realisation of the natural sparse SIR estimator cannot maintain the optimal estimation rate.

To further address this issue, they proposed a refined sparse SIR estimator. The refined sparse SIR estimator is also rate-optimal yet computationally intractable; however, its computationally feasible counterpart based on an adaptive estimation procedure is proven to be nearly rate-optimal. Compared to Lasso-SIR (Lin et al., 2019), which was shown to be rate optimal only when $p=o(n^2)$, their sparse SIR approach is rate optimal even when $\log p=o(n)$. Therefore, their proposed sparse SIR estimator enjoys a much wider range of applications. The reason why Lasso-SIR fails to work when only $\log p=o(n)$ is that it requires the estimation of the eigenvalues and eigenvectors of the $p\times p$ non-sparse SIR kernel matrix. It is well known that the sample eigenvalues and eigenvectors are not even consistent when $p/n$ has a nonzero limit as $n\to\infty$. In summary, the minimax lower bound, the two rate-optimal yet computationally infeasible estimators, the two corresponding computationally tractable counterparts, and the theoretical upper bounds of the four estimators under the four loss functions together provide a thorough understanding of sparse SIR. It is also worth noting that Lin et al. (2019) is just the first step of Tan et al. (2020). Bondell and Li (2009) demonstrated that $\mathrm{Supp}(\beta)=A$, so the sparse representation of SIR relies on $|A|$, the number of truly relevant predictors, where $\mathrm{Supp}(\beta)$ denotes the support of $\beta$. Assuming $|A|\le s$, sparse SIR is further defined through seeking $\beta$ such that
(8) $\beta=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top M B)$ s.t. $B^\top\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$.
The above formulation of sparse SIR takes a similar form to that of sparse CCA (Gao et al., 2015). To obtain theoretical results, the following conditions are required.

(A1)

the conditional mean $E\{X-E(X)\mid\beta^\top X\}$ is linear in $\beta^\top X$;

(A2)

$|A_k|$ is bounded for $k=1,\ldots,d$;

(A3)

the nonzero eigenvalues $\lambda_1,\ldots,\lambda_d$ are distinct;

(A4)

there exists a positive constant $a_0$ such that $0<a_0<1/4$, $\log p/n\le a_0$ and $E(\exp[t\{X_i-E(X_i)\}^2])\le K<\infty$ for $i=1,\ldots,p$ and all $|t|\le a_0$;

(A5)

$D_0=\max_{1\le i\le p}\sum_{j=1}^p|\sigma_{ij}|$ is a bounded constant as $p\to\infty$;

(A6)

the restricted isometry and restricted orthogonality constants $\delta_{2S_k}^{A_k}$ and $\theta_{S_k,2S_k}^{A_k}$ satisfy $\delta_{2S_k}^{A_k}+\theta_{S_k,2S_k}^{A_k}<1$, where $A_k=M-\lambda_k\Sigma$.

See Yu et al. (2013) for more details.

Theorem 3.2

Suppose that Conditions (A1)–(A6) are satisfied and $\zeta_k=C_0(\log p/n)^{1/2}$. Then $\|\hat\beta_k-\beta_k\|^2\le 4C^2S_k(1-\delta_{2S_k}^{A_k}-\theta_{S_k,2S_k}^{A_k})^{-2}\frac{\log p}{n}$ with probability greater than $1-\frac{5}{8}p^{-\tau}$ for some $\tau$ greater than $(\log p)^{-1}\log\frac{5}{8}$, where $A_k$, $\delta_{2S_k}^{A_k}$ and $\theta_{S_k,2S_k}^{A_k}$ are defined in Condition (A6).

Remark 3.2

Theorem 3.2 suggests that a sparse solution can be obtained at a small price, as the squared estimation error of the regularised estimator is optimal up to a factor of $\log p$. The consistency properties established in Lin et al. (2018), Tan et al. (2018) and Tan et al. (2020) are similar to Theorem 3.2, but the threshold for $\|\hat\beta_k-\beta_k\|^2$ can be different.

4. The current literature of variable screening

Although there is a vast literature on applying sufficient dimension reduction to model-free selection, results on screening consistency in the ultrahigh-dimensional setting are scant. Therefore, many researchers have concentrated on developing methods that achieve screening consistency.

4.1. Marginal utility

Yu, Dong and Shao (2016) proposed an approach called marginal SIR for model-free variable selection. Since $M$ contains all the regression information between $Y$ and $X$, Yu, Dong and Shao (2016) considered the diagonal elements of $M$ as the marginal utilities of the corresponding predictors. Specifically, let $e_k$ be the standard unit vector in $\mathbb{R}^p$ with 1 as the $k$th element and 0 otherwise. They considered the following utility for $X_k$:
(9) $m_k=e_k^\top\Sigma^{-1}M\Sigma^{-1}e_k$.
Yu, Dong and Shao (2016) refer to $m_k$ as the population level marginal SIR utility. To apply the Dantzig selector to the estimation of the marginal SIR utility $m_k$, they defined $p_\ell=E\{\mathbb{1}(Y\in J_\ell)\}$, $\ell=1,\ldots,H$. Let $\mu_\ell=E\{X\mathbb{1}(Y\in J_\ell)\}$. Then $M_{\mathrm{SIR}}=\sum_{\ell=1}^H p_\ell E(X\mid Y\in J_\ell)E(X^\top\mid Y\in J_\ell)$ can be written as $M_{\mathrm{SIR}}=\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell$. Therefore $m_k^{\mathrm{SIR}}=e_k^\top\sum_{\ell=1}^H\nu_\ell\nu_\ell^\top/p_\ell\,e_k$, with $\nu_\ell=\Sigma^{-1}\mu_\ell$. The marginal utility $m_k$ is estimated by $\hat m_k^{\mathrm{SIR}}=e_k^\top\sum_{\ell=1}^H\hat\nu_\ell\hat\nu_\ell^\top/\hat p_\ell\,e_k$, where $\hat\Sigma=\sum_{i=1}^nX_iX_i^\top/n$ and $\hat\mu_\ell=\sum_{i=1}^nX_i\mathbb{1}(Y_i\in J_\ell)/n$. For a given threshold $b_n$, the active set $A$ is estimated by including the predictors whose $\hat m_k^{\mathrm{SIR}}$ exceeds $b_n$, i.e., $\hat A=\{k:\hat m_k^{\mathrm{SIR}}\ge b_n\}$. Yu, Dong and Shao (2016) give an example in which $X=(X_1,\ldots,X_p)^\top\sim N(0,\Sigma)$. Let $\mathrm{var}(X_i)=1$, $\mathrm{cov}(X_i,X_j)=0.6$ for $|i-j|=1$, and $\mathrm{cov}(X_i,X_j)=0$ for $|i-j|>1$, $1\le i,j\le p$. Let $Y=\beta^\top X+\varepsilon$, where $\varepsilon\sim N(0,1)$ is independent of $X$ and $\beta=(1.2,-2,0,\ldots,0)^\top$. The active set for this linear regression model is $A=\{1,2\}$. Consider five utilities for $X_1$: the marginal absolute Pearson correlation from Fan and Lv (2008), the marginal squared distance correlation utility from Li et al. (2012), the marginal fused Kolmogorov filter utility as defined in (5.3) of Yu, Dong and Shao (2016), the marginal independence SIR utility as defined in (5.1) of Yu, Dong and Shao (2016) and the marginal SIR utility as defined in (9). Unfortunately, the first four independence screening methods fail to recover the active predictor $X_1$ (its marginal correlation with $Y$ is zero because $1.2-2\times0.6=0$); only marginal SIR achieves the desired result.
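The sketch below reproduces the spirit of this example. Because a covariance matrix with unit variances, first-lag correlation 0.6 and zeros elsewhere need not be positive definite for larger $p$, the sketch uses an AR(1) covariance with $\mathrm{corr}(X_i,X_j)=0.6^{|i-j|}$, which preserves the key cancellation $\mathrm{cov}(X_1,Y)=1.2-2\times0.6=0$; the sample size, the number of slices and this modification are our own illustrative choices.

```python
import numpy as np

def marginal_sir_utilities(X, Y, H=10):
    """Estimate the marginal SIR utilities m_k of (9) by slicing Y (X is centred first)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma_inv = np.linalg.inv(Xc.T @ Xc / n)
    m = np.zeros(p)
    for idx in np.array_split(np.argsort(Y), H):
        p_l = len(idx) / n                         # hat{p}_l
        mu_l = Xc[idx].sum(axis=0) / n             # hat{mu}_l = (1/n) sum_i X_i 1(Y_i in J_l)
        nu_l = Sigma_inv @ mu_l                    # hat{nu}_l
        m += nu_l**2 / p_l                         # diagonal of nu_l nu_l' / p_l
    return m

rng = np.random.default_rng(0)
n, p = 2000, 10
Sigma = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = 1.2 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)
pearson = np.abs([np.corrcoef(X[:, k], Y)[0, 1] for k in range(p)])
print(np.round(pearson, 2))                        # the first entry is close to zero
print(np.round(marginal_sir_utilities(X, Y), 2))   # the first two entries stand out
```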

4.2. Trace pursuit

Yu, Dong and Zhu (2016) proposed trace pursuit as a novel approach for model-free variable selection. They first extended classical stepwise regression in linear models and proposed the STP algorithm for model-free variable selection. Furthermore, they proposed the FTP algorithm: after finding a solution path by adding one predictor into the model at a time, a modified Bayesian information criterion (BIC, hereafter) provides a chosen model that is guaranteed to include all important predictors. Finally, the two-stage trace pursuit algorithm uses FTP for initial variable screening.

For a working index set $F$ and an index $j\in F^c$, consider testing
(10) $H_0: Y\perp\!\!\!\perp X_j\mid X_F$ vs. $H_a$: $Y$ is not independent of $X_j$ given $X_F$.
For any index set $F$, denote $X_F=\{X_i:i\in F\}$ and $\mathrm{var}(X_F)=\Sigma_F$. Taking SIR as an example, denote $M_F^{\mathrm{SIR}}=\Sigma_F^{-1/2}\mathrm{var}\{E(X_F\mid Y)\}\Sigma_F^{-1/2}$. Recall that $A$ denotes the active index set satisfying $Y\perp\!\!\!\perp X_{A^c}\mid X_A$, and $I=\{1,\ldots,p\}$ denotes the full index set. It is worth noting that, if assumption (W1) holds true, then for any index set $F$ such that $A\subseteq F\subseteq I$, $\mathrm{tr}(M_A^{\mathrm{SIR}})=\mathrm{tr}(M_F^{\mathrm{SIR}})=\mathrm{tr}(M_I^{\mathrm{SIR}})$. This suggests that $\mathrm{tr}(M_F^{\mathrm{SIR}})$ can be used to capture the strength of the relationship between $Y$ and $X_F$. Denote $F\cup\{j\}$ as the index set consisting of $j$ together with all the indices in $F$. Given that $X_F$ is already in the model, the trace difference $\mathrm{tr}(M_{F\cup\{j\}}^{\mathrm{SIR}})-\mathrm{tr}(M_F^{\mathrm{SIR}})$ can be used to test the contribution of the additional variable $X_j$ to $Y$. The idea of using the trace difference is similar to the extra sums of squares test in the classical multiple linear regression setting. The following subset LCM assumption is required in Yu, Dong and Zhu (2016):
(11) $E(X_j\mid X_F)$ is a linear function of $X_F$ for any $F\subset I$ and $j\in F^c$.
Furthermore, they also provided the STP algorithm, which proceeds as follows (a code sketch is given after the steps):

  1. Initialisation. Set the initial working set to be $F=\emptyset$.

  2. Forward addition. Find the index $a_F$ such that $a_F=\arg\max_{j\in F^c}\mathrm{tr}(\hat M_{F\cup\{j\}}^{\mathrm{SIR}})$. If $T_{a_F|F}^{\mathrm{SIR}}=n\{\mathrm{tr}(\hat M_{F\cup\{a_F\}}^{\mathrm{SIR}})-\mathrm{tr}(\hat M_F^{\mathrm{SIR}})\}>c^{\mathrm{SIR}}$, update $F$ to be $F\cup\{a_F\}$.

  3. Backward deletion. Find the index $d_F$ such that $d_F=\arg\max_{j\in F}\mathrm{tr}(\hat M_{F\setminus\{j\}}^{\mathrm{SIR}})$. If $T_{d_F|F\setminus\{d_F\}}^{\mathrm{SIR}}=n\{\mathrm{tr}(\hat M_F^{\mathrm{SIR}})-\mathrm{tr}(\hat M_{F\setminus\{d_F\}}^{\mathrm{SIR}})\}<c^{\mathrm{SIR}}$, update $F$ to be $F\setminus\{d_F\}$.

  4. Repeat steps 2 and 3 until no predictors can be added or deleted.
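A compact sketch of the STP iteration is given below. It estimates $\mathrm{tr}(\hat M_F^{\mathrm{SIR}})$ by slicing $Y$, as above; the threshold c_sir is taken as user supplied (Yu, Dong and Zhu (2016) calibrate it from the asymptotic null distribution of the test statistic), and the slicing scheme and the iteration cap are illustrative choices.

```python
import numpy as np

def sir_trace(X, Y, F, H=10):
    """Estimate tr(M_F^SIR) = tr(Sigma_F^{-1} var{E(X_F|Y)}) by slicing Y."""
    n = len(Y)
    cols = sorted(F)
    XF = X[:, cols] - X[:, cols].mean(axis=0)
    Sigma_F = XF.T @ XF / n
    M = np.zeros((len(cols), len(cols)))
    for idx in np.array_split(np.argsort(Y), H):
        p_h = len(idx) / n
        m_h = XF[idx].mean(axis=0)
        M += p_h * np.outer(m_h, m_h)
    return np.trace(np.linalg.solve(Sigma_F, M))

def stepwise_trace_pursuit(X, Y, c_sir, H=10, max_iter=50):
    """Stepwise trace pursuit (STP) sketch: alternate forward addition and backward deletion."""
    n, p = X.shape
    F = set()
    for _ in range(max_iter):
        changed = False
        cand = [j for j in range(p) if j not in F]
        if cand:                                   # forward addition
            a_F = max(cand, key=lambda j: sir_trace(X, Y, F | {j}, H))
            base = sir_trace(X, Y, F, H) if F else 0.0
            if n * (sir_trace(X, Y, F | {a_F}, H) - base) > c_sir:
                F.add(a_F); changed = True
        if len(F) > 1:                             # backward deletion
            d_F = max(F, key=lambda j: sir_trace(X, Y, F - {j}, H))
            if n * (sir_trace(X, Y, F, H) - sir_trace(X, Y, F - {d_F}, H)) < c_sir:
                F.discard(d_F); changed = True
        if not changed:
            break
    return sorted(F)
```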

The tests for SAVE and DR can be defined in a parallel fashion if the following CCV assumption holds together with the subset LCM assumption (11): $\mathrm{var}(X_j\mid X_F)$ is nonrandom for any $F\subset I$ and $j\in F^c$. Li and Dong (2020) recently extended trace pursuit to matrix-valued predictors. Suppose the response variable $Y\in\mathbb{R}$ and the predictor $X\in\mathbb{R}^{p\times q}$ have the following general relationship:
(12) $Y=g(X)+\varepsilon$,
where $g:\mathbb{R}^{p\times q}\to\mathbb{R}$ is an unknown function, $\varepsilon$ is independent of $X$, and $E(\varepsilon)=0$. Assume that $X$ follows the matrix normal distribution, denoted as $X\sim N_{p,q}(\mu,U,V)$ with $\mu\in\mathbb{R}^{p\times q}$, $U\in\mathbb{R}^{p\times p}$ and $V\in\mathbb{R}^{q\times q}$. Then the row covariance matrix is $U=E\{(X-\mu)(X-\mu)^\top\}/\mathrm{tr}(V)$, and the column covariance matrix is $V=E\{(X-\mu)^\top(X-\mu)\}/\mathrm{tr}(U)$.

Let $I_{\mathrm{row}}=\{1,\ldots,p\}$ be the full index set of rows and $X_{j,\cdot}$ be the $j$th row of $X$ for $j=1,\ldots,p$. Define the active row set $A$ as $A=\{j\in I_{\mathrm{row}}: Y\text{ depends on }X_{j,\cdot}\text{ in model (12)}\}$. Similarly, let $I_{\mathrm{col}}=\{1,\ldots,q\}$ be the full index set of columns and $X_{\cdot,k}$ be the $k$th column of $X$ for $k=1,\ldots,q$. Define the active column set $B$ as $B=\{k\in I_{\mathrm{col}}: Y\text{ depends on }X_{\cdot,k}\text{ in model (12)}\}$. Based on the active row and column predictors, model (12) can be expressed as $Y=g(X_{A,B})+\varepsilon$, where $g:\mathbb{R}^{|A|\times|B|}\to\mathbb{R}$, with $|\cdot|$ denoting the cardinality of a set, and $X_{A,B}$ denotes the submatrix of $X$ that contains the active rows indexed by $A$ and the active columns indexed by $B$. Note that $Y$ depends on $X$ only through $X_{A,B}$. Li and Dong (2020) introduced procedures to recover the active row set $A$ in detail. Let $X_{j,\cdot}$, $j=1,\ldots,p$, be the $j$th row of $X$ and $X_{-j,\cdot}\in\mathbb{R}^{(p-1)\times q}$ be the matrix that includes all but the $j$th row of $X$. To test the importance of $X_{j,\cdot}$, they considered the following row hypotheses:
(13) $H_{0,\{j\}}^{\mathrm{row}}: Y\perp\!\!\!\perp X\mid X_{-j,\cdot}$ vs. $H_{a,\{j\}}^{\mathrm{row}}$: $Y$ is not independent of $X$ given $X_{-j,\cdot}$.
Under the null hypothesis $H_{0,\{j\}}^{\mathrm{row}}$, the response $Y$ depends on $X$ only through $X_{-j,\cdot}$. In the special case of $q=1$, $X$ becomes a $p$-dimensional vector, and (13) is equivalent to testing the importance of one component of $X$ given the other $p-1$ predictors. This special case is known as the marginal coordinate test (Cook, 2004). Let $U_{-j,-j}\in\mathbb{R}^{(p-1)\times(p-1)}$ be the submatrix of $U$ that excludes the $j$th row and the $j$th column of $U$. Define the following quantity: $\delta_j^{\mathrm{row}}=\mathrm{tr}(M)-\mathrm{tr}(M_{-j,\cdot})$, where $M=U^{-1}E(XY)V^{-1}E(X^\top Y)$ and $M_{-j,\cdot}=U_{-j,-j}^{-1}E(X_{-j,\cdot}Y)V^{-1}E(X_{-j,\cdot}^\top Y)$. This trace difference $\delta_j^{\mathrm{row}}$ is the key quantity for testing the importance of the $j$th row of $X$, in the same spirit as Yu, Dong and Zhu (2016). Note that $\delta_j^{\mathrm{row}}=0$ under $H_{0,\{j\}}^{\mathrm{row}}$.

To develop screening consistency in the ultrahigh-dimensional setting, Zhu et al. (2011) proposed a novel variable screening procedure under a unified model framework, which covers a wide variety of commonly used parametric and semiparametric models. They assumed that $E(X_i)=0$ and $\mathrm{var}(X_i)=1$ for $i=1,\ldots,p$ and defined $\Omega(y)=E\{XF(y\mid X)\}$, where $F(y\mid X)=P(Y<y\mid X)$, for ease of explanation. It then follows from the law of iterated expectations that $\Omega(y)=\mathrm{cov}\{X,\mathbb{1}(Y<y)\}$. Let $\Omega_i(y)$ be the $i$th element of $\Omega(y)$, and define $\omega_i=E\{\Omega_i^2(Y)\}$, $i=1,\ldots,p$. Then $\omega_i$ serves as the population quantity of the proposed marginal utility measure for predictor ranking. Intuitively, if $X_i$ and $Y$ are independent, then $X_i$ and the indicator function $\mathbb{1}(Y<y)$ are uncorrelated for every $y$, and consequently $\omega_i=0$. On the other hand, if $X_i$ and $Y$ are related, then $\omega_i$ must be positive. For ease of presentation, they assumed that the sample predictors are all standardised; that is, $n^{-1}\sum_{j=1}^nX_{ji}=0$ and $n^{-1}\sum_{j=1}^nX_{ji}^2=1$ for $i=1,\ldots,p$. A natural estimator of $\omega_i$ is $\tilde\omega_i=\frac{1}{n}\sum_{j=1}^n\{\frac{1}{n}\sum_{k=1}^nX_{ki}\mathbb{1}(Y_k<Y_j)\}^2$, $i=1,\ldots,p$, where $X_{ki}$ denotes the $k$th observation of $X_i$. The new method does not require imposing a specific model structure on the regression function, and thus is particularly appealing for ultrahigh-dimensional regressions. They showed that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses consistency in ranking, which is both useful in its own right and can lead to consistency in selection. Lin et al. (2017) first introduced a large class of models depending on the smallest non-zero eigenvalue $\lambda$ of the kernel matrix of SIR, and then derived the minimax rate for estimating the central space over two classes of models; theirs is the first paper to study the minimax estimation of sparse SIR. Furthermore, they showed that the estimator based on the SIR procedure converges at the rate $d\{(sd+s\log(ep/s))/(n\lambda)\}$, which is the optimal rate for single index models and multiple index models with fixed structural dimension $d$, fixed $s=|A|$ and fixed $\lambda$. However, Lin et al. (2017) only considered the projection loss (Li & Wang, 2007). More importantly, their theoretical study is based on the assumption that the covariance matrix is diagonal.
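The utility $\tilde\omega_i$ is straightforward to compute in vectorised form; the sketch below (standardising the predictors first, as assumed above) is our own illustration rather than the authors' code.

```python
import numpy as np

def sirs_utilities(X, Y):
    """Screening utilities of Zhu et al. (2011):
    omega_i = (1/n) sum_j [ (1/n) sum_k X_ki 1(Y_k < Y_j) ]^2,
    computed on standardised predictors."""
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardise each predictor
    Ind = (Y[:, None] < Y[None, :]).astype(float)  # Ind[k, j] = 1(Y_k < Y_j)
    Omega = Z.T @ Ind / n                          # Omega[i, j] = (1/n) sum_k Z_ki 1(Y_k < Y_j)
    return (Omega**2).mean(axis=1)                 # omega_i

# rank the predictors by omega and keep the top ones (or those above a threshold)
```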

Before discussing the consistency property, we need some conditions. We take those of Yu, Dong and Shao (2016) as an example.

(C1)

The coverage condition: $\mathrm{Span}\{\Sigma^{-1}E(X\mid Y\in J_\ell),\ \ell=1,\ldots,H\}=\mathcal{S}_{Y|X}$.

(C2)

There exist $0<c<1/4$ and $0<q<\infty$ such that $E\{\exp(tX_k)\}\le q$ for all $|t|\le c$, $k=1,\ldots,p$. In addition, there exist positive constants $\lambda_{\min}$ and $\lambda_{\max}$ such that $0<\lambda_{\min}\le\lambda_{\min}(\Sigma)\le\lambda_{\max}(\Sigma)\le\lambda_{\max}<\infty$, where $\lambda_{\min}(\Sigma)$ and $\lambda_{\max}(\Sigma)$ are the smallest and largest eigenvalues of $\Sigma$, respectively.

(C3)

There exists $0<f<\infty$ such that $\|\Sigma^{-1}\|_1\le f$.

(C4)

There exists $0<g<1-2r$ such that $f^2s^2\log p=O_p(n^g)$, where $s$ is the cardinality of $A$ and $r$ is specified in condition (C5).

(C5)

There exist $0<a_2<\infty$ and $r<1/2$ such that $\min_{k\in A}m_k>2a_2n^{-r}$.

For more details, please refer to Yu, Dong and Shao (2016).

Theorem 4.1

Assume the above conditions hold. Then the estimated active set $\hat A$ satisfies the screening consistency property, $\Pr(A\subseteq\hat A)\to 1$.

Theorem 4.1 is given in Yu, Dong and Shao (2016), Yu, Dong and Zhu (2016), Lin et al. (2017) and Zhu et al. (2011).

5. Minimax rate

Recently, an impressive range of penalised SIR methods has been proposed to estimate the central subspace in a sparse fashion. However, few of them consider sparse sufficient dimension reduction from a decision-theoretic point of view. To address this issue, Tan et al. (2020) established the minimax rates of convergence for estimating the sparse SIR directions under various loss functions commonly used in the sufficient dimension reduction literature. Lin et al. (2019) introduced a simple Lasso regression method to obtain an estimator of the sufficient dimension reduction space, which is only the first step of Tan et al. (2020). Moreover, Tan et al. (2020) discovered the possible trade-off between statistical guarantee and computational performance for sparse SIR and proposed an adaptive estimation scheme for sparse SIR which is computationally tractable and rate optimal under the condition $\log p=o(n)$, which is weaker than that of Lin et al. (2019).

As noted above, the kernel matrix $M_{\mathrm{SIR}}$ can be estimated by $\hat M_{\mathrm{SIR}}=\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell$. It is then natural to estimate $\beta$ by replacing $M$ and $\Sigma$ in (8) with their sample estimators, which yields
(14) $\hat\beta_{\mathrm{SIR}}=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top\hat M_{\mathrm{SIR}}B)$ s.t. $B^\top\hat\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$.
The solution $\hat\beta_{\mathrm{SIR}}$ of (14) is called the natural sparse SIR estimator. The following theorems establish the lower bound and the upper bound of the four loss functions for the natural sparse SIR estimator.
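For a fixed support $S$, the maximum of $\mathrm{Tr}(B^\top\hat M B)$ subject to $B^\top\hat\Sigma B=I_d$ equals the sum of the $d$ largest generalised eigenvalues of $(\hat M_S,\hat\Sigma_S)$, so the natural sparse SIR estimator amounts to an exhaustive search over supports. The toy implementation below makes this explicit and, at the same time, makes clear why the estimator is computationally infeasible beyond small $p$; it is our own illustration, not the algorithm of Tan et al. (2020).

```python
import numpy as np
from itertools import combinations
from scipy.linalg import eigh

def natural_sparse_sir(M_hat, Sigma_hat, s, d):
    """Exhaustive-search version of (14); assumes d <= s."""
    p = M_hat.shape[0]
    best, best_S, best_vecs = -np.inf, None, None
    for S in combinations(range(p), s):
        S = list(S)
        # generalized eigenvalues of (M_S, Sigma_S), ascending order
        vals, vecs = eigh(M_hat[np.ix_(S, S)], Sigma_hat[np.ix_(S, S)])
        obj = vals[-d:].sum()                  # max of Tr(B' M B) with support S
        if obj > best:
            best, best_S, best_vecs = obj, S, vecs[:, -d:]
    B = np.zeros((p, d))
    B[best_S, :] = best_vecs                   # embed the support-restricted solution
    return B, best_S
```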

Theorem 5.1

Assume $n\lambda^2\ge C_0\log\frac{ep}{s}$ for some sufficiently large constant $C_0$. Then there exist positive constants $C$ and $c_0$ such that $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}E_PL_G(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0$, $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_P(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$ and $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_X(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$, where $\mathcal{P}=\mathcal{P}(n,H,s,p,d,\lambda;K,m)$.

Theorem 5.2

Assume that $\frac{s\log(ep/s)}{n\lambda^2}\le c$ for some small constant $c\in(0,1)$. Then for any $C'>0$, there exists a positive constant $C$ such that $\max\{L_G(\hat\beta,\beta),L_P(\hat\beta,\beta),L_X(\hat\beta,\beta),L_C(\hat\beta,\beta)\}\le C\frac{s\log(ep/s)}{n\lambda^2}$ with probability greater than $1-2\exp\{-C'(s+\log(ep/s))\}$ uniformly over $P\in\mathcal{P}(n,H,s,p,d,\lambda;K,m)$.

Since SIR can be rewritten in a least-squares formulation, they finally proposed an adaptive estimation scheme for sparse SIR that is computationally tractable and rate optimal. More details about the adaptive sparse SIR estimator can be found in Tan et al. (2020).

6. Further investigation

6.1. Marginal utility

Motivated by Yu, Dong and Shao (2016), we can extend their method to SAVE and DR. Let $\phi_\ell=E\{XX^\top\mathbb{1}(Y\in J_\ell)\}$. Then $M_{\mathrm{SAVE}}=\sum_{\ell=1}^Hp_\ell\{\Sigma-\mathrm{var}(X\mid Y\in J_\ell)\}^2$ can be written as $M_{\mathrm{SAVE}}=\sum_{\ell=1}^Hp_\ell\{\Sigma-\phi_\ell/p_\ell+\mu_\ell\mu_\ell^\top/p_\ell^2\}^2$. Therefore $m_k^{\mathrm{SAVE}}=e_k^\top\Sigma^{-1}\sum_{\ell=1}^Hp_\ell\{\Sigma-\phi_\ell/p_\ell+\mu_\ell\mu_\ell^\top/p_\ell^2\}^2\Sigma^{-1}e_k$. The marginal utility $m_k$ is estimated by $\hat m_k^{\mathrm{SAVE}}=e_k^\top\hat\Sigma^{-1}\sum_{\ell=1}^H\hat p_\ell\{\hat\Sigma-\hat\phi_\ell/\hat p_\ell+\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell^2\}^2\hat\Sigma^{-1}e_k$, where $\hat\phi_\ell=\sum_{i=1}^nX_iX_i^\top\mathbb{1}(Y_i\in J_\ell)/n$. For a given threshold $b_n$, the active set $A$ is estimated by including the predictors whose $\hat m_k^{\mathrm{SAVE}}$ exceeds $b_n$, i.e., $\hat A=\{k:\hat m_k^{\mathrm{SAVE}}\ge b_n\}$.

Next, we consider marginal DR with the Dantzig selector. Then $M_{\mathrm{DR}}=2\sum_{\ell=1}^Hp_\ell\{E(XX^\top\mid Y\in J_\ell)\}^2+2\{\sum_{\ell=1}^Hp_\ell E(X\mid Y\in J_\ell)E(X^\top\mid Y\in J_\ell)\}^2+2\{\sum_{\ell=1}^Hp_\ell E(X^\top\mid Y\in J_\ell)E(X\mid Y\in J_\ell)\}\{\sum_{\ell=1}^Hp_\ell E(X\mid Y\in J_\ell)E(X^\top\mid Y\in J_\ell)\}-2\Sigma$ can be written as $M_{\mathrm{DR}}=2\sum_{\ell=1}^Hp_\ell(\phi_\ell/p_\ell-\Sigma)^2+2\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}^2+2\{\sum_{\ell=1}^H\mu_\ell^\top\mu_\ell/p_\ell\}\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}$. Therefore $m_k^{\mathrm{DR}}=e_k^\top\Sigma^{-1}[2\sum_{\ell=1}^Hp_\ell(\phi_\ell/p_\ell-\Sigma)^2+2\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}^2+2\{\sum_{\ell=1}^H\mu_\ell^\top\mu_\ell/p_\ell\}\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}]\Sigma^{-1}e_k$. The marginal utility $m_k$ is estimated by $\hat m_k^{\mathrm{DR}}=e_k^\top\hat\Sigma^{-1}[2\sum_{\ell=1}^H\hat p_\ell(\hat\phi_\ell/\hat p_\ell-\hat\Sigma)^2+2\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}^2+2\{\sum_{\ell=1}^H\hat\mu_\ell^\top\hat\mu_\ell/\hat p_\ell\}\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}]\hat\Sigma^{-1}e_k$. For a given threshold $b_n$, the active set $A$ is estimated by including the predictors whose $\hat m_k^{\mathrm{DR}}$ exceeds $b_n$, i.e., $\hat A=\{k:\hat m_k^{\mathrm{DR}}\ge b_n\}$. Following the proof of Yu, Dong and Shao (2016), we can expect marginal SAVE and DR with the Dantzig selector to achieve selection consistency.
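A sketch of the marginal SAVE utilities described above, estimated by slicing $Y$, is given below; the slicing scheme and $H$ are illustrative choices, and the DR version can be assembled from the same slice quantities $\hat p_\ell$, $\hat\mu_\ell$ and $\hat\phi_\ell$.

```python
import numpy as np

def marginal_save_utilities(X, Y, H=10):
    """Estimate the marginal SAVE utilities m_k^SAVE by slicing Y (X is centred first)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    Sigma_inv = np.linalg.inv(Sigma)
    K = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Y), H):
        p_l = len(idx) / n                                 # hat{p}_l
        mu_l = Xc[idx].sum(axis=0) / n                     # hat{mu}_l
        phi_l = Xc[idx].T @ Xc[idx] / n                    # hat{phi}_l
        A = Sigma - phi_l / p_l + np.outer(mu_l, mu_l) / p_l**2
        K += p_l * (A @ A)                                 # p_l {Sigma - var(X|Y in J_l)}^2
    return np.diag(Sigma_inv @ K @ Sigma_inv)              # m_k^SAVE for k = 1, ..., p
```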

6.2. Minimax rate

Motivated by Tan et al. (2020), we can further investigate the natural sparse SAVE estimator and its upper error bound. Let $E_nX=\frac{1}{n}\sum_{i=1}^nX_i$ and $\hat\Sigma=\frac{1}{n-1}\sum_{i=1}^n(X_i-E_nX)(X_i-E_nX)^\top$ be the sample mean and sample covariance of $X$. The SAVE kernel matrix $M$ is then estimated by $\hat M_{\mathrm{SAVE}}=\sum_{\ell=1}^H\hat p_\ell\{\hat\Sigma-\hat\phi_\ell/\hat p_\ell+\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell^2\}^2$. Similarly, the DR kernel matrix is estimated by $\hat M_{\mathrm{DR}}=2\sum_{\ell=1}^H\hat p_\ell(\hat\phi_\ell/\hat p_\ell-\hat\Sigma)^2+2\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}^2+2\{\sum_{\ell=1}^H\hat\mu_\ell^\top\hat\mu_\ell/\hat p_\ell\}\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}$. It is then natural to estimate $\beta$ by replacing $M$ and $\Sigma$ in (8) with their sample estimators, which yields
(15) $\hat\beta_{\mathrm{SAVE}}=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top\hat M_{\mathrm{SAVE}}B)$ s.t. $B^\top\hat\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$; $\quad\hat\beta_{\mathrm{DR}}=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top\hat M_{\mathrm{DR}}B)$ s.t. $B^\top\hat\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$.
The solutions $\hat\beta_{\mathrm{SAVE}}$ and $\hat\beta_{\mathrm{DR}}$ of (15) are called the natural sparse SAVE and DR estimators. The following theorems establish the lower bound and the upper bound of the four loss functions for the natural sparse SAVE and DR estimators.

Theorem 6.1

Assume $n\lambda^2\ge C_0\log\frac{ep}{s}$ for some sufficiently large constant $C_0$. Then there exist positive constants $C$ and $c_0$ such that $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}E_PL_G(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0$, $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_P(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$ and $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_X(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$, where $\mathcal{P}=\mathcal{P}(n,H,s,p,d,\lambda;K,m)$.

Theorem 6.2

Assume that $\frac{s\log(ep/s)}{n\lambda^2}\le c$ for some small constant $c\in(0,1)$. Then for any $C'>0$, there exists a positive constant $C$ such that $\max\{L_G(\hat\beta,\beta),L_P(\hat\beta,\beta),L_X(\hat\beta,\beta),L_C(\hat\beta,\beta)\}\le C\frac{s\log(ep/s)}{n\lambda^2}$ with probability greater than $1-2\exp\{-C'(s+\log(ep/s))\}$ uniformly over $P\in\mathcal{P}(n,H,s,p,d,\lambda;K,m)$. Here $\hat\beta$ is constructed as in (15).

Following Tan et al. (2020), $\hat\beta$ in (15) is rate optimal under the general loss, the projection loss and the prediction loss. Moreover, the natural sparse SAVE estimator $\hat\beta_{\mathrm{SAVE}}$ and the natural sparse DR estimator $\hat\beta_{\mathrm{DR}}$ can be regarded as rate-optimal estimators of the SAVE and DR directions, respectively. However, the estimation procedure (15) depends on the unknown sparsity parameter $s$ and is computationally infeasible, as it involves an exhaustive search over all $B\in\mathbb{R}^{p\times d}$ subject to the sparsity constraint. Tan et al. (2020) defined a refined sparse SIR estimator based on the fact that SIR can be viewed as transformation-based projection pursuit. Since SAVE and DR cannot be rewritten in a least-squares formulation, we do not define refined sparse SAVE and DR estimators.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research is supported by the National Natural Science Foundation of China Grant 11971170, the 111 project B14019 and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

Notes on contributors

Lu Li

Lu Li is currently a Ph.D student at School of Statistics, East China Normal University.

Xuerong Meggie Wen

Dr Xuerong Meggie Wen is currently an associate professor of Statistics at Dept. of Mathematics and Statistics, Missouri University of Science and Technology.

Zhou Yu

Dr Zhou Yu is a Professor of Statistics at School of Statistics, East China Normal University.

References
