Review Article

A selective overview of sparse sufficient dimension reduction

Pages 121-133 | Received 19 Jan 2020, Accepted 23 Sep 2020, Published online: 10 Nov 2020

Abstract

High-dimensional data analysis has been a challenging issue in statistics. Sufficient dimension reduction aims to reduce the dimension of the predictors by replacing the original predictors with a minimal set of their linear combinations without loss of information. However, the estimated linear combinations generally involve all of the variables, which makes them difficult to interpret. To circumvent this difficulty, sparse sufficient dimension reduction methods have been proposed to conduct model-free variable selection or screening within the framework of sufficient dimension reduction. In this paper, we review the current literature on sparse sufficient dimension reduction and present some further investigations.

1. Introduction

The rapid development of data collection technology in areas such as biology, financial econometrics and signal processing has posed a great challenge for traditional multivariate analysis. High-dimensional data analysis has become ubiquitous and increasingly important. Dimension reduction, and in particular sufficient dimension reduction for regression, offers an appealing avenue to tackle high-dimensional problems. It is often desirable to reduce the dimensionality of the problem by replacing the original high-dimensional data with a few linear combinations of the predictors, whose number is usually much smaller than the original dimension. Although sufficient dimension reduction is an effective way to extract relevant information from high-dimensional data sets while retaining the important features or patterns in the data, the linear combinations usually involve all of the original predictors, which makes interpretation difficult. This limitation can be overcome via variable selection, where a subset of relevant predictor variables is selected. Removing the irrelevant variables not only reduces the noise in estimation and alleviates collinearity, but also reduces the computational cost incurred by high-dimensional data.

Variable selection is one of the most important dimension reduction approaches, and many variable selection methods have been developed. The most popular variable selection approaches are developed under the linear model or generalised linear model paradigm, such as the nonnegative garrotte (Breiman, 1995), the least absolute shrinkage and selection operator (Lasso, hereafter) (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD, hereafter) (Fan & Li, 2001), the adaptive Lasso (Zou, 2006), the group Lasso (Yuan & Lin, 2006), the Dantzig selector (Candes & Tao, 2007) and the minimax concave penalty (MCP, hereafter) (Zhang, 2010).

These model-based variable selection methods assume that the underlying true model is known up to a finite-dimensional parameter or that the imposed working model is usefully similar to the true model. However, the true model might take a complex form and is usually unknown. If the underlying modelling assumption is violated, these variable selection methods might fail. Hence, model-free variable selection methods, which do not require full knowledge of the underlying true model, are called for. It has been shown that the general framework of sufficient dimension reduction (SDR, hereafter) is useful for variable selection (Bondell & Li, 2009), since no pre-specified model between the response and the predictors is required. Thus model-free variable selection can be achieved through the framework of SDR (Cook, 1998; Li, 1991, 2000).

Let $X=(X_1,\ldots,X_p)^\top$ be the predictor and $Y$ be the scalar response. The goal of variable selection is to seek the smallest subset of the predictors $X_A$, with partition $X=(X_A^\top,X_{A^c}^\top)^\top$, such that
(1) $Y \perp\!\!\!\perp X_{A^c} \mid X_A$.
Here $A$ denotes a subset of indices of $\{1,\ldots,p\}$ corresponding to the relevant predictor set $X_A$, and $A^c$ is the complement of $A$, i.e., $X_A=\{X_i: i\in A\}$ and $X_{A^c}=\{X_i: i\in A^c\}$. Condition (1) implies that $X_A$ contains all the active predictors in terms of predicting $Y$. The existence and uniqueness of $A$ were discussed in detail in Yin and Hilafu (2015). Ideally, we want to find the smallest index set $A$ satisfying (1), in which case $X_A$ contains no inactive predictors.

Model-free variable selection is closely related to sufficient dimension reduction, which aims to find $\beta\in\mathbb{R}^{p\times d}$ with $d\le p$, such that
(2) $Y \perp\!\!\!\perp X \mid \beta^\top X$,
that is, $Y$ is independent of $X$ conditional on $\beta^\top X$. The column space of such a $\beta$, $\mathrm{Span}(\beta)$, is called a dimension reduction space. Under mild assumptions, such as those given in Cook (1996) and Yin et al. (2008), the intersection of all such spaces is itself a dimension reduction space. In this case, we call the intersection the central subspace for the regression of $Y$ on $X$, and denote it by $\mathcal{S}_{Y|X}$. Its dimension, $d=\dim(\mathcal{S}_{Y|X})$, is usually much smaller than $p$, the dimension of the original predictor. Following the partition of $X$, we can partition $\beta$ accordingly as $\beta=(\beta_A^\top,\beta_{A^c}^\top)^\top$ with $\beta_A\in\mathbb{R}^{|A|\times d}$ and $\beta_{A^c}\in\mathbb{R}^{(p-|A|)\times d}$, where $|A|$ is the cardinality of $A$. Hence, (1) is equivalent to $\beta_{A^c}=0$.

Many methods have been proposed in the literature for estimating the basis of $\mathcal{S}_{Y|X}$, including sliced inverse regression (SIR, hereafter) (Li, 1991), sliced average variance estimation (SAVE, hereafter) (Cook & Weisberg, 1991), principal Hessian directions (PHD, hereafter) (Li, 1992), minimum average variance estimation (MAVE, hereafter) (Xia et al., 2002), directional regression (DR, hereafter) (Li & Wang, 2007), principal fitted components (PFC, hereafter) (Cook & Forzani, 2008), the semiparametric approach (Ma & Zhu, 2012), etc. Several methods have also been suggested for simultaneously selecting the contributing predictors. These include shrinkage SIR (Ni et al., 2005), sparse SIR (Li, 2007; Li & Nachtsheim, 2006), sparse SAVE and sparse PHD (Li, 2007), constrained canonical correlation (Zhou & He, 2008), the general shrinkage strategy for inverse regression estimation (Bondell & Li, 2009), the regularised SIR estimator with the SCAD penalty (Wu & Li, 2011), coordinate-independent sparse estimation (CISE, hereafter) (Chen et al., 2010), conditional covariance minimisation (Chen et al., 2017), etc.

Although the aforementioned methods can select the significant predictors without assuming an underlying parametric model, they are not designed for $p\gg n$ problems, in which the number of predictor variables is larger than the number of observations. The so-called large-$p$-small-$n$ problems are increasingly common with rapid technological advances in data collection and have attracted considerable research interest. We hereby give a very brief review of model-free variable selection via the sufficient dimension reduction approach under the $p\gg n$ setting. Li and Yin (2008) proposed sparse ridge SIR, which combines SIR with both $\ell_1$- and $\ell_2$-regularisation to achieve dimension reduction and variable selection simultaneously, even when $p>n$. Yu et al. (2013) suggested combining SIR with the Dantzig selector (Candes & Tao, 2007) to recover the central subspace in general semiparametric models. A non-asymptotic error bound for the resulting estimator is derived, and the error bound is of order $O_p\{(\log p/n)^{1/2}\}$, which appears to be optimal. Moreover, they proposed another regularised version of SIR with the adaptive Dantzig selector. The resulting estimators defined from variable selection are asymptotically normal even when the predictor dimension diverges to infinity. It is worth mentioning that $|A|$ is fixed in Yu et al. (2013). Yu, Dong and Zhu (2016) proposed trace pursuit for model-free variable selection under the sufficient dimension reduction paradigm. Two distinct algorithms were proposed: stepwise trace pursuit (STP, hereafter) and forward trace pursuit (FTP, hereafter). Stepwise trace pursuit achieves selection consistency with fixed $p$ and is applicable in the setting with $p>n$. Furthermore, forward trace pursuit can serve as an initial screening step to speed up the computation in the case of ultrahigh dimensionality. Li and Dong (2020) extended the trace pursuit method of Yu, Dong and Zhu (2016) to matrix-valued predictors. To test the importance of rows, columns and submatrices of the predictor matrix in terms of predicting the response, three types of hypotheses are formulated under a unified framework. The asymptotic properties of the test statistics under the null hypothesis are established, and a permutation testing algorithm is also introduced to approximate the distribution of the test statistics. Tan et al. (2018) developed a convex formulation for fitting sparse SIR in high dimensions. They solved the resulting convex optimisation problem via a linearised alternating direction method of multipliers algorithm and established an upper bound on the subspace distance between the estimated and the true subspaces. Unlike Yu et al. (2013), Lin et al. (2019) allowed $|A|$ to diverge to infinity. By constructing artificial response variables made up from the top eigenvectors of the estimated conditional covariance matrix, Lin et al. (2019) introduced a simple Lasso regression method to obtain an estimator of the sufficient dimension reduction space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2c^2)$, where $c$ is the generalised signal-to-noise ratio; this is only the first step of Tan et al. (2020). Moreover, Tan et al. (2020) discovered the possible trade-off between statistical guarantee and computational performance for sparse SIR and proposed an adaptive estimation scheme for sparse SIR which is computationally tractable and rate optimal under the condition $\log p=o(n)$, which is weaker than that of Lin et al. (2019).

There is considerable literature on applying sufficient dimension reduction to model-free selection, but the study of screening consistency in the ultrahigh-dimensional setting is still lacking. To fill this gap, Zhu et al. (2011) proposed a variable screening procedure under a unified model framework, which contains a wide variety of commonly used parametric and semiparametric models. The new method does not require imposing a specific model structure on the regression function and thus is particularly appealing for ultrahigh-dimensional regressions. They also showed that the proposed method achieves screening consistency even with the number of predictors growing at an exponential rate of the sample size. Yu, Dong and Shao (2016) proposed an approach called marginal SIR for model-free variable selection. Furthermore, marginal SIR with the Dantzig selector exploits the sparsity structure in the marginal utility and achieves the desirable selection consistency property. Lin et al. (2017) first introduced a large class of models depending on the smallest non-zero eigenvalue of the kernel matrix of SIR and then derived the minimax rate for estimating the central space; theirs is the first paper to study the minimax estimation of sparse SIR. However, they only considered the projection loss (Li & Wang, 2007). More importantly, their theoretical study is based on the assumption that the covariance matrix is diagonal. As far as we know, most of the work mentioned above focuses mainly on SIR with consistency in variable selection. Qian et al. (2019) provided a simultaneous analysis for PFC and SAVE. Furthermore, their approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with $n$. To deliver the most essential messages, in the following sections we focus our discussion on the papers mentioned above.

2. Review of sufficient dimension reduction

Sufficient dimension reduction aims to find the column space of $\beta$ with the smallest dimension $d$. In other words, sufficient dimension reduction is formulated as a problem of estimating a space, instead of the classic statistical problem of estimating parameters. As mentioned in the introduction, there are many approaches in the sufficient dimension reduction literature for estimating the column space of $\beta$: sliced inverse regression (SIR; Li, 1991), sliced average variance estimation (SAVE; Cook & Weisberg, 1991), minimum average variance estimation (MAVE; Xia et al., 2002), the $k$th moment estimation (Yin & Cook, 2002, 2003), inverse regression (Cook & Ni, 2005), directional regression (DR; Li & Wang, 2007), sliced regression (SR; Wang & Xia, 2008), likelihood acquired directions (LAD; Cook & Forzani, 2009), semiparametric approaches (Ma & Zhu, 2012, 2013a, 2013b, 2014), etc. We mainly review three inverse regression-based methods (SIR, SAVE and DR) for estimating $\mathcal{S}_{Y|X}$ for our subsequent investigation.

Inverse regression methods constitute the oldest class of dimension reduction methods and are still under active development. The main idea of inverse regression is to reverse the roles of the response and the predictors (Li, 1991). Instead of considering distributions or expectations of functions of $Y$ conditional on $X$, which suffer from the curse of dimensionality when $X$ is high dimensional, inverse regression-based methods consider expectations of functions of $X$ conditional on $Y$, which is a low-dimensional problem because $Y$ is univariate. The inverse regression-based methods are often based on additional assumptions on the predictors that link the low-dimensional problem to the original high-dimensional problem. These additional assumptions are given as follows.

(W1)

linearity condition: $E(X\mid\beta^\top X)=\Sigma\beta(\beta^\top\Sigma\beta)^{-1}\beta^\top X$;

(W2)

constant variance condition: $\mathrm{cov}(X\mid\beta^\top X)=\Sigma-\Sigma\beta(\beta^\top\Sigma\beta)^{-1}\beta^\top\Sigma$,

where $\Sigma=\mathrm{cov}(X)$. It is well known that SIR only requires condition (W1) to hold, whereas SAVE and DR need both conditions.

When the linearity condition and the constant variance condition are satisfied, the inverse regression methods formulate the problem of estimating $\mathcal{S}_{Y|X}$ as an eigen-decomposition problem. Let $M$ be the kernel matrix of a specific inverse regression based dimension reduction method. For the sufficient dimension reduction methods that aim to estimate $\mathcal{S}_{Y|X}$, the kernel matrices corresponding to the three most well-known inverse regression methods are summarised below:
SIR: $M_{\mathrm{SIR}}=\mathrm{var}\{E(X\mid Y)\}$;
SAVE: $M_{\mathrm{SAVE}}=E\{\Sigma-\mathrm{var}(X\mid Y)\}^2$;
DR: $M_{\mathrm{DR}}=2E^2\{E(X\mid Y)E(X^\top\mid Y)\}+2E\{E(X^\top\mid Y)E(X\mid Y)\}E\{E(X\mid Y)E(X^\top\mid Y)\}+2E\{E^2(XX^\top\mid Y)\}-2\Sigma$.
Assuming $d=\dim(\mathcal{S}_{Y|X})$ is known, the procedure performs a generalised eigenvalue decomposition of the kernel matrix $M$, that is, $M\beta_i=\lambda_i\Sigma\beta_i$, with $\beta_i^\top\Sigma\beta_j=1$ if $i=j$ and $\beta_i^\top\Sigma\beta_j=0$ if $i\ne j$, where $i=1,\ldots,p$ and $\lambda_1\ge\cdots\ge\lambda_d>0=\lambda_{d+1}=\cdots=\lambda_p$ are the eigenvalues. The eigenvectors corresponding to the nonzero eigenvalues, $\beta=(\beta_1,\ldots,\beta_d)$, form a basis of $\mathcal{S}_{Y|X}$. Thus the sufficient dimension reduction directions $\beta$ can also be identified through the following optimisation problem (Tan et al., 2020):
(3) $\hat\beta=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top M B)$ s.t. $B^\top\Sigma B=I_d$.
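To make the eigen-decomposition concrete, the following minimal Python sketch estimates the SIR kernel matrix by slicing $Y$ and then solves the generalised eigenproblem in (3). The equal-size slicing scheme, the number of slices $H$ and the function name are illustrative choices rather than part of any of the cited proposals.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, Y, H=10, d=2):
    """Sliced inverse regression: estimate M = var{E(X|Y)} by slicing Y and
    solve the generalized eigenproblem M b = lambda * Sigma b."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Y), H):   # slices of (roughly) equal size
        p_h = len(idx) / n                          # slice proportion
        m_h = Xc[idx].mean(axis=0)                  # slice mean of the centred predictors
        M += p_h * np.outer(m_h, m_h)               # accumulates var{E(X|Y)}
    # eigh solves M b = lambda Sigma b with eigenvalues in ascending order;
    # the returned eigenvectors already satisfy b' Sigma b = 1
    vals, vecs = eigh(M, Sigma)
    return vecs[:, ::-1][:, :d], vals[::-1][:d]
```

The columns of the returned matrix play the role of $\hat\beta$ in (3); the SAVE or DR kernel matrices can be plugged into the same generalised eigenproblem.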

3. The current literature of variable selection via sufficient dimension reduction

3.1. Oracle property under the setting p<n

In the general framework of condition (1), the shrinkage SIR method was developed in Ni et al. (2005) by applying the Lasso approach to SIR. When a subset of predictors is irrelevant, the corresponding rows of the estimate of $\beta$ are shrunk to zero, and variable selection is thereby achieved. Let $\alpha=(\alpha_1,\ldots,\alpha_p)^\top$, with $\alpha_i\in\mathbb{R}$, $i=1,\ldots,p$, be the shrinkage vector. Then, based on expression (3), the shrinkage vector can be estimated by minimising the following function over $\alpha$ (Ni et al., 2005):
(4) $\hat\alpha=\arg\min_\alpha n\,(\mathrm{Vec}(\hat\beta)-\mathrm{Vec}\{\mathrm{diag}(\alpha)\hat\beta\})^\top\hat M(\mathrm{Vec}(\hat\beta)-\mathrm{Vec}\{\mathrm{diag}(\alpha)\hat\beta\})$, subject to $\sum_{i=1}^p|\alpha_i|\le t$, $t\ge 0$,
where $\hat M$ is the estimator of the kernel matrix $M$. To investigate the asymptotic behaviour, we consider the Lagrangian formulation of the constrained optimisation problem. Specifically, the optimisation problem in expression (4) can be reformulated as $\hat\alpha=\arg\min_\alpha(\|U-W\alpha\|^2+\tau_n\sum_{i=1}^p|\alpha_i|)$ for some non-negative penalty constant $\tau_n$, where $U=n^{1/2}\hat M^{1/2}\mathrm{Vec}(\hat\beta)$ and $W=n^{1/2}\hat M^{1/2}\hat\beta$. Then the central dimension reduction subspace $\mathcal{S}_{Y|X}$ is estimated by $\mathrm{Span}\{\mathrm{diag}(\hat\alpha)\hat\beta\}$. Li (2007) extended the shrinkage SIR method to SAVE and PHD, where the central dimension reduction subspace is estimated in the same way as in Ni et al. (2005), and $\hat\beta$ corresponds to the estimated central dimension reduction directions of SAVE and PHD, respectively. Bondell and Li (2009) proposed a general shrinkage estimation strategy for the entire inverse regression estimation family that is capable of simultaneous sufficient dimension reduction and variable selection. They considered the adaptive Lasso, $\hat\alpha=\arg\min_\alpha\|U-W\alpha\|^2+\tau_n\sum_{i=1}^p w_i|\alpha_i|$, where $w=(w_1,\ldots,w_p)^\top$ is a known weight vector. They also demonstrated that the proposed class of shrinkage estimators has the desirable oracle property of consistency in variable selection while retaining root-$n$ estimation consistency.
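Since the Lagrangian form above is a Lasso problem in $\alpha$ once $U$ and $W$ have been formed, it can be solved with any off-the-shelf Lasso routine. The sketch below assumes $U$ and $W$ are already available and uses scikit-learn's Lasso; the column-rescaling trick for the adaptive weights is our own illustrative device, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def shrink_directions(U, W, tau, weights=None):
    """Solve min_alpha ||U - W alpha||^2 + tau * sum_i w_i |alpha_i|.

    weights=None gives the ordinary Lasso shrinkage of Ni et al. (2005);
    supplying weights gives the adaptive-Lasso variant of Bondell and Li (2009).
    """
    p = W.shape[1]
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    Wt = W / w                                   # column rescaling: a_i = w_i * alpha_i
    m = len(U)
    # sklearn's Lasso minimises (1/(2m))||U - Wt a||^2 + alpha_skl * ||a||_1,
    # so alpha_skl = tau / (2m) reproduces the objective above
    fit = Lasso(alpha=tau / (2 * m), fit_intercept=False).fit(Wt, U)
    return fit.coef_ / w                         # undo the rescaling to recover alpha
```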

However, most existing sparse dimension reduction methods mentioned above are conducted stepwise, estimating a sparse solution for a basis matrix of the central subspace column by column. Instead, Chen et al. (2010) proposed a unified one-step approach to reduce the number of variables appearing in the estimate of $\mathcal{S}_{Y|X}$. Their approach, which relies operationally on Grassmann manifold optimisation, can achieve dimension reduction and variable selection simultaneously. Additionally, the proposed method has the oracle property: under mild conditions, the proposed estimator performs asymptotically as well as if the truly irrelevant predictors were known. More importantly, Chen et al. (2010) extends Bondell and Li (2009), which combined SIR, SAVE and DR with the adaptive Lasso for variable selection. Zhou and He (2008) proposed a constrained canonical correlation procedure ($C^3$) based on imposing the $L_1$-norm constraint on the effective dimension reduction estimates in CANCOR, followed by a simple variable selection method. Using B-spline basis functions generated for the response variable, the CANCOR method (Fung et al., 2002) is asymptotically equivalent to SIR. Suppose that the range of $Y$ is a bounded interval $[a,b]$; given $k_n$ internal knots in $[a,b]$ and the spline order $m$, we generate $m+k_n$ B-spline basis functions. Under the linearity condition, CANCOR estimates a set of effective dimension reduction directions by estimating the canonical variates between the B-spline basis functions and $X$. Since the generated $m+k_n$ B-spline basis functions sum to 1, we use in CANCOR the first $m+k_n-1$ basis functions of $Y$, $\pi(Y)=(\pi_1(Y),\ldots,\pi_{m+k_n-1}(Y))^\top$. Let $\mathbf{X}=(X_1,\ldots,X_n)^\top$ and $\Pi_{n\times(m+k_n-1)}=(\pi(Y_1),\ldots,\pi(Y_n))^\top$ be the data matrices containing the predictor values and the B-spline basis function values. The CANCOR method then estimates the canonical correlations between the columns of $\mathbf{X}$ and the columns of $\Pi$. The dimensionality of the central dimension reduction subspace is selected by performing the following sequential tests on the number of non-zero canonical correlations, $H_0:\gamma_s>\gamma_{s+1}=0$ versus $H_1:\gamma_{s+1}>0$ for $s=0,1,\ldots,p-1$, where the $\gamma_s$ are the asymptotic canonical correlations between $\pi(Y)$ and $X$ in decreasing order. The dimensionality estimate for $d$ is the smallest $s$ such that $H_0$ is not rejected. The CANCOR method actually solves an optimisation problem that sequentially finds the directions $\beta$ with the maximum correlation between $\beta^\top X$ and some functions of $Y$; a sketch of this construction is given below. The procedure of Zhou and He (2008) is attractive because they demonstrated that it also has the oracle property.
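The following sketch illustrates the CANCOR construction: it builds $m+k_n$ B-spline basis functions of $Y$, drops one of them, and computes the canonical correlations between the basis matrix and $X$. The quantile-based knot placement, the default spline order and the whitening-plus-SVD implementation of canonical correlation analysis are our own illustrative choices and are not taken from Fung et al. (2002) or Zhou and He (2008).

```python
import numpy as np
from scipy.interpolate import BSpline

def cancor_directions(X, Y, kn=5, m=3, d=2):
    """Canonical correlations between X and a B-spline basis of Y (CANCOR-style sketch)."""
    n, p = X.shape
    a, b = Y.min(), Y.max()
    interior = np.quantile(Y, np.linspace(0, 1, kn + 2)[1:-1])   # kn internal knots
    t = np.r_[[a] * m, interior, [b] * m]                        # knot vector of length kn + 2m
    nbasis = kn + m                                              # m + kn basis functions
    Pi = np.column_stack([BSpline(t, np.eye(nbasis)[j], m - 1)(Y)
                          for j in range(nbasis)])[:, :-1]       # drop one (they sum to 1)
    Xc, Pc = X - X.mean(axis=0), Pi - Pi.mean(axis=0)
    Sxx, Spp, Sxp = Xc.T @ Xc / n, Pc.T @ Pc / n, Xc.T @ Pc / n
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T                # whitening: Wx' Sxx Wx = I
    Wp = np.linalg.inv(np.linalg.cholesky(Spp)).T
    U, s, _ = np.linalg.svd(Wx.T @ Sxp @ Wp)
    return Wx @ U[:, :d], s[:d]      # estimated directions for X, canonical correlations
```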

The sparse sufficient dimension reduction methods mentioned above focus on the case where $p$ is fixed. For regressions with diverging $p$, estimation and variable selection methods have also been developed within the framework of sufficient dimension reduction. Zhu et al. (2006) studied the asymptotic properties of SIR as $p$ diverges, but their result is for SIR only, and variable selection is not studied at all. Zhu and Zhu (2009a) investigated weighted partial least squares with a diverging $p$, but again variable selection is not derived. Zhu and Zhu (2009b) investigated variable selection with a diverging number of predictors through inverse regression, but focused on single-index models only. By contrast, Wu and Li (2011) established asymptotic properties for a family of inverse regression estimators that includes SIR, studied simultaneous dimension reduction and variable selection with a particular emphasis on the latter and encompassed more general forms, while the number of predictors $p$ is allowed to diverge as the sample size $n$ approaches infinity. Wu and Li (2011) adopted the SCAD-type penalty first introduced by Fan and Li (2001) and combined it with a sufficient dimension reduction estimator, that is, $\hat\alpha=\arg\min_\alpha\|U-W\alpha\|^2+\sum_{i=1}^p p_{\tau_n}(|\alpha_i|)$. The penalties $p_{\tau_n}(\cdot)$ are not necessarily the same for all $i$. Wu and Li (2011) also showed that the penalised estimator selects all truly contributing predictors and excludes all irrelevant ones with probability approaching one.
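For reference, the SCAD penalty of Fan and Li (2001) used by Wu and Li (2011) has the closed form below (with the conventional choice $a=3.7$); this is a direct transcription of the published formula, and the function names are ours.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(|theta|), applied elementwise."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def scad_derivative(theta, lam, a=3.7):
    """p'_lam(|theta|), the quantity used in local linear/quadratic approximations."""
    t = np.abs(np.asarray(theta, dtype=float))
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))
```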

Based on work on kernel dimension reduction, Chen et al. (2017) proposed a method to perform feature selection via a constrained optimisation problem; the corresponding SDR methods can be found in Fukumizu et al. (2009) and Fukumizu and Leng (2014). Many previous kernel approaches are filter methods based on the Hilbert–Schmidt Independence Criterion (HSIC, Gretton et al., 2005). Chen et al. (2017) proposed to use the trace of the conditional covariance operator as a criterion for feature selection. Let $(\mathcal{H}_1,k_1)$ denote an RKHS supported on $\mathcal{X}\subseteq\mathbb{R}^p$. Then the trace of the conditional covariance operator, $\mathrm{Tr}(\Sigma_{YY|X})$, can be interpreted as a dependence measure, as long as $\mathcal{H}_1$ is large enough. The problem of supervised feature selection then reduces to minimising the trace of the conditional covariance operator over subsets of features with controlled cardinality: $\min_{T:|T|=d}Q(T):=\mathrm{Tr}(\Sigma_{YY|X^T})$. They also showed that the empirical estimate of the criterion is consistent as the sample size increases. It is worth noting that kernel feature selection methods have the advantage of capturing nonlinear relationships between the features and the labels.
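A hedged sketch of the trace criterion is given below. It uses Gaussian kernels and one common empirical form of $\mathrm{Tr}(\Sigma_{YY|X^T})$ based on centred Gram matrices; the bandwidths, the ridge parameter and the greedy backward search are illustrative choices and not the exact algorithm of Chen et al. (2017).

```python
import numpy as np

def centred_rbf_gram(A, gamma):
    """Centred Gaussian Gram matrix H K H with K_ij = exp(-gamma ||a_i - a_j||^2)."""
    sq = np.sum(A**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * A @ A.T))
    n = len(A)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cond_cov_trace(X, Y, feats, gamma_x=1.0, gamma_y=1.0, eps=1e-3):
    """Empirical criterion Q(T) ~ eps * tr[G_Y (G_{X_T} + n*eps*I)^{-1}] for feature set T."""
    n = len(Y)
    Gx = centred_rbf_gram(X[:, list(feats)], gamma_x)
    Gy = centred_rbf_gram(np.reshape(Y, (n, -1)), gamma_y)
    return eps * np.trace(np.linalg.solve(Gx + n * eps * np.eye(n), Gy))

def backward_select(X, Y, d):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    increases the criterion the least, until d features remain."""
    feats = list(range(X.shape[1]))
    while len(feats) > d:
        scores = {j: cond_cov_trace(X, Y, [f for f in feats if f != j]) for j in feats}
        feats.remove(min(scores, key=scores.get))
    return feats
```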

Theorem 3.1

Assume $n^{1/2}\{\mathrm{Vec}(\hat\beta)-\mathrm{Vec}(\beta)\}\to N(0,\Gamma)$ for some $\Gamma>0$, and that $\hat M_n^{1/2}=M^{1/2}+o_p(1/\sqrt{n})$. Suppose that $\tau_n\to\infty$ and $\tau_n/\sqrt{n}\to 0$, with $p<n$. Then the shrinkage estimator $\hat\beta$ satisfies

  1. consistency in variable selection, $\Pr(\hat A=A)\to 1$, and

  2. asymptotic normality, $n^{1/2}\{\mathrm{Vec}(\hat\beta_A)-\mathrm{Vec}(\beta_A)\}\to N(0,\Lambda)$ for some $\Lambda>0$.

Remark 3.1

Theorem 3.1, part (1), indicates that the sparse sufficient dimension reduction estimator can select contributing predictors consistently, i.e., for all $i\notin A$ we have $\Pr(\hat\alpha_i\ne 0)\to 0$, and for all $i\in A$ we have $\Pr(\hat\alpha_i\ne 0)\to 1$. Theorem 3.1, part (2), further shows that the estimator of $\beta_A$, which corresponds to the contributing predictors, is root-$n$ consistent. The oracle property in Theorem 3.1 is given in Bondell and Li (2009), Chen et al. (2010), Wu and Li (2011) and Zhou and He (2008). Most of the methods mentioned above cannot achieve the desired property with $p>n$; however, Wu and Li (2011) showed that their proposed method can achieve selection consistency when $p$ diverges as the sample size $n$ goes to infinity. We now turn to the oracle property with $p>n$.

3.2. Oracle property under the setting p ≫ n

Large-$p$-small-$n$ problems appear frequently in fields such as biology, economics and finance. While the variable selection methods above have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and high-frequency finance further push the dimensionality of data to an even larger scale, where $p$ may grow exponentially with $n$. Such ultrahigh-dimensional data present simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability. It is difficult to directly apply the aforementioned variable selection methods to ultrahigh-dimensional statistical learning problems due to their inherent computational complexity. To reduce the predictor dimension in semiparametric regressions, Yu et al. (2013) proposed an $\ell_1$-minimisation of SIR with the Dantzig selector (Candes & Tao, 2007), which is defined as
(5) $\min\|\eta_k\|_1$ such that $\|\hat M\eta_k-\nu_k\hat\Sigma\eta_k\|_\infty\le\zeta_k$, $|\eta_k^\top\hat\Sigma\eta_k-1|\le\zeta_k$, $|\eta_k^\top\hat\Sigma\eta_l|\le\zeta_k$ $(l=1,\ldots,k-1)$,
where $k=1,\ldots,d$, $\nu_k=\eta_k^\top\hat M\eta_k$, $\|\eta\|_1=\sum_{i=1}^p|\eta_i|$ and $\eta_0$ is a $p\times 1$ zero vector. Furthermore, they established a non-asymptotic error bound for the resulting estimator when $|A|$ is fixed. Yu et al. (2013) also extended the regularisation concept to SIR with an adaptive Dantzig selector, which is defined by
(6) $\min\|W_k\eta_k\|_1$ such that $\|W_k^{-1}(\hat M\hat\beta_k^0-\hat\lambda_k^0\hat\Sigma\eta_k)\|_\infty\le\zeta_k$,
where $W_k=\mathrm{diag}(w_{k1},\ldots,w_{kp})$ is a known weight matrix and $w_{kj}$ is a specified positive value, which should vary inversely with the magnitude of $\hat\beta_{kj}^0$. Yu et al. (2013) proposed a two-step estimation procedure to select the contributing predictors. In the first step, they screen out informative predictors based on (5); this is called Dantzig selector based SIR. In the second step, they enhance the sparsity and the estimation efficiency with (6), based on the predictors selected in the first step; this is called iterative adaptive Dantzig selector based SIR. This ensures that all contributing predictors are selected with high probability and that the resulting estimator is asymptotically normal even when the predictor dimension diverges to infinity.
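The computational core of (5) and (6) is a Dantzig-selector-type linear program. The sketch below solves the generic problem $\min\|\eta\|_1$ subject to $\|A\eta-b\|_\infty\le\zeta$ with scipy's linear programming routine; the additional normalisation and orthogonality constraints of (5) are omitted, so this should be read as an illustration of the $\ell_1$/linear-programming machinery rather than as the estimator of Yu et al. (2013).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, b, zeta):
    """Solve min ||eta||_1  s.t.  ||A eta - b||_inf <= zeta  as a linear program."""
    m, p = A.shape
    # write eta = u - v with u, v >= 0 and minimise sum(u) + sum(v)
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([A, -A]),     #  A(u - v) - b <= zeta
                      np.hstack([-A, A])])    # -A(u - v) + b <= zeta
    b_ub = np.concatenate([zeta + b, zeta - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    uv = res.x
    return uv[:p] - uv[p:]
```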

However, there is a gap between the optimisation problem and the theoretical results: there is no guarantee that the estimator obtained from solving the proposed biconvex optimisation problem is the global minimum. Most existing work in the high-dimensional sufficient dimension reduction literature involves nonconvex optimisation problems. Moreover, such methods seek to estimate a set of reduced predictors that are not identifiable by definition, rather than the central subspace. Yin and Hilafu (2015) proposed a sequential approach for estimating high-dimensional SIR. Both of their proposals are stepwise procedures that do not correspond to solving a convex optimisation problem. Moreover, as discussed in Yin and Hilafu (2015), theoretical properties of their estimators are hard to establish due to the sequential procedure used to obtain them. In the high-dimensional setting, Lin et al. (2018) proposed a screening approach to perform variable selection and established an error bound for the estimators, which allows $|A|$ to go to infinity. The selected variables are then used to fit classic SIR. Furthermore, the resulting algorithm is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2c^2)$, where $c$ is the generalised signal-to-noise ratio. Tan et al. (2018) proposed a convex formulation for sparse SIR in the high-dimensional setting by adapting techniques from sparse canonical correlation analysis. Their proposal estimates the central subspace directly and performs variable selection simultaneously. Moreover, the proposed method can be adapted to sufficient dimension reduction methods that can be formulated as generalised eigenvalue problems.

As mentioned in the introduction, most of the literature focuses mainly on SIR with consistency in variable selection. Qian et al. (2019) proposed methods under a unified minimum discrepancy framework with regularisation. Consistency results for both central subspace estimation and variable selection are established simultaneously for several well-known SDR methods, including SIR, PFC and SAVE. More importantly, their approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with $n$. Unlike many high-dimensional SDR methods, their method does not necessarily require a sparsity condition on the predictor covariance matrix or an upper bound on its maximum eigenvalue. Furthermore, they developed a new algorithm that can efficiently solve a general class of high-dimensional sparse minimum discrepancy problems.

Many SDR methods can be rewritten as a minimisation problem with an objective function of the form
(7) $L_{1n}(\Gamma,V)=\mathrm{tr}\{(\gamma_n-\Sigma_n\Gamma V)^\top\Omega_n(\gamma_n-\Sigma_n\Gamma V)\}$,
where $\gamma_n$, $\Sigma_n$ and $\Omega_n$ are sample estimates of the population matrices $\tilde M$, $W$ and $W^{-1}$, respectively. Here, $\tilde M$ is a $p\times l$ kernel matrix associated with a particular SDR method, where $d\le l\le p$, $W$ is some $p\times p$ positive definite matrix, and $\Gamma\in\mathbb{R}^{p\times d}$ and $V\in\mathbb{R}^{d\times l}$ represent parameters to be estimated by minimising $L_{1n}$. The general form of (7) is an adaptation of the minimum discrepancy approach proposed by Cook and Ni (2005). To identify the correct sparsity structure of $\mathcal{S}_{Y|X}$ under $p\gg n$ scenarios, Qian et al. (2019) adopted a coordinate-independent regularisation approach and imposed the penalty $P_V(\Gamma)$ with tuning parameter $\lambda_n$ on (7) under the constraint $VV^\top=I_d$, giving the objective function $L_{2n}(\Gamma,V)=\frac{1}{2}\mathrm{tr}\{(\gamma_n-\Sigma_n\Gamma V)^\top\Omega_n(\gamma_n-\Sigma_n\Gamma V)\}+\lambda_nP_V(\Gamma)$, subject to $VV^\top=I_d$. Given its minimiser $(\hat\Gamma,\hat V)=\arg\min_{\Gamma,V}L_{2n}(\Gamma,V)$, they simultaneously estimate $\mathcal{S}_{Y|X}$ by $\mathrm{Span}(\hat\Gamma)$ and estimate $A_0$ by $\hat A_0=\{1\le j\le p: e_j^\top\hat\Gamma\hat\Gamma^\top e_j>0\}$.

Tan et al. (2020) considered the following four loss functions:

  1. General loss. $L_G(\hat\beta,\beta)=\|\hat\beta\hat\beta^\top-\beta\beta^\top\|$;

  2. Projection loss. $L_P(\hat\beta,\beta)=\|\hat\beta(\hat\beta^\top\hat\beta)^{-1}\hat\beta^\top-\beta(\beta^\top\beta)^{-1}\beta^\top\|_F^2$;

  3. Prediction loss. $L_X(\hat\beta,\beta)=\inf_{W\in\mathbb{R}^{d\times d}}\|\Sigma^{1/2}(\hat\beta-\beta W)\|_F^2$;

  4. Correlation loss. $L_C(\hat\beta,\beta)=1-\frac{1}{d}\mathrm{Tr}\{(\hat\beta^\top\Sigma\hat\beta)^{-1}(\hat\beta^\top\Sigma\beta)(\beta^\top\Sigma\beta)^{-1}(\beta^\top\Sigma\hat\beta)\}$,

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Further, Tan et al. (2020) established the minimax lower bound for sparse SIR under the general loss, the projection loss and the prediction loss. They proposed a natural sparse SIR estimator and proved that its upper error bound under all four loss functions matches the minimax lower bound, which implies that it is a rate-optimal estimator for sparse SIR. However, this optimal estimator is computationally intractable. They therefore developed a computationally feasible counterpart of the natural sparse SIR estimator through convex relaxation. Their theoretical investigation, however, suggests that this computational realisation of the natural sparse SIR estimator cannot maintain the optimal estimation rate.

To further address this issue, they proposed a refined sparse SIR estimator. The refined sparse SIR estimator is also rate-optimal yet computationally intractable; however, its computationally feasible counterpart based on an adaptive estimation procedure is proven to be nearly rate-optimal. Compared to Lasso-SIR (Lin et al., 2019), which was shown to be rate optimal only when $p=o(n^2)$, their sparse SIR approach is rate optimal even when $\log p=o(n)$. Therefore, their proposed sparse SIR estimator enjoys a much wider range of applications. The reason why Lasso-SIR fails to work when only $\log p=o(n)$ is that it requires the estimation of the eigenvalues and eigenvectors of the $p\times p$ non-sparse SIR kernel matrix. It is well known that the sample eigenvalues and eigenvectors are not even consistent when $p/n$ has a nonzero limit as $n\to\infty$. In summary, the minimax lower bound, the two rate-optimal yet computationally infeasible estimators, the two corresponding computationally tractable counterparts, and the theoretical upper bounds of the four estimators under the four loss functions together provide a thorough understanding of sparse SIR. It is also worth noting that Lin et al. (2019) is just the first step of Tan et al. (2020). Bondell and Li (2009) demonstrated that $\mathrm{Supp}(\beta)=A$, so the sparse representation of SIR relies on $|A|$, the number of truly relevant predictors, where $\mathrm{Supp}(\beta)$ denotes the support of $\beta$. Assuming $|A|\le s$, sparse SIR is further defined through seeking $\beta$ such that
(8) $\beta=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top M B)$ s.t. $B^\top\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$.
The above formulation of sparse SIR takes a similar form to that of sparse CCA (Gao et al., 2015). To obtain theoretical results, the following conditions are required.

(A1)

the conditional mean $E\{X-E(X)\mid\beta^\top X\}$ is linear in $\beta^\top X$;

(A2)

$|A_k|$ is bounded for $k=1,\ldots,d$;

(A3)

the nonzero eigenvalues $\lambda_1,\ldots,\lambda_d$ are distinct;

(A4)

there exists a positive constant $a_0$ such that $0<a_0<1/4$, $\log p/n\le a_0$ and $E(\exp[t\{X_i-E(X_i)\}^2])\le K<\infty$ for $i=1,\ldots,p$ and all $|t|\le a_0$;

(A5)

$D_0=\max_{1\le i\le p}\sum_{j=1}^p|\sigma_{ij}|$ is a bounded constant as $p\to\infty$;

(A6)

the restricted isometry and restricted orthogonality constants $\delta_{2S_k}^{A_k}$ and $\theta_{S_k,2S_k}^{A_k}$ satisfy $\delta_{2S_k}^{A_k}+\theta_{S_k,2S_k}^{A_k}<1$, where $A_k=M-\lambda_k\Sigma$.

See Yu et al. (2013) for more details.

Theorem 3.2

Suppose that Conditions (A1)–(A6) are satisfied and $\zeta_k=C_0(\log p/n)^{1/2}$. Then $\|\hat\beta_k-\beta_k\|^2\le 4C^2S_k(1-\delta_{2S_k}^{A_k}-\theta_{S_k,2S_k}^{A_k})^{-2}\frac{\log p}{n}$ with probability greater than $1-\frac{5}{8}p^{-\tau}$ for some $\tau$ greater than $(\log p)^{-1}\log\frac{5}{8}$, where $A_k$, $\delta_{2S_k}^{A_k}$ and $\theta_{S_k,2S_k}^{A_k}$ are defined in Condition (A6).

Remark 3.2

Theorem 3.2 suggests that a sparse solution can be obtained at a small price, as the squared estimation error of the regularised estimator is optimal up to a factor of $\log p$. The consistency properties established in Lin et al. (2018), Tan et al. (2018) and Tan et al. (2020) are similar to Theorem 3.2, but the threshold for $\|\hat\beta_k-\beta_k\|^2$ can be different.

4. The current literature of variable screening

Although there is a vast literature on applying sufficient dimension reduction to model-free selection, results on screening consistency in the ultrahigh-dimensional setting are scant. Therefore, many researchers have concentrated on developing methods that achieve screening consistency.

4.1. Marginal utility

Yu, Dong and Shao (2016) proposed an approach called marginal SIR for model-free variable selection. Since $M$ contains all the regression information between $Y$ and $X$, Yu, Dong and Shao (2016) considered the diagonal elements of $M$ as the marginal utilities of the corresponding predictors. Specifically, let $e_k$ be the standard unit vector in $\mathbb{R}^p$ with 1 as the $k$th element and 0 otherwise. They considered the following utility for $X_k$:
(9) $m_k=e_k^\top\Sigma^{-1}M\Sigma^{-1}e_k$.
Yu, Dong and Shao (2016) refer to $m_k$ as the population level marginal SIR utility. To apply the Dantzig selector to the estimation of the marginal SIR utility $m_k$, they defined $p_\ell=E\{\mathbb{1}(Y\in J_\ell)\}$, $\ell=1,\ldots,H$. Let $\mu_\ell=E\{X\mathbb{1}(Y\in J_\ell)\}$. Then $M_{\mathrm{SIR}}=\sum_{\ell=1}^H p_\ell E(X\mid Y\in J_\ell)E(X^\top\mid Y\in J_\ell)$ can be written as $M_{\mathrm{SIR}}=\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell$. Therefore $m_k^{\mathrm{SIR}}=e_k^\top\sum_{\ell=1}^H\nu_\ell\nu_\ell^\top/p_\ell\,e_k$, with $\nu_\ell=\Sigma^{-1}\mu_\ell$. The marginal utility $m_k$ is estimated by $\hat m_k^{\mathrm{SIR}}=e_k^\top\sum_{\ell=1}^H\hat\nu_\ell\hat\nu_\ell^\top/\hat p_\ell\,e_k$, where $\hat\Sigma=\sum_{i=1}^nX_iX_i^\top/n$ and $\hat\mu_\ell=\sum_{i=1}^nX_i\mathbb{1}(Y_i\in J_\ell)/n$. For a given threshold $b_n$, the active set $A$ is estimated by including the predictors whose $\hat m_k^{\mathrm{SIR}}$ exceeds $b_n$, i.e., $\hat A=\{k:\hat m_k^{\mathrm{SIR}}\ge b_n\}$. Yu, Dong and Shao (2016) give an example in which $X=(X_1,\ldots,X_p)^\top\sim N(0,\Sigma)$. Let $\mathrm{var}(X_i)=1$, $\mathrm{cov}(X_i,X_j)=0.6$ for $|i-j|=1$, and $\mathrm{cov}(X_i,X_j)=0$ for $|i-j|>1$, $1\le i,j\le p$. Let $Y=\beta^\top X+\varepsilon$, where $\varepsilon\sim N(0,1)$ is independent of $X$ and $\beta=(1.2,-2,0,\ldots,0)^\top$. The active set for this linear regression model is $A=\{1,2\}$. Consider five utilities for $X_1$: the marginal absolute Pearson correlation from Fan and Lv (2008), the marginal squared distance correlation utility from Li et al. (2012), the marginal fused Kolmogorov filter utility as defined in (5.3) of Yu, Dong and Shao (2016), the marginal independence SIR utility as defined in (5.1) of Yu, Dong and Shao (2016) and the marginal SIR utility as defined in (9). Unfortunately, the first four independence screening methods fail to recover the active predictor $X_1$ (its marginal correlation with $Y$ is zero because $1.2-2\times0.6=0$); only marginal SIR achieves the desired result.
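The sketch below reproduces the spirit of this example. Because a covariance matrix with unit variances, first-lag correlation 0.6 and zeros elsewhere need not be positive definite for larger $p$, the sketch uses an AR(1) covariance with $\mathrm{corr}(X_i,X_j)=0.6^{|i-j|}$, which preserves the key cancellation $\mathrm{cov}(X_1,Y)=1.2-2\times0.6=0$; the sample size, the number of slices and this modification are our own illustrative choices.

```python
import numpy as np

def marginal_sir_utilities(X, Y, H=10):
    """Estimate the marginal SIR utilities m_k of (9) by slicing Y (X is centred first)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma_inv = np.linalg.inv(Xc.T @ Xc / n)
    m = np.zeros(p)
    for idx in np.array_split(np.argsort(Y), H):
        p_l = len(idx) / n                         # hat{p}_l
        mu_l = Xc[idx].sum(axis=0) / n             # hat{mu}_l = (1/n) sum_i X_i 1(Y_i in J_l)
        nu_l = Sigma_inv @ mu_l                    # hat{nu}_l
        m += nu_l**2 / p_l                         # diagonal of nu_l nu_l' / p_l
    return m

rng = np.random.default_rng(0)
n, p = 2000, 10
Sigma = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = 1.2 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)
pearson = np.abs([np.corrcoef(X[:, k], Y)[0, 1] for k in range(p)])
print(np.round(pearson, 2))                        # the first entry is close to zero
print(np.round(marginal_sir_utilities(X, Y), 2))   # the first two entries stand out
```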

4.2. Trace pursuit

Yu, Dong and Zhu (2016) proposed trace pursuit as a novel approach for model-free variable selection. They first extended classical stepwise regression in linear models and proposed the STP algorithm for model-free variable selection. Furthermore, they proposed the FTP algorithm: after finding a solution path by adding one predictor into the model at a time, a modified Bayesian information criterion (BIC, hereafter) provides a chosen model that is guaranteed to include all important predictors. Finally, the two-stage trace pursuit algorithm uses FTP for initial variable screening.

For a working index set $F$ and an index $j\in F^c$, consider testing
(10) $H_0: Y\perp\!\!\!\perp X_j\mid X_F$ vs. $H_a$: $Y$ is not independent of $X_j$ given $X_F$.
For any index set $F$, denote $X_F=\{X_i:i\in F\}$ and $\mathrm{var}(X_F)=\Sigma_F$. Taking SIR as an example, denote $M_F^{\mathrm{SIR}}=\Sigma_F^{-1/2}\mathrm{var}\{E(X_F\mid Y)\}\Sigma_F^{-1/2}$. Recall that $A$ denotes the active index set satisfying $Y\perp\!\!\!\perp X_{A^c}\mid X_A$, and $I=\{1,\ldots,p\}$ denotes the full index set. It is worth noting that, if assumption (W1) holds true, then for any index set $F$ such that $A\subseteq F\subseteq I$, $\mathrm{tr}(M_A^{\mathrm{SIR}})=\mathrm{tr}(M_F^{\mathrm{SIR}})=\mathrm{tr}(M_I^{\mathrm{SIR}})$. This suggests that $\mathrm{tr}(M_F^{\mathrm{SIR}})$ can be used to capture the strength of the relationship between $Y$ and $X_F$. Denote $F\cup\{j\}$ as the index set consisting of $j$ together with all the indices in $F$. Given that $X_F$ is already in the model, the trace difference $\mathrm{tr}(M_{F\cup\{j\}}^{\mathrm{SIR}})-\mathrm{tr}(M_F^{\mathrm{SIR}})$ can be used to test the contribution of the additional variable $X_j$ to $Y$. The idea of using the trace difference is similar to the extra sums of squares test in the classical multiple linear regression setting. The following subset LCM assumption is required in Yu, Dong and Zhu (2016):
(11) $E(X_j\mid X_F)$ is a linear function of $X_F$ for any $F\subset I$ and $j\in F^c$.
Furthermore, they also provided the STP algorithm, which proceeds as follows (a code sketch is given after the steps):

  1. Initialisation. Set the initial working set to be $F=\emptyset$.

  2. Forward addition. Find the index $a_F$ such that $a_F=\arg\max_{j\in F^c}\mathrm{tr}(\hat M_{F\cup\{j\}}^{\mathrm{SIR}})$. If $T_{a_F|F}^{\mathrm{SIR}}=n\{\mathrm{tr}(\hat M_{F\cup\{a_F\}}^{\mathrm{SIR}})-\mathrm{tr}(\hat M_F^{\mathrm{SIR}})\}>c^{\mathrm{SIR}}$, update $F$ to be $F\cup\{a_F\}$.

  3. Backward deletion. Find the index $d_F$ such that $d_F=\arg\max_{j\in F}\mathrm{tr}(\hat M_{F\setminus\{j\}}^{\mathrm{SIR}})$. If $T_{d_F|F\setminus\{d_F\}}^{\mathrm{SIR}}=n\{\mathrm{tr}(\hat M_F^{\mathrm{SIR}})-\mathrm{tr}(\hat M_{F\setminus\{d_F\}}^{\mathrm{SIR}})\}<c^{\mathrm{SIR}}$, update $F$ to be $F\setminus\{d_F\}$.

  4. Repeat steps 2 and 3 until no predictors can be added or deleted.
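A compact sketch of the STP iteration is given below. It estimates $\mathrm{tr}(\hat M_F^{\mathrm{SIR}})$ by slicing $Y$, as above; the threshold c_sir is taken as user supplied (Yu, Dong and Zhu (2016) calibrate it from the asymptotic null distribution of the test statistic), and the slicing scheme and the iteration cap are illustrative choices.

```python
import numpy as np

def sir_trace(X, Y, F, H=10):
    """Estimate tr(M_F^SIR) = tr(Sigma_F^{-1} var{E(X_F|Y)}) by slicing Y."""
    n = len(Y)
    cols = sorted(F)
    XF = X[:, cols] - X[:, cols].mean(axis=0)
    Sigma_F = XF.T @ XF / n
    M = np.zeros((len(cols), len(cols)))
    for idx in np.array_split(np.argsort(Y), H):
        p_h = len(idx) / n
        m_h = XF[idx].mean(axis=0)
        M += p_h * np.outer(m_h, m_h)
    return np.trace(np.linalg.solve(Sigma_F, M))

def stepwise_trace_pursuit(X, Y, c_sir, H=10, max_iter=50):
    """Stepwise trace pursuit (STP) sketch: alternate forward addition and backward deletion."""
    n, p = X.shape
    F = set()
    for _ in range(max_iter):
        changed = False
        cand = [j for j in range(p) if j not in F]
        if cand:                                   # forward addition
            a_F = max(cand, key=lambda j: sir_trace(X, Y, F | {j}, H))
            base = sir_trace(X, Y, F, H) if F else 0.0
            if n * (sir_trace(X, Y, F | {a_F}, H) - base) > c_sir:
                F.add(a_F); changed = True
        if len(F) > 1:                             # backward deletion
            d_F = max(F, key=lambda j: sir_trace(X, Y, F - {j}, H))
            if n * (sir_trace(X, Y, F, H) - sir_trace(X, Y, F - {d_F}, H)) < c_sir:
                F.discard(d_F); changed = True
        if not changed:
            break
    return sorted(F)
```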

The tests for SAVE and DR can be defined in a parallel fashion if the following CCV assumption holds together with the subset LCM assumption (11): $\mathrm{var}(X_j\mid X_F)$ is nonrandom for any $F\subset I$ and $j\in F^c$. Li and Dong (2020) recently extended trace pursuit to matrix-valued predictors. Suppose the response variable $Y\in\mathbb{R}$ and the predictor $X\in\mathbb{R}^{p\times q}$ have the following general relationship:
(12) $Y=g(X)+\varepsilon$,
where $g:\mathbb{R}^{p\times q}\to\mathbb{R}$ is an unknown function, $\varepsilon$ is independent of $X$, and $E(\varepsilon)=0$. Assume that $X$ follows the matrix normal distribution, denoted as $X\sim N_{p,q}(\mu,U,V)$ with $\mu\in\mathbb{R}^{p\times q}$, $U\in\mathbb{R}^{p\times p}$ and $V\in\mathbb{R}^{q\times q}$. Then the row covariance matrix is $U=E\{(X-\mu)(X-\mu)^\top\}/\mathrm{tr}(V)$, and the column covariance matrix is $V=E\{(X-\mu)^\top(X-\mu)\}/\mathrm{tr}(U)$.

Let $I_{\mathrm{row}}=\{1,\ldots,p\}$ be the full index set of rows and $X_{j,\cdot}$ be the $j$th row of $X$ for $j=1,\ldots,p$. Define the active row set $A$ as $A=\{j\in I_{\mathrm{row}}: Y\text{ depends on }X_{j,\cdot}\text{ in model (12)}\}$. Similarly, let $I_{\mathrm{col}}=\{1,\ldots,q\}$ be the full index set of columns and $X_{\cdot,k}$ be the $k$th column of $X$ for $k=1,\ldots,q$. Define the active column set $B$ as $B=\{k\in I_{\mathrm{col}}: Y\text{ depends on }X_{\cdot,k}\text{ in model (12)}\}$. Based on the active row and column predictors, model (12) can be expressed as $Y=g(X_{A,B})+\varepsilon$, where $g:\mathbb{R}^{|A|\times|B|}\to\mathbb{R}$, with $|\cdot|$ denoting the cardinality of a set, and $X_{A,B}$ denotes the submatrix of $X$ that contains the active rows indexed by $A$ and the active columns indexed by $B$. Note that $Y$ depends on $X$ only through $X_{A,B}$. Li and Dong (2020) introduced procedures to recover the active row set $A$ in detail. Let $X_{j,\cdot}$, $j=1,\ldots,p$, be the $j$th row of $X$ and $X_{-j,\cdot}\in\mathbb{R}^{(p-1)\times q}$ be the matrix that includes all but the $j$th row of $X$. To test the importance of $X_{j,\cdot}$, they considered the following row hypotheses:
(13) $H_{0,\{j\}}^{\mathrm{row}}: Y\perp\!\!\!\perp X\mid X_{-j,\cdot}$ vs. $H_{a,\{j\}}^{\mathrm{row}}$: $Y$ is not independent of $X$ given $X_{-j,\cdot}$.
Under the null hypothesis $H_{0,\{j\}}^{\mathrm{row}}$, the response $Y$ depends on $X$ only through $X_{-j,\cdot}$. In the special case of $q=1$, $X$ becomes a $p$-dimensional vector, and (13) is equivalent to testing the importance of one component of $X$ given the other $p-1$ predictors. This special case is known as the marginal coordinate test (Cook, 2004). Let $U_{-j,-j}\in\mathbb{R}^{(p-1)\times(p-1)}$ be the submatrix of $U$ that excludes the $j$th row and the $j$th column of $U$. Define the following quantity: $\delta_j^{\mathrm{row}}=\mathrm{tr}(M)-\mathrm{tr}(M_{-j,\cdot})$, where $M=U^{-1}E(XY)V^{-1}E(X^\top Y)$ and $M_{-j,\cdot}=U_{-j,-j}^{-1}E(X_{-j,\cdot}Y)V^{-1}E(X_{-j,\cdot}^\top Y)$. This trace difference $\delta_j^{\mathrm{row}}$ is the key quantity for testing the importance of the $j$th row of $X$, in the same spirit as Yu, Dong and Zhu (2016). Note that $\delta_j^{\mathrm{row}}=0$ under $H_{0,\{j\}}^{\mathrm{row}}$.

To develop screening consistency in the ultrahigh-dimensional setting, Zhu et al. (2011) proposed a novel variable screening procedure under a unified model framework, which covers a wide variety of commonly used parametric and semiparametric models. They assumed that $E(X_i)=0$ and $\mathrm{var}(X_i)=1$ for $i=1,\ldots,p$ and defined $\Omega(y)=E\{XF(y\mid X)\}$, where $F(y\mid X)=P(Y<y\mid X)$, for ease of explanation. It then follows from the law of iterated expectations that $\Omega(y)=\mathrm{cov}\{X,\mathbb{1}(Y<y)\}$. Let $\Omega_i(y)$ be the $i$th element of $\Omega(y)$, and define $\omega_i=E\{\Omega_i^2(Y)\}$, $i=1,\ldots,p$. Then $\omega_i$ serves as the population quantity of the proposed marginal utility measure for predictor ranking. Intuitively, if $X_i$ and $Y$ are independent, then $X_i$ and the indicator function $\mathbb{1}(Y<y)$ are uncorrelated for every $y$, and consequently $\omega_i=0$. On the other hand, if $X_i$ and $Y$ are related, then $\omega_i$ must be positive. For ease of presentation, they assumed that the sample predictors are all standardised; that is, $n^{-1}\sum_{j=1}^nX_{ji}=0$ and $n^{-1}\sum_{j=1}^nX_{ji}^2=1$ for $i=1,\ldots,p$. A natural estimator of $\omega_i$ is $\tilde\omega_i=\frac{1}{n}\sum_{j=1}^n\{\frac{1}{n}\sum_{k=1}^nX_{ki}\mathbb{1}(Y_k<Y_j)\}^2$, $i=1,\ldots,p$, where $X_{ki}$ denotes the $k$th observation of $X_i$. The new method does not require imposing a specific model structure on the regression function, and thus is particularly appealing for ultrahigh-dimensional regressions. They showed that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses consistency in ranking, which is both useful in its own right and can lead to consistency in selection. Lin et al. (2017) first introduced a large class of models depending on the smallest non-zero eigenvalue $\lambda$ of the kernel matrix of SIR, and then derived the minimax rate for estimating the central space over two classes of models; theirs is the first paper to study the minimax estimation of sparse SIR. Furthermore, they showed that the estimator based on the SIR procedure converges at the rate $d\{(sd+s\log(ep/s))/(n\lambda)\}$, which is the optimal rate for single index models and multiple index models with fixed structural dimension $d$, fixed $s=|A|$ and fixed $\lambda$. However, Lin et al. (2017) only considered the projection loss (Li & Wang, 2007). More importantly, their theoretical study is based on the assumption that the covariance matrix is diagonal.
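The utility $\tilde\omega_i$ is straightforward to compute in vectorised form; the sketch below (standardising the predictors first, as assumed above) is our own illustration rather than the authors' code.

```python
import numpy as np

def sirs_utilities(X, Y):
    """Screening utilities of Zhu et al. (2011):
    omega_i = (1/n) sum_j [ (1/n) sum_k X_ki 1(Y_k < Y_j) ]^2,
    computed on standardised predictors."""
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardise each predictor
    Ind = (Y[:, None] < Y[None, :]).astype(float)  # Ind[k, j] = 1(Y_k < Y_j)
    Omega = Z.T @ Ind / n                          # Omega[i, j] = (1/n) sum_k Z_ki 1(Y_k < Y_j)
    return (Omega**2).mean(axis=1)                 # omega_i

# rank the predictors by omega and keep the top ones (or those above a threshold)
```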

Before discussing the consistency property, we need some conditions. We take those of Yu, Dong and Shao (2016) as an example.

(C1)

The coverage condition: $\mathrm{Span}\{\Sigma^{-1}E(X\mid Y\in J_\ell),\ \ell=1,\ldots,H\}=\mathcal{S}_{Y|X}$.

(C2)

There exist $0<c<1/4$ and $0<q<\infty$ such that $E\{\exp(tX_k)\}\le q$ for all $|t|\le c$, $k=1,\ldots,p$. In addition, there exist positive constants $\lambda_{\min}$ and $\lambda_{\max}$ such that $0<\lambda_{\min}\le\lambda_{\min}(\Sigma)\le\lambda_{\max}(\Sigma)\le\lambda_{\max}<\infty$, where $\lambda_{\min}(\Sigma)$ and $\lambda_{\max}(\Sigma)$ are the smallest and largest eigenvalues of $\Sigma$, respectively.

(C3)

There exists $0<f<\infty$ such that $\|\Sigma^{-1}\|_1\le f$.

(C4)

There exists $0<g<1-2r$ such that $f^2s^2\log p=O_p(n^g)$, where $s$ is the cardinality of $A$ and $r$ is specified in condition (C5).

(C5)

There exist $0<a_2<\infty$ and $r<1/2$ such that $\min_{k\in A}m_k>2a_2n^{-r}$.

For more details, please refer to Yu, Dong and Shao (2016).

Theorem 4.1

Assume the above conditions hold. Then the estimated active set $\hat A$ satisfies the screening consistency property, $\Pr(A\subseteq\hat A)\to 1$.

Theorem 4.1 is given in Yu, Dong and Shao (2016), Yu, Dong and Zhu (2016), Lin et al. (2017) and Zhu et al. (2011).

5. Minimax rate

Recently, an impressive range of penalised SIR methods has been proposed to estimate the central subspace in a sparse fashion. However, few of them consider sparse sufficient dimension reduction from a decision-theoretic point of view. To address this issue, Tan et al. (2020) established the minimax rates of convergence for estimating the sparse SIR directions under various loss functions commonly used in the sufficient dimension reduction literature. Lin et al. (2019) introduced a simple Lasso regression method to obtain an estimator of the sufficient dimension reduction space, which is only the first step of Tan et al. (2020). Moreover, Tan et al. (2020) discovered the possible trade-off between statistical guarantee and computational performance for sparse SIR and proposed an adaptive estimation scheme for sparse SIR which is computationally tractable and rate optimal under the condition $\log p=o(n)$, which is weaker than that of Lin et al. (2019).

As noted above, the kernel matrix $M_{\mathrm{SIR}}$ can be estimated by $\hat M_{\mathrm{SIR}}=\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell$. It is then natural to estimate $\beta$ by replacing $M$ and $\Sigma$ in (8) with their sample estimators, which yields
(14) $\hat\beta_{\mathrm{SIR}}=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top\hat M_{\mathrm{SIR}}B)$ s.t. $B^\top\hat\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$.
The solution $\hat\beta_{\mathrm{SIR}}$ of (14) is called the natural sparse SIR estimator. The following theorems establish the lower bound and the upper bound of the four loss functions for the natural sparse SIR estimator.
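For a fixed support $S$, the maximum of $\mathrm{Tr}(B^\top\hat M B)$ subject to $B^\top\hat\Sigma B=I_d$ equals the sum of the $d$ largest generalised eigenvalues of $(\hat M_S,\hat\Sigma_S)$, so the natural sparse SIR estimator amounts to an exhaustive search over supports. The toy implementation below makes this explicit and, at the same time, makes clear why the estimator is computationally infeasible beyond small $p$; it is our own illustration, not the algorithm of Tan et al. (2020).

```python
import numpy as np
from itertools import combinations
from scipy.linalg import eigh

def natural_sparse_sir(M_hat, Sigma_hat, s, d):
    """Exhaustive-search version of (14); assumes d <= s."""
    p = M_hat.shape[0]
    best, best_S, best_vecs = -np.inf, None, None
    for S in combinations(range(p), s):
        S = list(S)
        # generalized eigenvalues of (M_S, Sigma_S), ascending order
        vals, vecs = eigh(M_hat[np.ix_(S, S)], Sigma_hat[np.ix_(S, S)])
        obj = vals[-d:].sum()                  # max of Tr(B' M B) with support S
        if obj > best:
            best, best_S, best_vecs = obj, S, vecs[:, -d:]
    B = np.zeros((p, d))
    B[best_S, :] = best_vecs                   # embed the support-restricted solution
    return B, best_S
```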

Theorem 5.1

Assume $n\lambda^2\ge C_0\log\frac{ep}{s}$ for some sufficiently large constant $C_0$. Then there exist positive constants $C$ and $c_0$ such that $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}E_PL_G(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0$, $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_P(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$ and $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_X(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$, where $\mathcal{P}=\mathcal{P}(n,H,s,p,d,\lambda;K,m)$.

Theorem 5.2

Assume that $\frac{s\log(ep/s)}{n\lambda^2}\le c$ for some small constant $c\in(0,1)$. Then for any $C'>0$, there exists a positive constant $C$ such that $\max\{L_G(\hat\beta,\beta),L_P(\hat\beta,\beta),L_X(\hat\beta,\beta),L_C(\hat\beta,\beta)\}\le C\frac{s\log(ep/s)}{n\lambda^2}$ with probability greater than $1-2\exp\{-C'(s+\log(ep/s))\}$ uniformly over $P\in\mathcal{P}(n,H,s,p,d,\lambda;K,m)$.

Since SIR can be rewritten in a least-squares formulation, they finally proposed an adaptive estimation scheme for sparse SIR that is computationally tractable and rate optimal. More details about the adaptive sparse SIR estimator can be found in Tan et al. (2020).

6. Further investigation

6.1. Marginal utility

Motivated by Yu, Dong and Shao (2016), we can extend their method to SAVE and DR. Let $\phi_\ell=E\{XX^\top\mathbb{1}(Y\in J_\ell)\}$. Then $M_{\mathrm{SAVE}}=\sum_{\ell=1}^Hp_\ell\{\Sigma-\mathrm{var}(X\mid Y\in J_\ell)\}^2$ can be written as $M_{\mathrm{SAVE}}=\sum_{\ell=1}^Hp_\ell\{\Sigma-\phi_\ell/p_\ell+\mu_\ell\mu_\ell^\top/p_\ell^2\}^2$. Therefore $m_k^{\mathrm{SAVE}}=e_k^\top\Sigma^{-1}\sum_{\ell=1}^Hp_\ell\{\Sigma-\phi_\ell/p_\ell+\mu_\ell\mu_\ell^\top/p_\ell^2\}^2\Sigma^{-1}e_k$. The marginal utility $m_k$ is estimated by $\hat m_k^{\mathrm{SAVE}}=e_k^\top\hat\Sigma^{-1}\sum_{\ell=1}^H\hat p_\ell\{\hat\Sigma-\hat\phi_\ell/\hat p_\ell+\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell^2\}^2\hat\Sigma^{-1}e_k$, where $\hat\phi_\ell=\sum_{i=1}^nX_iX_i^\top\mathbb{1}(Y_i\in J_\ell)/n$. For a given threshold $b_n$, the active set $A$ is estimated by including the predictors whose $\hat m_k^{\mathrm{SAVE}}$ exceeds $b_n$, i.e., $\hat A=\{k:\hat m_k^{\mathrm{SAVE}}\ge b_n\}$.

Next, we consider marginal DR with the Dantzig selector. Then $M_{\mathrm{DR}}=2\sum_{\ell=1}^Hp_\ell\{E(XX^\top\mid Y\in J_\ell)\}^2+2\{\sum_{\ell=1}^Hp_\ell E(X\mid Y\in J_\ell)E(X^\top\mid Y\in J_\ell)\}^2+2\{\sum_{\ell=1}^Hp_\ell E(X^\top\mid Y\in J_\ell)E(X\mid Y\in J_\ell)\}\{\sum_{\ell=1}^Hp_\ell E(X\mid Y\in J_\ell)E(X^\top\mid Y\in J_\ell)\}-2\Sigma$ can be written as $M_{\mathrm{DR}}=2\sum_{\ell=1}^Hp_\ell(\phi_\ell/p_\ell-\Sigma)^2+2\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}^2+2\{\sum_{\ell=1}^H\mu_\ell^\top\mu_\ell/p_\ell\}\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}$. Therefore $m_k^{\mathrm{DR}}=e_k^\top\Sigma^{-1}[2\sum_{\ell=1}^Hp_\ell(\phi_\ell/p_\ell-\Sigma)^2+2\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}^2+2\{\sum_{\ell=1}^H\mu_\ell^\top\mu_\ell/p_\ell\}\{\sum_{\ell=1}^H\mu_\ell\mu_\ell^\top/p_\ell\}]\Sigma^{-1}e_k$. The marginal utility $m_k$ is estimated by $\hat m_k^{\mathrm{DR}}=e_k^\top\hat\Sigma^{-1}[2\sum_{\ell=1}^H\hat p_\ell(\hat\phi_\ell/\hat p_\ell-\hat\Sigma)^2+2\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}^2+2\{\sum_{\ell=1}^H\hat\mu_\ell^\top\hat\mu_\ell/\hat p_\ell\}\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}]\hat\Sigma^{-1}e_k$. For a given threshold $b_n$, the active set $A$ is estimated by including the predictors whose $\hat m_k^{\mathrm{DR}}$ exceeds $b_n$, i.e., $\hat A=\{k:\hat m_k^{\mathrm{DR}}\ge b_n\}$. Following the proof of Yu, Dong and Shao (2016), we can expect marginal SAVE and DR with the Dantzig selector to achieve selection consistency.
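A sketch of the marginal SAVE utilities described above, estimated by slicing $Y$, is given below; the slicing scheme and $H$ are illustrative choices, and the DR version can be assembled from the same slice quantities $\hat p_\ell$, $\hat\mu_\ell$ and $\hat\phi_\ell$.

```python
import numpy as np

def marginal_save_utilities(X, Y, H=10):
    """Estimate the marginal SAVE utilities m_k^SAVE by slicing Y (X is centred first)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    Sigma_inv = np.linalg.inv(Sigma)
    K = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Y), H):
        p_l = len(idx) / n                                 # hat{p}_l
        mu_l = Xc[idx].sum(axis=0) / n                     # hat{mu}_l
        phi_l = Xc[idx].T @ Xc[idx] / n                    # hat{phi}_l
        A = Sigma - phi_l / p_l + np.outer(mu_l, mu_l) / p_l**2
        K += p_l * (A @ A)                                 # p_l {Sigma - var(X|Y in J_l)}^2
    return np.diag(Sigma_inv @ K @ Sigma_inv)              # m_k^SAVE for k = 1, ..., p
```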

6.2. Minimax rate

Motivated by Tan et al. (2020), we can further investigate the natural sparse SAVE estimator and its upper error bound. Let $E_nX=\frac{1}{n}\sum_{i=1}^nX_i$ and $\hat\Sigma=\frac{1}{n-1}\sum_{i=1}^n(X_i-E_nX)(X_i-E_nX)^\top$ be the sample mean and sample covariance of $X$. The SAVE kernel matrix $M$ is then estimated by $\hat M_{\mathrm{SAVE}}=\sum_{\ell=1}^H\hat p_\ell\{\hat\Sigma-\hat\phi_\ell/\hat p_\ell+\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell^2\}^2$. Similarly, the DR kernel matrix is estimated by $\hat M_{\mathrm{DR}}=2\sum_{\ell=1}^H\hat p_\ell(\hat\phi_\ell/\hat p_\ell-\hat\Sigma)^2+2\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}^2+2\{\sum_{\ell=1}^H\hat\mu_\ell^\top\hat\mu_\ell/\hat p_\ell\}\{\sum_{\ell=1}^H\hat\mu_\ell\hat\mu_\ell^\top/\hat p_\ell\}$. It is then natural to estimate $\beta$ by replacing $M$ and $\Sigma$ in (8) with their sample estimators, which yields
(15) $\hat\beta_{\mathrm{SAVE}}=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top\hat M_{\mathrm{SAVE}}B)$ s.t. $B^\top\hat\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$; $\quad\hat\beta_{\mathrm{DR}}=\arg\max_{B\in\mathbb{R}^{p\times d}}\mathrm{Tr}(B^\top\hat M_{\mathrm{DR}}B)$ s.t. $B^\top\hat\Sigma B=I_d$ and $|\mathrm{Supp}(B)|\le s$.
The solutions $\hat\beta_{\mathrm{SAVE}}$ and $\hat\beta_{\mathrm{DR}}$ of (15) are called the natural sparse SAVE and DR estimators. The following theorems establish the lower bound and the upper bound of the four loss functions for the natural sparse SAVE and DR estimators.

Theorem 6.1

Assume $n\lambda^2\ge C_0\log\frac{ep}{s}$ for some sufficiently large constant $C_0$. Then there exist positive constants $C$ and $c_0$ such that $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}E_PL_G(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0$, $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_P(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$ and $\inf_{\hat\beta}\sup_{P\in\mathcal{P}}P\{L_X(\hat\beta,\beta)\ge C\frac{s\log(ep/s)}{n\lambda^2}\wedge c_0\}\ge 0.8$, where $\mathcal{P}=\mathcal{P}(n,H,s,p,d,\lambda;K,m)$.

Theorem 6.2

Assume that $\frac{s\log(ep/s)}{n\lambda^2}\le c$ for some small constant $c\in(0,1)$. Then for any $C'>0$, there exists a positive constant $C$ such that $\max\{L_G(\hat\beta,\beta),L_P(\hat\beta,\beta),L_X(\hat\beta,\beta),L_C(\hat\beta,\beta)\}\le C\frac{s\log(ep/s)}{n\lambda^2}$ with probability greater than $1-2\exp\{-C'(s+\log(ep/s))\}$ uniformly over $P\in\mathcal{P}(n,H,s,p,d,\lambda;K,m)$. Here $\hat\beta$ is constructed as in (15).

Following Tan et al. (2020), $\hat\beta$ in (15) is rate optimal under the general loss, the projection loss and the prediction loss. Moreover, the natural sparse SAVE estimator $\hat\beta_{\mathrm{SAVE}}$ and the natural sparse DR estimator $\hat\beta_{\mathrm{DR}}$ can be regarded as rate-optimal estimators of the SAVE and DR directions, respectively. However, the estimation procedure (15) depends on the unknown sparsity parameter $s$ and is computationally infeasible, as it involves an exhaustive search over all $B\in\mathbb{R}^{p\times d}$ subject to the sparsity constraint. Tan et al. (2020) defined a refined sparse SIR estimator based on the fact that SIR can be viewed as transformation-based projection pursuit. Since SAVE and DR cannot be rewritten in a least-squares formulation, we do not define refined sparse SAVE and DR estimators.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research is supported by the National Natural Science Foundation of China Grant 11971170, the 111 project B14019 and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

Notes on contributors

Lu Li

Lu Li is currently a Ph.D student at School of Statistics, East China Normal University.

Xuerong Meggie Wen

Dr Xuerong Meggie Wen is currently an associate professor of Statistics at Dept. of Mathematics and Statistics, Missouri University of Science and Technology.

Zhou Yu

Dr Zhou Yu is a Professor of Statistics at School of Statistics, East China Normal University.

References
