Abstract
The sufficient dimension reduction (SDR) with sparsity has received much attention for analysing high-dimensional data. We study a nonparametric sparse kernel sufficient dimension reduction (KSDR) based on the reproducing kernel Hilbert space, which extends the methodology of sparse SDR based on inverse moment methods. We establish the statistical consistency and efficient estimation of the sparse KSDR under the high-dimensional setting where the dimension diverges as the sample size increases. Computationally, we introduce a new nonconvex alternating direction method of multipliers (ADMM) to solve the challenging sparse SDR and propose a nonconvex linearised ADMM to solve the more challenging sparse KSDR. We study the computational guarantees of the proposed ADMMs and show an explicit iteration complexity bound to reach the stationary solution. We demonstrate the finite-sample properties in simulation studies and a real application.
1. Introduction
Consider the sufficient dimension reduction (SDR) of a high-dimensional random vector in the presence of response . We aim to find the low-dimensional central subspace such that and , where the dimension p may diverge as the sample size increases, and is the projection of a vector onto the subspace . The inverse moment methods have been extensively studied under the fixed-dimension setting, including the sliced inverse regression (SIR) (Li 1991), sliced average variance estimation (SAVE) (Cook and Weisberg 1991), and directional regression (DR) (Li and Wang 2007). Fan, Xue, and Yao (2017), Luo, Xue, Yao, and Yu (2022), and Yu, Yao, and Xue (2022) extended these inverse moment methods and introduced sufficient forecasting using factor models for high-dimensional predictors when both . Ying and Yu (2022), Zhang, Xue, and Li (2024), and Zhang, Li, and Xue (2024) studied sufficient dimension reduction for Fréchet regression and distribution-on-distribution regression. Lin, Zhao, and Liu (2018, 2019) introduced the Lasso-SIR under the high-dimensional setting when , and the diagonal thresholding SIR (DT-SIR), which employs a new diagonal thresholding screening under the ultra-high-dimensional setting when . However, inverse moment methods require the restrictive linearity condition on the conditional expectation of given . To remove the linearity condition, the kernel sufficient dimension reduction (KSDR) was proposed by Fukumizu, Bach, and Jordan (2009). The nonconvex KSDR requires neither the linearity condition nor an elliptical distribution of , but it poses a significant computational challenge. Wu, Miller, Chang, Sznaier, and Dy (2019) solved this problem using an iterative spectral method, but such a method is hard to extend when a sparsity assumption is imposed in high-dimensional settings.
Shi, Huang, Feng, and Suykens (2019) studied a nonconvex penalty in a kernel regression problem. Although neither its theoretical analysis nor its optimisation algorithm applies in SDR settings, it shows the promise of nonconvex penalised KSDR in high-dimensional settings.
The sparse estimation of the central subspace has received considerable attention in the past decade. The main focus of the literature is on sparse SDR based on inverse moment methods, which is equivalent to solving for the generalised eigenvectors of the estimated kernel matrix (Li 2007). For example, the kernel matrix for SIR is the conditional covariance matrix of given y. In view of such equivalence, Li (2007) followed the spirit of the sparse principal component analysis (SPCA) (Zou, Hastie, and Tibshirani 2006; Zou and Xue 2018) to propose the sparse SDR using penalisation. Bondell and Li (2009) proposed a shrinkage inverse regression estimation and studied its asymptotic consistency and asymptotic normality. Chen, Zou, and Cook (2010) proposed a coordinate-independent sparse estimation method and established the oracle property when the dimension p is fixed. Lin et al. (2019) established the consistency of the Lasso-SIR by assuming the number of slices also diverges. Neykov, Lin, and Liu (2016) followed the convex semidefinite programming (d'Aspremont, Ghaoui, Jordan, and Lanckriet 2005; Amini and Wainwright 2008) to give a high-dimensional analysis of the sign consistency of the DT-SIR in a single-index model. Tan, Wang, Zhang, Liu, and Cook (2018) proposed a sparse SIR estimator based on a convex formulation and established the Frobenius-norm convergence of the projection subspace. Lin et al. (2018) established the consistency of DT-SIR by assuming a stability condition for . Tan, Shi, and Yu (2020) proved the optimality of the -constrained sparse solution and provided a three-step adaptive estimation to approximate the solution with a near-optimal property. In contrast, sparse estimation for KSDR has not been well studied.
In this paper, we propose a nonconvex estimation scheme for sparse KSDR, motivated by the nonconvex estimation of sparse SDR (Li 2007). The proposed methods target the ideal -constrained estimator and can be efficiently solved with statistical and computational guarantees after approximation. We follow d'Aspremont et al. (2005) to reformulate sparse SDR and sparse KSDR as the -constrained trace maximisation problem. It is worth pointing out that KSDR is also equivalent to solving for the eigenvectors of the nonconvex residual covariance estimator in the RKHS (Fukumizu et al. 2009). The residual covariance estimator in the RKHS plays a role similar to the kernel matrix in inverse moment methods. Thus, sparse SDR solves an -constrained convex M-estimation problem, whereas sparse KSDR solves an -constrained nonconvex M-estimation problem. We provide a high-dimensional analysis of the global solution of the -constrained SDR and prove its asymptotic properties, including both variable selection consistency and the convergence rate. We demonstrate the power of this general theory by obtaining explicit convergence rates for the -constrained SDR under the high-dimensional setting where the dimension p diverges. Furthermore, we also provide a high-dimensional analysis of the global solution of the -constrained KSDR and prove its asymptotic properties, such as consistency in variable selection and estimation, without assuming the linearity condition.
Computationally, the -constrained trace maximisation is NP-hard. There has been some work showing success in adjusting the weights of coefficients adaptively, using convex penalties such as the adaptive lasso (Zou 2006) and adaptive ridge regression (Frommlet and Nuel 2016). However, the performance of adaptive approaches relies heavily on the accuracy of the initial estimation. Seeking a more robust penalised approximation without an adaptive procedure, we use the folded concave penalisation (Fan and Li 2001; Fan, Xue, and Zou 2014) as an alternative to the penalisation in Li (2007), Lin et al. (2019), and Lin et al. (2018). We propose a novel nonconvex alternating direction method of multipliers (ADMM) to solve the folded concave penalised SDR and a nonconvex linearised ADMM to solve the folded concave penalised KSDR. In particular, sparse KSDR has not only a nonconvex loss function but also a folded concave penalty and a positive definite constraint. We provide important convergence guarantees for the proposed nonconvex ADMMs for solving these two challenging nonconvex and nonsmooth problems. We also establish explicit iteration complexity bounds for the proposed nonconvex ADMM and its linearised variant to reach the stationary solution of sparse SDR and sparse KSDR, respectively. To the best of our knowledge, our work is the first in the literature to study the challenging nonconvex and nonsmooth optimisation problem for sparse KSDR.
The rest of this paper is organised as follows. We revisit the methods for sparse SDR in Section 2 and present the proposed methods for sparse KSDR in Section 3. The proposed nonconvex ADMM and its linearised variant are presented in Section 4, and statistical and computational guarantees are presented in Section 5. Section 6 evaluates the numerical performance in several different simulation settings. Section 7 applies sparse KSDR to a real data analysis and closes with a few concluding remarks.
Before proceeding, we define some useful notation. Let , and let be the cardinality of a set . For a matrix , let be the maximum norm, the matrix norm, the spectral norm, the Frobenius norm, and the trace norm, defined as . Let denote the norm in the d-dimensional Hilbert space. Estimating the central subspace is equivalent to finding a low-rank matrix of full column rank such that . We may also denote the central subspace by , the column space of the matrix .
2. Preliminaries
This section first revisits the sparse SDR methods in Li (2007) and then follows d'Aspremont et al. (2005) to reformulate sparse SDR via the trace maximisation problem. Note that SDR aims to estimate the linear combination such that , given independently and identically distributed data . Li (2007) showed that SDR methods based on inverse moments, such as SIR, DR, SAVE, and their variants, essentially solve for the generalised eigenvectors of a certain kernel matrix. When the product of the inverse covariance matrix and the kernel matrix is a normal matrix, the problem is also equivalent to solving for an eigenvector of a certain kernel matrix. More specifically, at the population level, the kernel matrix of SIR is ; the kernel matrix of DR is , where and is the covariance matrix of all predictors ; the kernel matrix of DT-SIR (Lin et al. 2018) is , where is the subset of random variables in after the diagonal thresholding.
Given the finite sample size n, we obtain an empirical version of such kernel matrices by replacing the population covariance matrix with the sample covariance matrix , and all means with empirical means . For example, . Li (2007) showed that these SDR methods are equivalent to solving for the leading eigenvector of the empirical version of such kernel matrices.
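As an illustration of such an empirical kernel matrix, the classical SIR recipe (covariance of slice means of the centred predictors, slicing on the sorted response) can be sketched as follows. This is a hedged sketch of the standard SIR construction, not necessarily the paper's exact estimator; the helper name `sir_kernel_matrix` and the slicing scheme are our own choices.

```python
import numpy as np

def sir_kernel_matrix(X, y, n_slices=5):
    # Sketch of the empirical SIR kernel matrix: the covariance of the
    # slice means of the centred predictors, with slices formed by
    # sorting the response y.
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Xc[idx].mean(axis=0)          # slice mean of centred predictors
        M += (len(idx) / n) * np.outer(m, m)
    return M
```

By construction the estimate is symmetric and positive semidefinite, so its leading eigenvector is well defined.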
After estimating the kernel matrix , we study the sparse estimation of the leading SDR direction, i.e. the leading eigenvector of . We use the well-known fact that solving for the leading eigenvector is equivalent to trace optimisation (Wold, Esbensen, and Geladi 1987) as follows: To impose sparsity, we introduce a constraint in the following trace maximisation to penalise the cardinality of non-zero elements in the vector : Note that . With , it is then equivalent to solving for the positive semidefinite estimate from , where is defined as the element-wise norm. When and , implies , and is equivalent to . However, the rank-one constraint is hard to handle efficiently in practice. Therefore, we consider a relaxation that removes this constraint. Instead, we solve for the dominant eigenvector of as , which is the SDR direction of our interest. In general, this relaxation tends to have a solution very close to a rank-one matrix (d'Aspremont et al. 2005).
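The equivalence between the leading eigenvector and the trace maximisation can be checked numerically; the short sketch below (our own illustration, using NumPy's `eigh`) shows that the trace objective evaluated at the leading eigenvector equals the largest eigenvalue.

```python
import numpy as np

def leading_eigvec(M):
    # The maximiser of tr(M v v^T) over unit vectors v is the leading
    # eigenvector of the symmetric matrix M.
    _, vecs = np.linalg.eigh(M)   # eigenvalues returned in ascending order
    return vecs[:, -1]

M = np.diag([5.0, 1.0, 0.5])      # toy kernel matrix with dominant direction e1
v = leading_eigvec(M)
obj = np.trace(M @ np.outer(v, v))  # attains the largest eigenvalue of M
```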
After removing the rank-one constraint, it is equivalent to solving from the following penalised estimation problem for some sufficiently large γ: (1) where, given , we follow d'Aspremont et al. (2005) to truncate it to obtain the dominant (sparse) eigenvector of , which will be the leading sparse SDR direction of our interest.
In the sequel, given , we solve for the following sparse SDR directions in a sequential fashion. We denote , and for , let be the projection of onto the space orthogonal to the previous SDR directions. To remove the effect of the previous eigenvectors, is estimated as the leading (sparse) eigenvector of the matrix , which is a Hotelling's deflation by the previous vectors: Mackey (2009) studied the property of Hotelling's deflation in this type of sparse eigenvector estimation problem. In particular, when we estimate the sparse eigenvector on the true support, this deflation helps eliminate the influence of a given eigenvector by setting the associated eigenvalue to zero, achieving . For the sparse estimation of , we plug into (1) and then solve from: which is further equivalent to the following constrained problem under the same rank relaxation: (2) Problem (2) is still nonconvex since the constraint is included. To address the computational challenge, one existing relaxation uses a folded concave penalty to approximate the constraint. Let be a folded concave penalty function such as the SCAD (Fan and Li 2001) or MCP (Zhang 2010), where is the matrix obtained by taking absolute values of the elements of .
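Hotelling's deflation step can be sketched in a few lines. This is an illustration of the deflation formula M − (vᵀMv)vvᵀ with our own helper name; it zeroes the eigenvalue associated with a given unit eigenvector so the next leading eigenvector can be extracted.

```python
import numpy as np

def hotelling_deflate(M, v):
    # Eliminate the influence of a unit eigenvector v by setting its
    # associated eigenvalue to zero: M_new = M - (v' M v) v v'.
    lam = v @ M @ v
    return M - lam * np.outer(v, v)

M = np.diag([5.0, 3.0, 1.0])
v1 = np.array([1.0, 0.0, 0.0])   # leading eigenvector of M
M1 = hotelling_deflate(M, v1)    # v1 now has eigenvalue 0 in M1
```

After deflation, the leading eigenvector of `M1` is the second eigenvector of the original matrix.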
Specifically, in this paper, we consider the element-wise SCAD penalty: where a is a constant, usually chosen as 3.7 in the SCAD penalty. In Section 5, we discuss how such penalisation can well approximate the penalisation. For simplicity, we use the SCAD penalty in this paper. It is worth mentioning that other nonconvex penalties and approximations, such as the MCP (Zhang 2010) and coefficient thresholding (Liu, Zhang, Xue, Song, and Kang 2024), can also be applied here to achieve the same theoretical property and similar numerical performance.
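The element-wise SCAD penalty of Fan and Li (2001) can be written as a short vectorised function; the helper name `scad_penalty` is our own, and the three-branch form below is the standard one.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    # Element-wise SCAD penalty (Fan and Li, 2001):
    #   lam*|t|                                if |t| <= lam
    #   (2*a*lam*|t| - t^2 - lam^2)/(2(a-1))   if lam < |t| <= a*lam
    #   lam^2*(a+1)/2                          if |t| > a*lam
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))
```

The penalty is linear near zero (like the lasso), then flattens and saturates at λ²(a+1)/2, which is what makes it a folded concave surrogate for the cardinality constraint.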
With this approximation, we solve from the following folded concave penalised estimation of sparse SDR: (3) and then for , we sequentially solve for , given , from Without loss of generality, this idea of estimating the leading sparse eigenvectors can also be applied to the sparse principal component analysis (PCA) problem.
Theorem 4 will provide the computational guarantee for the above approximation to penalisation. Section 6 will also present the numerical performance.
3. Methodology
The methodology and theory of sparse SDR depend on the linearity condition and the constant variance condition, which usually hold when the predictors follow elliptical distributions; however, these elliptical conditions can be violated in many cases. In this section, following the framework built by Fukumizu et al. (2009) and Li, Chun, and Zhao (2012), we extend the framework to achieve sparse KSDR, which requires neither the linearity nor the constant variance condition. The key idea is to construct a conditional covariance operator in a reproducing kernel Hilbert space (RKHS), which evaluates the amount of information in y that cannot be explained by functions of linear combinations of . We then derive the empirical estimator of this conditional covariance operator with finite samples and finally minimise the trace of this estimator under sparsity constraints.
First, we provide the necessary definitions in the RKHS to construct the conditional covariance operator. Let and be positive definite kernel functions, and let and be the RKHSs generated by and , such that satisfies the reproducing property: and satisfies the corresponding reproducing property in the space . Then define to be the variance operator of , which satisfies the relation for all , and similarly for . Define as the covariance operator satisfying for all and , and let be the adjoint operator of . Based on Baker (1973), there exist unique bounded operators and satisfying: Then, the conditional covariance operator can be defined as: For convenience, we sometimes write it as: where is the pseudo-inverse of . Specifically, is the Tychonoff-regularised inverse of , given some sufficiently small constant .
Also, define the sparse orthogonal space for i=1,…,d, representing the leading d sparse SDR directions. When d = 1, it represents the leading sparse SDR vector.
In what follows, we study the sparse estimation of such that . To simplify notation, we introduce , and , , and are defined in the same manner as , , and . Then we have: As shown in Theorem 4 of Fukumizu et al. (2009), if and are sufficiently rich RKHSs, then for any , we have This conclusion also holds for our sparse orthogonal space, under the sparsity assumption on the SDR vectors.
This result indicates that such a conditional covariance operator is a good estimator for sufficient dimension reduction at the population level. However, the population-level operator is not available in practice. At the sample level, we need to derive the empirical version of this operator in coordinate representation and minimise it. To start, we have the coordinate representation of the kernel matrix as: Let be the centralised Gram matrix, where . Similarly, we define . Following Lemma 1 of Li et al. (2012), under the spanning systems: the variance and covariance operators can be written as: Substituting these coordinate representations into , we derive the coordinate representation of as: where is the Tychonoff-regularised inverse matrix of given some sufficiently small constant . We then minimise the trace of to estimate the sparse SDR directions from .
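The building blocks above can be sketched numerically. The code below is a minimal illustration assuming Gaussian (RBF) kernels: it forms double-centred Gram matrices and evaluates a regularised trace surrogate of the form tr[G_y (G_x + nεI)⁻¹], a common simplification in the kernel dimension reduction literature rather than the paper's exact coordinate representation; the helper names and bandwidth are our own choices.

```python
import numpy as np

def centred_gram(Z, sigma=1.0):
    # RBF Gram matrix, double-centred: G_c = H G H with H = I - 11'/n.
    sq = np.sum(Z ** 2, axis=1)
    G = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma ** 2))
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def kdr_trace(Xb, y, eps=1e-3):
    # Trace surrogate for the conditional covariance operator:
    # smaller values mean the projected predictors Xb explain y better.
    n = y.shape[0]
    Gx = centred_gram(Xb)
    Gy = centred_gram(y.reshape(-1, 1))
    return np.trace(Gy @ np.linalg.inv(Gx + n * eps * np.eye(n)))
```

When the projection keeps the coordinate that drives y, this trace is markedly smaller than under an uninformative projection, which is exactly the signal the KSDR objective exploits.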
To deal with the orthogonality constraint in , we use a sequential procedure similar to that in Section 2. To this end, we solve from the following trace minimisation to minimise the sum of conditional variances: where It is worth pointing out the equivalence between the above trace minimisation problem and the following regularised nonparametric regression problem: This equivalence provides insight into sparse KSDR, and it is also useful for establishing the theoretical properties in Section 5.
In sparse KSDR, we assume y is a function of . In this case, are the true sparse SDR vectors of our interest. We have introduced how to estimate ; here, we introduce how to solve for these non-leading SDR directions sequentially.
Given that we have already estimated the orthogonal sufficient directions for , we sequentially solve for the following KSDR direction from the minimisation of . By the orthogonality of all sufficient directions, . Hence, we only need to solve Then we project onto the space orthogonal to all previous KSDR vectors, and denote the result as the final estimator .
Here, we justify why this projection procedure is correct. Define as the projection of onto the direction orthogonal to all previous k KSDR vectors, such that . By the definition of the conditional covariance operator, we have: Since the direction non-orthogonal to the previous KSDR directions has no effect on reducing the trace, we only need to consider the projection for the sparse estimation of the following KSDR direction . The projection procedure guarantees that .
Finally, we discuss handling the constraint and the rank-one constraint. Recall that is a folded concave penalty. In practice, we solve the following folded concave relaxation of sparse KSDR to approximate the constraint inside , and solve from which can be rewritten as a folded concave penalised estimation problem: Given , we follow Section 2 to truncate it to obtain the first KSDR direction . Next, in the sequential fashion, given for , we solve from where we use the fact that . We then compute the projection and solve for the dominant (sparse) eigenvector as the desired sparse KSDR direction .
4. Nonconvex optimisation
Section 4 first proposes an efficient nonconvex ADMM to solve the nonconvex optimisation of sparse SDR and then presents an efficient nonconvex linearised ADMM to solve the even more challenging nonconvex optimisation of sparse KSDR. Both nonconvex ADMMs enjoy important computational guarantees, which will be presented in Section 5.
4.1. Nonconvex ADMM for sparse SDR
In this section, we propose a novel nonconvex ADMM to solve the challenging nonconvex and nonsmooth optimisation in sparse SDR. The proposed nonconvex ADMM enjoys convergence guarantees to the ‘ϵ-stationary solution’ (see the definition in Theorem 5) with an explicit iteration complexity bound. Convergence guarantees for nonconvex algorithms vary from algorithm to algorithm. The ϵ-stationary solution introduced in Jiang, Lin, Ma, and Zhang (2019) is an applicable notion that depicts convergence in terms of the Lagrangian function. More details about the ϵ-stationary solution and computational guarantees can be found in Theorem 5 of Section 5.1.
To begin, we introduce a new variable and also a slack variable . By adding the equality constraint , (3) is equivalent to: The slack variable plays an essential role in the convergence guarantee of our proposed nonconvex ADMM. Accordingly, the augmented Lagrange function is where is the Lagrange variable and . Given the initial value , we solve iteratively for as follows: where h>0 is chosen to guarantee the convexity of the step, and γ and ρ are chosen to guarantee convergence. Their choices will be discussed in Theorem 5. Note that the step and the step are straightforward. Fortunately, we also have closed-form solutions in both the step and the step. In the step, we know that which can be solved analytically by using the univariate folded concave thresholding rule More specifically, we have In the step, it is easy to see that which can be solved by projection onto the simplex in the Euclidean space (see Algorithm 1 of Ma 2013). To solve , we take the singular value decomposition of as , and analytically solve the projection of onto the simplex in the Euclidean space. Denoting , we have the closed-form solution: Then we obtain the closed-form solution .
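The Euclidean projection onto the probability simplex used in this step has a well-known O(n log n) closed form; the sketch below follows the standard sort-and-threshold construction (as in Algorithm 1 of Ma 2013 and related work), with our own helper name.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}.
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u * ks > css - 1)[0][-1]  # last index kept active
    theta = (css[rho] - 1) / (rho + 1)         # uniform shift
    return np.maximum(v - theta, 0.0)
```

Applied to the eigenvalues from the singular value decomposition, this yields the closed-form update for the corresponding ADMM step.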
In the step, we update by solving the derivative: We now summarise the proposed nonconvex ADMM in Algorithm 1.
4.2. Nonconvex linearised ADMM for sparse KSDR
In this section, we provide a detailed derivation of the nonconvex linearised ADMM algorithm for sparse KSDR. We present the algorithm for solving the leading KSDR vector; the algorithm for the following vectors is similar. To begin, we introduce new variables and and slack variables and . By adding two equality constraints that , (2.7) is equivalent to: The slack variables and are essential for the convergence guarantee of our proposed algorithm. Accordingly, the augmented Lagrange function is where and are the Lagrange variables and .
It is important to point out that introduces additional computational challenges since the step does not lead to a closed-form solution. To this end, we introduce a linearised counterpart of this trace function as After plugging in this linearised term, we denote the new Lagrange function as Given the initial solution , we solve iteratively for as follows: where h>0 is chosen to guarantee the convexity of the step and are chosen to guarantee algorithmic convergence (see Theorem 10). In what follows, we present the closed-form solution for each subproblem.
For the step, with the linearisation, can be solved by taking derivatives: Defining , so that , the derivatives can be obtained by the chain rule of matrix derivatives: which has a complicated closed form. For space considerations, the detailed calculation of the closed-form derivatives and their numerical approximation is deferred to the supplement. Based on , we can obtain the closed-form derivative. By solving the linearisation, we obtain the minimiser: For the step, it is equivalent to solving the subproblem This can be solved efficiently by using the univariate penalised least squares solution: where has been defined in Section 4.1. For the step, we solve the optimisation problem: It can be solved by projection onto the simplex in the Euclidean space; we presented exactly the same algorithm in Section 4.1, so we omit it here. The and steps can be solved straightforwardly: We now summarise the proposed nonconvex linearised ADMM in Algorithm 2:
5. Theoretical properties
In this section, we establish theoretical guarantees for both the sparse SDR and sparse KSDR estimators. For both SDR and KSDR, we first study the statistical guarantees of the -constrained estimator, showing its consistency while allowing p to diverge as . Then, we study how our folded concave penalised estimator approximates the -constrained estimator and prove the computational guarantees of our proposed nonconvex ADMM and its linearised version.
5.1. Asymptotic properties of sparse SDR
We study the statistical properties of the -constrained SDR, such as the consistency and convergence rate, as well as the computational guarantees of the folded concave penalised SDR in this subsection. The -constrained SDR achieves the desired theoretical properties under the high-dimensional setting, while the folded concave penalised SDR is computationally feasible and asymptotically converges to the -constrained SDR. First, we define the true sufficient dimension vectors. For the selected method, d is the smallest integer such that the top d eigenvectors of are for , with for . Then by definition, are the true sufficient vectors. We now introduce assumptions (A1)–(A3).
(A1)
is a linear function of , for any .
(A2)
is degenerate.
(A3)
The true sufficient dimension vectors satisfy for .
Note that (A1), (A2), and (A3) are also called the linearity condition, the constant variance condition, and the sparsity condition, respectively. (A1) and (A2) hold when follows an elliptical distribution. (A3) is a standard assumption in the sparse SDR literature. See Li (1991), Li and Wang (2007), Li (2007), and Luo et al. (2022) for justifications of these conditions.
Let . We first study the convergence rate of the first estimated sparse SDR direction . Theorem 1 derives this estimation bound based on the general estimation bound that .
Theorem 5.1
Suppose (A1)–(A3), , , , and the spectral decomposition for exists. Then we have:
In practice, has been derived for each specific SDR method. Corollary 1 discusses the explicit convergence rate, given the explicit convergence rate for different methods.
Corollary 5.2
Suppose (A1)–(A3), , , covariance matrix , and as , then for sparse SIR and sparse DR, we have:
Let correspond to the stability of the central curve . When , with a proper choice of the number of slices, and when (A3)–(A4) in Lin et al. (2018) are satisfied, for sparse SIR, we have: Further, if , we have the convergence rate:
Let , where . Under the high-dimensional setting that , assuming (A3)–(A6) and (S1) in Lin et al. (2018) and , then for the sparse DT-SIR, with a proper choice of , we have the convergence rate:
where the stability v of the central curve is defined in Definition 1 of Lin et al. (2019). We also provide its complete definition in the supplemental material.
It is worth mentioning that our framework is very flexible. By applying different SDR methods to estimate the matrix, we can derive the convergence rate under different settings. Let be the support set of . We further prove the variable selection consistency in Theorem 2.
Theorem 5.3
Under the assumptions of Theorem 1, further if for some constant , and , then with probability tending to 1 as .
Next, recall that in Section 2, for , the same optimisation procedure allows us to further estimate from . We study the estimation bound and consistency for () in Theorem 3.
Theorem 5.4
Suppose that the conditions in Theorems 1 and 2 are satisfied, , and for . Then for , we have: where and are defined in Section 2.
Now, we provide the computational guarantees of our proposed nonconvex ADMM. First, we show that the folded concave penalised estimation well approximates the penalisation. To this end, we use the rescaled folded concave penalty function defined in Section 2. It is easy to see for any . Thus, converges to in the penalisation. For a matrix , we define the penalisation , which is the sum of the univariate rescaled penalisation over all entries. Here, we reveal the connection between the penalised programming (1) and the scaled folded concave penalised programming (3) in the following theorem.
Theorem 5.5
Suppose we choose to be a sequence such that . For the programming (3), we choose the penalisation function as defined above. Let be the global minimiser of (3) when we choose , and assume is a limit point of . Then is also a global minimiser of (1).
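The pointwise convergence of the rescaled folded concave penalty to the cardinality penalty behind this theorem can be verified numerically. The sketch below uses the SCAD form with our own helper name: dividing the penalty by its saturation value λ²(a+1)/2 yields a function that equals 1 exactly once λ < |t|/a, and equals 0 at t = 0.

```python
def scad_rescaled(t, lam, a=3.7):
    # SCAD penalty divided by its saturation value lam^2*(a+1)/2.
    # As lam -> 0, this converges to the L0 indicator 1{t != 0}.
    t = abs(t)
    if t <= lam:
        p = lam * t
    elif t <= a * lam:
        p = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    else:
        p = lam ** 2 * (a + 1) / 2
    return p / (lam ** 2 * (a + 1) / 2)

# Any fixed t != 0 eventually falls in the saturated region as lam shrinks.
vals = [scad_rescaled(0.5, lam) for lam in (1.0, 0.1, 0.01)]
```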
Since the nonconvex problem (3) has widely varying curvature, it is nontrivial to prove the algorithmic convergence. By taking advantage of the structured nonconvexity, the next theorem shows that our proposed nonconvex ADMM converges to an ϵ-stationary solution with an explicit iteration complexity bound. In the following theorem, we use to represent the vectorisation of a matrix from to a vector .
Theorem 5.6
Let L be the Lipschitz bound of the loss function, and For any , with no more than iterations, our proposed algorithm obtains an ϵ-stationary solution , as defined in Jiang et al. (2019), for any matrix with finite Frobenius norm satisfying:
Remark: When , the constraint exactly implies the KKT condition. Thus, the solution will be a stationary solution. We will show the numerical performance of such a near-stationary solution in the next section.
5.2. Asymptotic properties of sparse KSDR
In this section, we establish the consistency of kernel sparse dimension reduction under milder assumptions. Based on Fukumizu et al. (2009), we make the following assumptions (A4)–(A7).
(A4)
For any bounded continuous function g on , the mapping is continuous on for k=1,2,…,d.
(A5)
For any , let , and be the probability distribution of the random variable on . The Hilbert space is dense in for any .
(A6)
There is a measurable function such that and the Lipschitz condition: holds for all and for any k=1,2,…,d, where the kernel is defined in Section 3. D is a distance compatible with the topology of , and we assume .
(A7)
The true sparse SDR directions satisfy .
Note that (A1) and (A2) are not required for the sparse KSDR, and we rewrite the sparsity assumption (A3) in our defined sparse orthogonal space as (A7). In (A4) to (A6), we introduce some mild regularity conditions in the RKHS, which are well justified in Fukumizu et al. (2009).
Theorem 5.7
Under assumptions (A4) to (A7), suppose further that the true variable set is correlated with only a finite number of variables, the chosen kernel function is continuous and bounded, and the regularisation parameter satisfies: Then the functions and are continuous on , and we have the convergence in probability:
Theorem 6 establishes the consistency of the first KDR vector. The growth rate assumption in the theorem says that the growth rate of p is negatively affected by the sparsity level t: when t is smaller, p can diverge faster as . Because of this uniform convergence, we establish the variable selection consistency in the following Theorem 7, allowing the dimension p to diverge.
Theorem 5.8
Under all conditions in Theorem 6, for , such that as , and , we have:
When variable selection consistency is achieved, we can further establish the estimation consistency of :
Theorem 5.9
If all conditions in Theorem 7 hold, and we assume , such that as , then we denote: and define the corresponding matrix . When is nonempty, for any arbitrary open set U in with the true direction , we have (4) For the following directions, let correspond to the true first k KDR vectors, and Denote , and the true space . When are all nonempty for , for any arbitrary open set U in with , we have (5)
Theorem 8 extends the KSDR consistency results of Fukumizu et al. (2009) to the diverging-p case. Making use of the -constrained parameter space, we obtain both variable selection and estimation consistency under certain conditions.
Following the statistical properties, Theorem 9 shows the asymptotic equivalence of the folded concave constraint and the constraint. Theorem 10 shows that our nonconvex linearised ADMM guarantees convergence to an ϵ-stationary solution.
Similar to Theorem 4, for Theorem 9 we consider , where is defined in Section 2. Taking d = 1, we can also solve the sufficient directions iteratively, as introduced in Section 2.
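The one-direction-at-a-time scheme relies on a deflation step in the spirit of Mackey (Citation2009): once a unit direction v is estimated, the target matrix is projected onto the orthogonal complement of v, so the next solve recovers a new direction orthogonal to it. A minimal sketch (the symmetric matrix M here is a generic stand-in for the paper's target matrix, an assumption for illustration):

```python
import numpy as np

def projection_deflate(M, v):
    """Projection deflation: M <- (I - vv') M (I - vv'), so that v lies
    in the null space of the deflated matrix and subsequent leading
    directions are orthogonal to v (v assumed unit-norm)."""
    P = np.eye(M.shape[0]) - np.outer(v, v)
    return P @ M @ P
```

After deflation, the leading eigenvector of the new matrix is orthogonal to v, which is what makes the iterative extraction of d > 1 directions well defined.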
Theorem 5.10
Suppose we choose to be a sequence such that . For the programme (4.12), we choose the penalisation function as we defined. Let be the global minimiser of (4.12) when we choose , and let be a limit point of . Then is also a global minimiser of (4.11).
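Theorem 9 concerns folded concave penalties, of which the SCAD penalty of Fan and Li (Citation2001) is the canonical example. For quick reference (using the conventional default a = 3.7, our assumption), its value is:

```python
def scad(t, lam, a=3.7):
    """SCAD folded concave penalty (Fan and Li, 2001): lasso-like
    linear part near zero, quadratic transition, then a constant tail
    that stops penalising large coefficients."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2
```

The flat tail is what makes the penalty a surrogate for the constraint: beyond a*lam, every coefficient pays the same fixed price, mimicking a count of non-zeros.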
Parallel to Theorem 5, we show the property for the ϵ-stationary solution of our proposed nonconvex ADMM in Theorem 10.
Theorem 5.11
Let L be the Lipschitz constant of the loss function. Let and . For any , within no more than iterations, our proposed algorithm obtains an ϵ-stationary solution , as defined in Jiang et al. (Citation2019), for any with finite Frobenius norm, satisfying that
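To make the ADMM template concrete, the sketch below solves a simpler convex analogue: the semidefinite relaxation of sparse PCA (d'Aspremont et al. Citation2005; Ma Citation2013), via the standard two-block ADMM alternating an eigenvalue projection onto the spectahedron with a soft-thresholding step. This is a convex stand-in for intuition only, not the paper's nonconvex linearised ADMM for KSDR; the parameter choices (rho, the iteration count) are illustrative assumptions.

```python
import numpy as np

def soft(A, tau):
    """Entrywise soft-thresholding (the l1 proximal operator)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def proj_spectahedron(A):
    """Project a symmetric matrix onto {M >= 0, trace(M) = 1} by
    projecting its eigenvalues onto the probability simplex."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(w) + 1)
    cond = u - (css - 1) / idx > 0
    theta = (css[cond][-1] - 1) / idx[cond][-1]
    return V @ np.diag(np.maximum(w - theta, 0.0)) @ V.T

def sparse_pca_admm(S, lam=0.2, rho=1.0, iters=300):
    """ADMM for  max tr(SM) - lam ||M||_1  over the spectahedron:
    M-step projects, Z-step soft-thresholds, U is the scaled dual."""
    n = S.shape[0]
    Z = np.zeros((n, n))
    U = np.zeros((n, n))
    for _ in range(iters):
        M = proj_spectahedron(Z - U + S / rho)  # linear term folded in
        Z = soft(M + U, lam / rho)              # sparsity step
        U = U + M - Z                           # dual ascent
    return Z
```

On a covariance with a sparse leading component, the returned Z concentrates on the true support; the nonconvex linearised variant in the paper replaces the exact M-step with a linearised update but keeps this alternating structure.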
6. Numerical properties
In this section, we evaluate the performance of the sparse SIR/DR/KSDR estimators in our semidefinite relaxation framework, the DT-SIR estimator proposed by Lin et al. (Citation2019), penalised linear regression with the SCAD penalty, and marginal correlation screening.
We compare the numerical performance of these methods in the following four data-generating models, each with sample size n = 100.
Model 1: , where , , and for Model 1a–1c.
Model 2: , where , , and for Model 2a–2c.
Model 3: , where , , and for Model 3a–3c.
Model 4: , where for Model 4a–4c, , and is uniformly distributed on .
To compare the numerical performance of these methods, we replicate the data generation procedure 100 times and apply every method to each of the 100 replications. In each replication, we compute the estimated sparse SDR vectors for each method. We then evaluate the performance of the different methods using the multiple , the True Positive Rate (TPR), and the True Negative Rate (TNR). The multiple is defined as . The TPR is defined as the fraction of variables correctly specified as non-zero in at least one of the SDR vectors, and the TNR is the fraction of variables correctly specified as zero in all SDR vectors.
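The two rates can be computed directly from the estimated directions. A small helper, assuming the estimated SDR vectors are stacked as columns of `B_hat` and the true support is a set of row indices (the names and tolerance are ours):

```python
import numpy as np

def tpr_tnr(B_hat, true_support, tol=1e-8):
    """TPR: fraction of truly active variables that are non-zero in at
    least one estimated SDR vector; TNR: fraction of truly inactive
    variables that are zero in all estimated SDR vectors."""
    p = B_hat.shape[0]
    selected = np.any(np.abs(B_hat) > tol, axis=1)  # active in >= 1 vector
    truth = np.zeros(p, dtype=bool)
    truth[list(true_support)] = True
    tpr = selected[truth].mean() if truth.any() else 1.0
    tnr = (~selected[~truth]).mean() if (~truth).any() else 1.0
    return tpr, tnr
```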
The simulation results are summarised in the following tables, where we report the mean and standard deviation of these three criteria for sparse SIR/DR/KSDR, DT-SIR, and SCAD penalised regression. The marginal correlation method does not yield a multiple , so we report only its true positive/negative rates. The tuning procedure is as follows. For the proposed methods, we choose the penalisation parameter λ via an oracle tuning procedure. Specifically, we construct two independent datasets with the same sample size n = 100 for training and tuning, respectively. For each λ, we estimate the parameter on the training dataset and evaluate the corresponding loss function on the tuning dataset. We then choose the penalisation parameter that minimises the loss on the tuning dataset. In practice, one may use cross-validation to select the penalisation parameter.
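The oracle tuning loop itself is straightforward. It is sketched below with ridge regression standing in for the actual sparse SDR fit (the estimator, loss, and grid are illustrative assumptions, not the paper's objective):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate, a stand-in for the sparse SDR solver."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def tune_loss(beta, X, y):
    """Loss evaluated on the independent tuning set."""
    return np.mean((y - X @ beta) ** 2)

def oracle_tune(X_tr, y_tr, X_tu, y_tu, grid):
    """Fit on the training set for each lambda; pick the lambda whose
    estimate minimises the loss on the independent tuning set."""
    losses = [tune_loss(ridge_fit(X_tr, y_tr, lam), X_tu, y_tu) for lam in grid]
    return grid[int(np.argmin(losses))]
```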
For implementation, sparse SIR/DR/KSDR use our proposed nonconvex ADMM, DT-SIR uses the R package 'LassoSIR', and SCAD penalised regression uses the R package 'ncvreg'. For the marginal correlation method, we compute the correlation between y and each variable , rank the absolute correlations, and select the first d variables, where d is the true dimension.
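The marginal correlation screener used as a baseline amounts to ranking absolute correlations; a sketch (the function name is ours):

```python
import numpy as np

def marginal_screen(X, y, d):
    """Rank variables by |corr(x_j, y)| and return the indices of the
    top d, where d is the assumed true structural dimension."""
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (xc.T @ yc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:d]
```

As the simulation results below illustrate, this screener can only detect variables with non-zero marginal correlation, which is why it fails on purely non-linear signals.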
We also compare the performance of the kernel dimension reduction of Fukumizu et al. (Citation2009) in Model 4(a, b, c) as a benchmark. The average are 0.642, 0.546, and 0.513 for Models 4(a), 4(b), and 4(c), respectively. These results are comparable with the sparse KSDR results; sparse KSDR performs slightly better because the penalisation eliminates some noise from irrelevant variables. Since kernel dimension reduction does not provide a sparse solution, and hence no meaningful TPR or TNR, we do not include it in the main tables.
We draw the following conclusions from the results: (1) sparse SIR and DT-SIR are comparable in most cases, while sparse SIR generally outperforms DT-SIR in . In Models 1 and 3, when the assumption for SIR holds, our sparse SIR is almost perfect. DT-SIR also performs well in these two settings, but sometimes includes many irrelevant variables, leading to unsatisfactory TNR and . (2) Sparse DR is also comparable with SIR in general. Although it may not outperform SIR in Models 1 and 3, when the function is symmetric over in Model 2, SIR fails to recover the true space, while sparse DR remains very stable. (3) Sparse KSDR performs best in Model 4, where the linearity conditions (A1) and (A2) do not hold and both SIR and DR lose consistency; it also performs fairly well compared to SIR and DR in Models 1–3, whether or not the function is symmetric over . (4) Penalised regression fails in Models 2 and 4, where the relationship between and y cannot be approximated by linear regression; moreover, regression cannot identify the linear combination , which is also of interest. (5) Similarly, the marginal correlation method fails in Model 2, where the relationship between and y is completely non-linear and the population correlation is zero; furthermore, correlation-based methods cannot identify the linear combination of interest either.
7. Application to microbiome data analysis
In this section, we apply our method to analyse host transcriptomic data and 16S rRNA gene sequencing data from paired biopsies of IPAA patients with UC and familial adenomatous polyposis (Morgan et al. Citation2015). Morgan et al. (Citation2015) extracted 5 principal components (PCs) of the microbiome 16S rRNA gene sequencing data and analysed the relationship between these PCs and the PCs of the host transcriptomic data. However, the result is hard to interpret, since the PCs are linear combinations of all host transcripts. We therefore apply our proposed sparse KSDR to explore more interpretable sparse linear combinations that are relevant to these microbiome PCs.
First, we select 20 representative genes of the host transcriptomic data from two groups. The first group contains the genes that differ most between the two biopsy locations (PPI and pouch) of a patient according to the Kolmogorov–Smirnov test. The second group contains genes highlighted in Morgan et al. (Citation2015) as important to the pathogenesis of inflammatory bowel disease (IBD). We treat each of the 9 principal components of the microbiome 16S rRNA gene sequencing data as y in our model separately, and apply our sparse kernel dimension reduction estimation to the host transcriptomic data of the PPI for each patient, with sample size n = 196 and p = 20. The selected genes and their coefficients are summarised in the following table.
In the results, the leading PCs (PC1–PC5) are mainly correlated with protein-related genes such as TLR1 and MMEL1, which suggests that these PCs may be mainly composed of an antibiotic-signature microbiome. Several genes widely studied as important to the pathogenesis of IBD are also selected. For example, IL10 is selected for PC1, MUC6 for PC2, 3, 4, and 5, and IL1RN for PC1, 3, and 4. Among them, defective IL10 signalling is known to define a subgroup of patients with IBD (Begue et al. Citation2011); MUC6 may play a role in epithelial wound healing after mucosal injury in IBD, in addition to mucosal protection (Buisine et al. Citation2001); and IL1RN has a well-established pathological role in IBD (Stokkers et al. Citation1998). Our results suggest that the effects of these genes may be actively related to the microbiome environment of IBD patients.
8. Conclusion
Motivated by the desirable properties of sparse SDR, we propose a nonconvex estimation scheme for sparse KSDR. On the statistical side, we prove the asymptotic consistency of both estimators. On the optimisation side, we show that both can be solved efficiently by nonconvex ADMM algorithms with provable convergence guarantees. We also demonstrate the practical usefulness of the proposed methods in various simulation settings and a real data analysis.
Supplemental Material
Acknowledgments
The authors are grateful for the insightful comments and suggestions from the editor and reviewers.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Amini, A.A., and Wainwright, M.J. (2008), ‘High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components’, in Information Theory, 2008. ISIT 2008. IEEE International Symposium on, IEEE, pp. 2454–2458.
- Baker, C.R. (1973), ‘Joint Measures and Cross-Covariance Operators’, Transactions of the American Mathematical Society, 186, 273–289.
- Begue, B., Verdier, J., Rieux-Laucat, F., Goulet, O., Morali, A., Canioni, D., Hugot, J.-P., Daussy, C., Verkarre, V., Pigneur, B., and Fischer, A. (2011), ‘Defective Il10 Signaling Defining a Subgroup of Patients with Inflammatory Bowel Disease’, Official Journal of the American College of Gastroenterology ACG, 106(8), 1544–1555.
- Bondell, H.D., and Li, L. (2009), ‘Shrinkage Inverse Regression Estimation for Model-free Variable Selection’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(1), 287–299.
- Buisine, M., Desreumaux, P., Leteurtre, E., Copin, M., Colombel, J., Porchet, N., and Aubert, J. (2001), ‘Mucin Gene Expression in Intestinal Epithelial Cells in Crohn's Disease’, Gut, 49(4), 544–551.
- Chen, X., Zou, C., and Cook, R.D. (2010), ‘Coordinate-independent Sparse Sufficient Dimension Reduction and Variable Selection’, The Annals of Statistics, 38(6), 3696–3723.
- Cook, R.D., and Weisberg, S. (1991), ‘Comment’, Journal of the American Statistical Association, 86(414), 328–332.
- d'Aspremont, A., Ghaoui, L.E., Jordan, M.I., and Lanckriet, G.R. (2005), ‘A Direct Formulation for Sparse PCA Using Semidefinite Programming’, in Advances in Neural Information Processing Systems, pp. 41–48.
- Fan, J., and Li, R. (2001), ‘Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties’, Journal of the American Statistical Association, 96(456), 1348–1360.
- Fan, J., Xue, L., and Yao, J. (2017), ‘Sufficient Forecasting Using Factor Models’, Journal of Econometrics, 201(2), 292–306.
- Fan, J., Xue, L., and Zou, H. (2014), ‘Strong Oracle Optimality of Folded Concave Penalized Estimation’, The Annals of Statistics, 42(3), 819–849.
- Frommlet, F., and Nuel, G. (2016), ‘An Adaptive Ridge Procedure for L0 Regularization’, PloS One, 11(2), e0148620.
- Fukumizu, K., Bach, F.R., and Jordan, M.I. (2009), ‘Kernel Dimension Reduction in Regression’, The Annals of Statistics, 37(4), 1871–1905.
- Jiang, B., Lin, T., Ma, S., and Zhang, S. (2019), ‘Structured Nonconvex and Nonsmooth Optimization: Algorithms and Iteration Complexity Analysis’, Computational Optimization and Applications, 72(1), 115–157.
- Li, K.-C. (1991), ‘Sliced Inverse Regression for Dimension Reduction’, Journal of the American Statistical Association, 86(414), 316–327.
- Li, L. (2007), ‘Sparse Sufficient Dimension Reduction’, Biometrika, 94(3), 603–613.
- Li, B., Chun, H., and Zhao, H. (2012), ‘Sparse Estimation of Conditional Graphical Models with Application to Gene Networks’, Journal of the American Statistical Association, 107(497), 152–167.
- Li, B., and Wang, S. (2007), ‘On Directional Regression for Dimension Reduction’, Journal of the American Statistical Association, 102(479), 997–1008.
- Lin, Q., Zhao, Z., and Liu, J.S. (2018), ‘On Consistency and Sparsity for Sliced Inverse Regression in High Dimensions’, The Annals of Statistics, 46(2), 580–610.
- Lin, Q., Zhao, Z., and Liu, J.S. (2019), ‘Sparse Sliced Inverse Regression Via Lasso’, Journal of the American Statistical Association, 114(528), 1726–1739.
- Liu, B., Zhang, Q., Xue, L., Song, P.X.-K., and Kang, J. (2024), ‘Robust High-dimensional Regression with Coefficient Thresholding and Its Application to Imaging Data Analysis’, Journal of the American Statistical Association, 119, 715–729.
- Luo, W., Xue, L., Yao, J., and Yu, X. (2022), ‘Inverse Moment Methods for Sufficient Forecasting Using High-dimensional Predictors’, Biometrika, 109(2), 473–487.
- Ma, S. (2013), ‘Alternating Direction Method of Multipliers for Sparse Principal Component Analysis’, Journal of the Operations Research Society of China, 1(2), 253–274.
- Mackey, L.W. (2009), ‘Deflation Methods for Sparse PCA’, in Advances in Neural Information Processing Systems, pp. 1017–1024.
- Morgan, X.C., Kabakchiev, B., Waldron, L., Tyler, A.D., Tickle, T.L., Milgrom, R., Stempak, J.M., Gevers, D., Xavier, R.J., and Silverberg, M.S. (2015), ‘Associations Between Host Gene Expression, the Mucosal Microbiome, and Clinical Outcome in the Pelvic Pouch of Patients with Inflammatory Bowel Disease’, Genome Biology, 16(1), 67.
- Neykov, M., Lin, Q., and Liu, J.S. (2016), ‘Signed Support Recovery for Single Index Models in High-dimensions’, Annals of Mathematical Sciences and Applications, 1(2), 379–426.
- Shi, L., Huang, X., Feng, Y., and Suykens, J. (2019), ‘Sparse Kernel Regression with Coefficient-based Lq-regularization’, Journal of Machine Learning Research, 20(116), 1–44.
- Stokkers, P., Van Aken, B., Basoski, N., Reitsma, P., Tytgat, G., and Van Deventer, S. (1998), ‘Five Genetic Markers in the Interleukin 1 Family in Relation to Inflammatory Bowel Disease’, Gut, 43(1), 33–39.
- Tan, K., Shi, L., and Yu, Z. (2020), ‘Sparse SIR: Optimal Rates and Adaptive Estimation’, The Annals of Statistics, 48(1), 64–85. https://doi.org/10.1214/18-AOS1791.
- Tan, K.M., Wang, Z., Zhang, T., Liu, H., and Cook, R.D. (2018), ‘A Convex Formulation for High-dimensional Sparse Sliced Inverse Regression’, Biometrika, 105(4), 769–782.
- Wold, S., Esbensen, K., and Geladi, P. (1987), ‘Principal Component Analysis’, Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52.
- Wu, C., Miller, J., Chang, Y., Sznaier, M., and Dy, J. (2019), ‘Solving Interpretable Kernel Dimensionality Reduction’, in Advances in Neural Information Processing Systems, pp. 7915–7925.
- Ying, C., and Yu, Z. (2022), ‘Fréchet Sufficient Dimension Reduction for Random Objects’, Biometrika, 109(4), 975–992.
- Yu, X., Yao, J., and Xue, L. (2022), ‘Nonparametric Estimation and Conformal Inference of the Sufficient Forecasting with a Diverging Number of Factors’, Journal of Business & Economic Statistics, 40(1), 342–354.
- Zhang, C.-H. (2010), ‘Nearly Unbiased Variable Selection Under Minimax Concave Penalty’, The Annals of Statistics, 38(2), 894–942.
- Zhang, Q., Li, B., and Xue, L. (2024), ‘Nonlinear Sufficient Dimension Reduction for Distribution-on-distribution Regression’, Journal of Multivariate Analysis, 202, 105302.
- Zhang, Q., Xue, L., and Li, B. (2024), ‘Dimension Reduction for Fréchet Regression’, Journal of the American Statistical Association, in press.
- Zou, H. (2006), ‘The Adaptive Lasso and Its Oracle Properties’, Journal of the American Statistical Association, 101(476), 1418–1429.
- Zou, H., Hastie, T., and Tibshirani, R. (2006), ‘Sparse Principal Component Analysis’, Journal of Computational and Graphical Statistics, 15(2), 265–286.
- Zou, H., and Xue, L. (2018), ‘A Selective Overview of Sparse Principal Component Analysis’, Proceedings of the IEEE, 106(8), 1311–1320.