Abstract
The sufficient dimension reduction (SDR) with sparsity has received much attention for analysing high-dimensional data. We study a nonparametric sparse kernel sufficient dimension reduction (KSDR) based on the reproducing kernel Hilbert space, which extends the methodology of sparse SDR based on inverse moment methods. We establish the statistical consistency and efficient estimation of the sparse KSDR under the high-dimensional setting where the dimension diverges as the sample size increases. Computationally, we introduce a new nonconvex alternating direction method of multipliers (ADMM) to solve the challenging sparse SDR and propose a nonconvex linearised ADMM to solve the more challenging sparse KSDR. We study the computational guarantees of the proposed ADMMs and show an explicit iteration complexity bound to reach the stationary solution. We demonstrate the finite-sample properties in simulation studies and a real application.
1. Introduction
Consider the sufficient dimension reduction (SDR) of a high-dimensional random vector in the presence of response . We aim to find the low-dimensional central subspace such that and , where the dimension p may diverge as the sample size increases, and is the projection of a vector onto the subspace . The inverse moment methods have been extensively studied under the fixed-dimension setting, including the sliced inverse regression (SIR) (Li 1991), sliced average variance estimation (SAVE) (Cook and Weisberg 1991), and directional regression (DR) (Li and Wang 2007). Fan, Xue, and Yao (2017), Luo, Xue, Yao, and Yu (2022), and Yu, Yao, and Xue (2022) extended these inverse moment methods and introduced sufficient forecasting using factor models for high-dimensional predictors when both . Ying and Yu (2022), Zhang, Xue, and Li (2024), and Zhang, Li, and Xue (2024) studied sufficient dimension reduction for Fréchet regression and distribution-on-distribution regression. Lin, Zhao, and Liu (2018, 2019) introduced the Lasso-SIR under the high-dimensional setting when , and the diagonal thresholding SIR (DT-SIR), which employs a new diagonal thresholding screening under the ultra-high-dimensional setting when . However, inverse moment methods require the restrictive linearity condition on the conditional expectation of given . To remove the linearity condition, the kernel sufficient dimension reduction (KSDR) was proposed by Fukumizu, Bach, and Jordan (2009). The nonconvex KSDR requires neither the linearity condition nor an elliptical distribution of , but it poses a significant computational challenge. Wu, Miller, Chang, Sznaier, and Dy (2019) solved this problem using an iterative spectral method, but such a method is hard to extend when a sparsity assumption is imposed in high-dimensional settings.
Shi, Huang, Feng, and Suykens (2019) studied a nonconvex penalty in a kernel regression problem. Although neither its theoretical analysis nor its optimisation algorithm applies in SDR settings, it shows the promise of nonconvex penalised KSDR in high-dimensional settings.
The sparse estimation of the central subspace has received considerable attention in the past decade. The main focus of the literature is on sparse SDR based on inverse moment methods, which is equivalent to solving for the generalised eigenvectors of the estimated kernel matrix (Li 2007). For example, the kernel matrix for SIR is the conditional covariance matrix of given y. In view of such equivalence, Li (2007) followed the spirit of the sparse principal component analysis (SPCA) (Zou, Hastie, and Tibshirani 2006; Zou and Xue 2018) to propose the sparse SDR using penalisation. Bondell and Li (2009) proposed a shrinkage inverse regression estimation and studied its asymptotic consistency and asymptotic normality. Chen, Zou, and Cook (2010) proposed a coordinate-independent sparse estimation method and established the oracle property when the dimension p is fixed. Lin et al. (2019) established the consistency of the Lasso-SIR by assuming the number of slices also diverges. Neykov, Lin, and Liu (2016) followed the convex semidefinite programming (d'Aspremont, Ghaoui, Jordan, and Lanckriet 2005; Amini and Wainwright 2008) to give a high-dimensional analysis of the sign consistency of the DT-SIR in a single-index model. Tan, Wang, Zhang, Liu, and Cook (2018) proposed a sparse SIR estimator based on a convex formulation and established the Frobenius-norm convergence of the projection subspace. Lin et al. (2018) established the consistency of DT-SIR by assuming a stability condition for . Tan, Shi, and Yu (2020) proved the optimality of the -constrained sparse solution and provided a three-step adaptive estimation to approximate the solution with a near-optimal property. In contrast, sparse estimation for KSDR has not been well studied.
In this paper, we propose a nonconvex estimation scheme for sparse KSDR, motivated by the nonconvex estimation of sparse SDR (Li 2007). The proposed methods target the ideal -constrained estimator and can be efficiently solved with statistical and computational guarantees after approximation. We follow d'Aspremont et al. (2005) to reformulate sparse SDR and sparse KSDR as the -constrained trace maximisation problem. It is worth pointing out that KSDR is also equivalent to solving for the eigenvectors of the nonconvex residual covariance estimator in the RKHS (Fukumizu et al. 2009). The residual covariance estimator in the RKHS plays a role similar to the kernel matrix in inverse moment methods. Thus, sparse SDR solves an -constrained convex M-estimation problem, whereas sparse KSDR solves an -constrained nonconvex M-estimation problem. We provide a high-dimensional analysis of the global solution of the -constrained SDR and prove its asymptotic properties, including both variable selection consistency and the convergence rate. We demonstrate the power of this general theory by obtaining explicit convergence rates for the -constrained SDR under the high-dimensional setting where the dimension p diverges. Furthermore, we also provide a high-dimensional analysis of the global solution of the -constrained KSDR and prove its asymptotic properties, such as consistency in variable selection and estimation, without assuming the linearity condition.
Computationally, the -constrained trace maximisation is NP-hard. There has been some work showing success in adjusting the weights of coefficients adaptively, using convex penalties such as the adaptive lasso (Zou 2006) and adaptive ridge regression (Frommlet and Nuel 2016). However, the performance of adaptive approaches relies heavily on the accuracy of the initial estimation. Seeking a more robust penalised approximation without an adaptive procedure, we use the folded concave penalisation (Fan and Li 2001; Fan, Xue, and Zou 2014) as an alternative to the penalisation in Li (2007), Lin et al. (2019), and Lin et al. (2018). We propose a novel nonconvex alternating direction method of multipliers (ADMM) to solve the folded concave penalised SDR and a nonconvex linearised ADMM to solve the folded concave penalised KSDR. In particular, sparse KSDR has not only a nonconvex loss function but also a folded concave penalty and a positive definite constraint. We provide important convergence guarantees for the proposed nonconvex ADMMs for solving these two challenging nonconvex and nonsmooth problems. We also establish explicit iteration complexity bounds for the proposed nonconvex ADMM and its linearised variant to reach the stationary solution of sparse SDR and sparse KSDR, respectively. To the best of our knowledge, our work is the first in the literature to study the challenging nonconvex and nonsmooth optimisation problem for sparse KSDR.
The rest of this paper is organised as follows. We revisit the methods for sparse SDR in Section 2 and present the proposed methods for sparse KSDR in Section 3. The proposed nonconvex ADMM and its linearised variant are presented in Section 4, and statistical and computational guarantees are presented in Section 5. Section 6 evaluates the numerical performance in several different simulation settings. Section 7 applies sparse KSDR to a real data analysis and closes with a few concluding remarks.
Before proceeding, we define some useful notation. Let , and let be the cardinality of a set . For a matrix , let be the maximum norm, the matrix norm, the spectral norm, the Frobenius norm, and the trace norm, defined as . Let denote the norm in the d-dimensional Hilbert space. Estimating the central subspace is equivalent to finding a low-rank matrix of full column rank such that . We may also denote the central subspace by , the column space of the matrix .
2. Preliminaries
This section first revisits the sparse SDR methods in Li (2007) and then follows d'Aspremont et al. (2005) to reformulate sparse SDR via the trace maximisation problem. Note that SDR aims to estimate the linear combination such that , given independently and identically distributed data . Li (2007) showed that SDR methods based on inverse moments, such as SIR, DR, SAVE, and their variants, essentially solve for the generalised eigenvectors of a certain kernel matrix. When the product of the inverse covariance matrix and the kernel matrix is a normal matrix, the problem is also equivalent to solving for an eigenvector of a certain kernel matrix. More specifically, at the population level, the kernel matrix of SIR is ; the kernel matrix of DR is , where and is the covariance matrix of all predictors ; the kernel matrix of DT-SIR (Lin et al. 2018) is , where is the subset of random variables in after the diagonal thresholding.
Given the finite sample size n, we obtain an empirical version of such kernel matrices by replacing the population covariance matrix with the sample covariance matrix , and all means with empirical means . For example, . Li (2007) showed that these SDR methods are equivalent to solving for the leading eigenvector of the empirical version of such kernel matrices.
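As an illustration of such an empirical kernel matrix, the classical SIR recipe (covariance of slice means of the centred predictors, slicing on the sorted response) can be sketched as follows. This is a hedged sketch of the standard SIR construction, not necessarily the paper's exact estimator; the helper name `sir_kernel_matrix` and the slicing scheme are our own choices.

```python
import numpy as np

def sir_kernel_matrix(X, y, n_slices=5):
    # Sketch of the empirical SIR kernel matrix: the covariance of the
    # slice means of the centred predictors, with slices formed by
    # sorting the response y.
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Xc[idx].mean(axis=0)          # slice mean of centred predictors
        M += (len(idx) / n) * np.outer(m, m)
    return M
```

By construction the estimate is symmetric and positive semidefinite, so its leading eigenvector is well defined.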
After estimating the kernel matrix , we study the sparse estimation of the leading SDR direction, i.e. the leading eigenvector of . We use the well-known fact that solving for the leading eigenvector is equivalent to trace optimisation (Wold, Esbensen, and Geladi 1987) as follows: To impose sparsity, we introduce a constraint in the following trace maximisation to penalise the cardinality of non-zero elements in the vector : Note that . With , it is then equivalent to solving for the positive semidefinite estimate from , where is defined as the element-wise norm. When and , implies , and is equivalent to . However, the rank-one constraint is hard to handle efficiently in practice. Therefore, we consider a relaxation that removes this constraint. Instead, we solve for the dominant eigenvector of as , which is the SDR direction of our interest. In general, this relaxation tends to have a solution very close to a rank-one matrix (d'Aspremont et al. 2005).
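The equivalence between the leading eigenvector and the trace maximisation can be checked numerically; the short sketch below (our own illustration, using NumPy's `eigh`) shows that the trace objective evaluated at the leading eigenvector equals the largest eigenvalue.

```python
import numpy as np

def leading_eigvec(M):
    # The maximiser of tr(M v v^T) over unit vectors v is the leading
    # eigenvector of the symmetric matrix M.
    _, vecs = np.linalg.eigh(M)   # eigenvalues returned in ascending order
    return vecs[:, -1]

M = np.diag([5.0, 1.0, 0.5])      # toy kernel matrix with dominant direction e1
v = leading_eigvec(M)
obj = np.trace(M @ np.outer(v, v))  # attains the largest eigenvalue of M
```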
After removing the rank-one constraint, it is equivalent to solving from the following penalised estimation problem for some sufficiently large γ: (1) where, given , we follow d'Aspremont et al. (2005) to truncate it to obtain the dominant (sparse) eigenvector of , which will be the leading sparse SDR direction of our interest.
In the sequel, given , we solve for the following sparse SDR directions in a sequential fashion. We denote , and for , let be the projection of onto the space orthogonal to the previous SDR directions. To remove the effect of the previous eigenvectors, is estimated as the leading (sparse) eigenvector of the matrix , which is a Hotelling's deflation by the previous vectors: Mackey (2009) studied the property of Hotelling's deflation in this type of sparse eigenvector estimation problem. In particular, when we estimate the sparse eigenvector on the true support, this deflation helps eliminate the influence of a given eigenvector by setting the associated eigenvalue to zero, achieving . For the sparse estimation of , we plug into (1) and then solve from: which is further equivalent to the following constrained problem under the same rank relaxation: (2) Problem (2) is still nonconvex since the constraint is included. To address the computational challenge, one existing relaxation uses a folded concave penalty to approximate the constraint. Let be a folded concave penalty function such as the SCAD (Fan and Li 2001) or MCP (Zhang 2010), where is the matrix obtained by taking absolute values of the elements of .
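Hotelling's deflation step can be sketched in a few lines. This is an illustration of the deflation formula M − (vᵀMv)vvᵀ with our own helper name; it zeroes the eigenvalue associated with a given unit eigenvector so the next leading eigenvector can be extracted.

```python
import numpy as np

def hotelling_deflate(M, v):
    # Eliminate the influence of a unit eigenvector v by setting its
    # associated eigenvalue to zero: M_new = M - (v' M v) v v'.
    lam = v @ M @ v
    return M - lam * np.outer(v, v)

M = np.diag([5.0, 3.0, 1.0])
v1 = np.array([1.0, 0.0, 0.0])   # leading eigenvector of M
M1 = hotelling_deflate(M, v1)    # v1 now has eigenvalue 0 in M1
```

After deflation, the leading eigenvector of `M1` is the second eigenvector of the original matrix.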
Specifically, in this paper, we consider the element-wise SCAD penalty: where a is a constant, usually chosen as 3.7 in the SCAD penalty. In Section 5, we discuss how such penalisation can well approximate the penalisation. For simplicity, we use the SCAD penalty in this paper. It is worth mentioning that other nonconvex penalties and approximations, such as the MCP (Zhang 2010) and coefficient thresholding (Liu, Zhang, Xue, Song, and Kang 2024), can also be applied here to achieve the same theoretical property and similar numerical performance.
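The element-wise SCAD penalty of Fan and Li (2001) can be written as a short vectorised function; the helper name `scad_penalty` is our own, and the three-branch form below is the standard one.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    # Element-wise SCAD penalty (Fan and Li, 2001):
    #   lam*|t|                                if |t| <= lam
    #   (2*a*lam*|t| - t^2 - lam^2)/(2(a-1))   if lam < |t| <= a*lam
    #   lam^2*(a+1)/2                          if |t| > a*lam
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))
```

The penalty is linear near zero (like the lasso), then flattens and saturates at λ²(a+1)/2, which is what makes it a folded concave surrogate for the cardinality constraint.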
With this approximation, we solve from the following folded concave penalised estimation of sparse SDR: (3) and then for , we sequentially solve for , given , from Without loss of generality, this idea of estimating the leading sparse eigenvectors can also be applied to the sparse principal component analysis (PCA) problem.
Theorem 4 will provide the computational guarantee for the above approximation to penalisation. Section 6 will also present the numerical performance.
3. Methodology
The methodology and theory of sparse SDR depend on the linearity condition and the constant variance condition, which usually hold when the predictors follow elliptical distributions; however, these elliptical conditions can be violated in many cases. In this section, following the framework built by Fukumizu et al. (2009) and Li, Chun, and Zhao (2012), we extend the framework to achieve sparse KSDR, which requires neither the linearity nor the constant variance condition. The key idea is to construct a conditional covariance operator in a reproducing kernel Hilbert space (RKHS), which evaluates the amount of information in y that cannot be explained by functions of linear combinations of . We then derive the empirical estimator of this conditional covariance operator with finite samples and finally minimise the trace of this estimator under sparsity constraints.
First, we provide the necessary definitions in the RKHS to construct the conditional covariance operator. Let and be positive definite kernel functions, and let and be the RKHSs generated by and , such that satisfies the reproducing property: and satisfies the corresponding reproducing property in the space . Then define to be the variance operator of , which satisfies the relation for all , and similarly for . Define as the covariance operator satisfying for all and , and let be the adjoint operator of . Based on Baker (1973), there exist unique bounded operators and satisfying: Then, the conditional covariance operator can be defined as: For convenience, we sometimes write it as: where is the pseudo-inverse of . Specifically, is the Tychonoff-regularised inverse of , given some sufficiently small constant .
Also, define the sparse orthogonal space for i=1,…,d, representing the leading d sparse SDR directions. When d = 1, it represents the leading sparse SDR vector.
In what follows, we study the sparse estimation of such that . To simplify notation, we introduce , and , , and are defined in the same manner as , , and . Then we have: As shown in Theorem 4 of Fukumizu et al. (2009), if and are sufficiently rich RKHSs, then for any , we have This conclusion also holds for our sparse orthogonal space, under the sparsity assumption on the SDR vectors.
This result indicates that such a conditional covariance operator is a good estimator for sufficient dimension reduction at the population level. However, the population-level operator is not available in practice. At the sample level, we need to derive the empirical version of this operator in coordinate representation and minimise it. To start, we have the coordinate representation of the kernel matrix as: Let be the centralised Gram matrix, where . Similarly, we define . Following Lemma 1 of Li et al. (2012), under the spanning systems: the variance and covariance operators can be written as: Substituting these coordinate representations into , we derive the coordinate representation of as: where is the Tychonoff-regularised inverse matrix of given some sufficiently small constant . We then minimise the trace of to estimate the sparse SDR directions from .
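The building blocks above can be sketched numerically. The code below is a minimal illustration assuming Gaussian (RBF) kernels: it forms double-centred Gram matrices and evaluates a regularised trace surrogate of the form tr[G_y (G_x + nεI)⁻¹], a common simplification in the kernel dimension reduction literature rather than the paper's exact coordinate representation; the helper names and bandwidth are our own choices.

```python
import numpy as np

def centred_gram(Z, sigma=1.0):
    # RBF Gram matrix, double-centred: G_c = H G H with H = I - 11'/n.
    sq = np.sum(Z ** 2, axis=1)
    G = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma ** 2))
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def kdr_trace(Xb, y, eps=1e-3):
    # Trace surrogate for the conditional covariance operator:
    # smaller values mean the projected predictors Xb explain y better.
    n = y.shape[0]
    Gx = centred_gram(Xb)
    Gy = centred_gram(y.reshape(-1, 1))
    return np.trace(Gy @ np.linalg.inv(Gx + n * eps * np.eye(n)))
```

When the projection keeps the coordinate that drives y, this trace is markedly smaller than under an uninformative projection, which is exactly the signal the KSDR objective exploits.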
To deal with the orthogonality constraint in , we use a sequential procedure similar to that in Section 2. To this end, we solve from the following trace minimisation to minimise the sum of conditional variances: where It is worth pointing out the equivalence between the above trace minimisation problem and the following regularised nonparametric regression problem: This equivalence provides insight into sparse KSDR, and it is also useful for establishing the theoretical properties in Section 5.
In sparse KSDR, we assume y is a function of . In this case, are the true sparse SDR vectors of our interest. We have introduced how to estimate ; here, we introduce how to solve for these non-leading SDR directions sequentially.
Given that we have already estimated the orthogonal sufficient directions for , we sequentially solve for the following KSDR direction from the minimisation of . By the orthogonality of all sufficient directions, . Hence, we only need to solve Then we project onto the space orthogonal to all previous KSDR vectors, and denote the result as the final estimator .
Here, we justify why this projection procedure is correct. Define as the projection of onto the direction orthogonal to all previous k KSDR vectors, such that . By the definition of the conditional covariance operator, we have: Since the direction non-orthogonal to the previous KSDR directions has no effect on reducing the trace, we only need to consider the projection for the sparse estimation of the following KSDR direction . The projection procedure guarantees that .
Finally, we discuss handling the constraint and the rank-one constraint. Recall that is a folded concave penalty. In practice, we solve the following folded concave relaxation of sparse KSDR to approximate the constraint inside , and solve from which can be rewritten as a folded concave penalised estimation problem: Given , we follow Section 2 to truncate it to obtain the first KSDR direction . Next, in the sequential fashion, given for , we solve from where we use the fact that . We then compute the projection and solve for the dominant (sparse) eigenvector as the desired sparse KSDR direction .
4. Nonconvex optimisation
Section 4 first proposes an efficient nonconvex ADMM to solve the nonconvex optimisation of sparse SDR and then presents an efficient nonconvex linearised ADMM to solve the even more challenging nonconvex optimisation of sparse KSDR. Both nonconvex ADMMs enjoy important computational guarantees, which will be presented in Section 5.
4.1. Nonconvex ADMM for sparse SDR
In this section, we propose a novel nonconvex ADMM to solve the challenging nonconvex and nonsmooth optimisation in sparse SDR. The proposed nonconvex ADMM enjoys convergence guarantees to the ‘ϵ-stationary solution’ (see the definition in Theorem 5) with an explicit iteration complexity bound. Convergence guarantees for nonconvex algorithms vary from algorithm to algorithm. The ϵ-stationary solution introduced in Jiang, Lin, Ma, and Zhang (2019) is an applicable notion that depicts convergence in terms of the Lagrangian function. More details about the ϵ-stationary solution and computational guarantees can be found in Theorem 5 of Section 5.1.
To begin, we introduce a new variable and also a slack variable . By adding the equality constraint , (3) is equivalent to: The slack variable plays an essential role in the convergence guarantee of our proposed nonconvex ADMM. Accordingly, the augmented Lagrange function is where is the Lagrange variable and . Given the initial value , we solve iteratively for as follows: where h>0 is chosen to guarantee the convexity of the step, and γ and ρ are chosen to guarantee convergence. Their choices will be discussed in Theorem 5. Note that the step and the step are straightforward. Fortunately, we also have closed-form solutions in both the step and the step. In the step, we know that which can be solved analytically by using the univariate folded concave thresholding rule More specifically, we have In the step, it is easy to see that which can be solved by projection onto the simplex in the Euclidean space (see Algorithm 1 of Ma 2013). To solve , we take the singular value decomposition of as , and analytically solve the projection of onto the simplex in the Euclidean space. Denoting , we have the closed-form solution: Then we obtain the closed-form solution .
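The Euclidean projection onto the probability simplex used in this step has a well-known O(n log n) closed form; the sketch below follows the standard sort-and-threshold construction (as in Algorithm 1 of Ma 2013 and related work), with our own helper name.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}.
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u * ks > css - 1)[0][-1]  # last index kept active
    theta = (css[rho] - 1) / (rho + 1)         # uniform shift
    return np.maximum(v - theta, 0.0)
```

Applied to the eigenvalues from the singular value decomposition, this yields the closed-form update for the corresponding ADMM step.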
In the step, we update by solving the derivative: We now summarise the proposed nonconvex ADMM in Algorithm 1.
4.2. Nonconvex linearised ADMM for sparse KSDR
In this section, we provide a detailed derivation of the nonconvex linearised ADMM algorithm for sparse KSDR. We present the algorithm for solving the leading KSDR vector; the algorithm for the following vectors is similar. To begin, we introduce new variables and and slack variables and . By adding two equality constraints that , (2.7) is equivalent to: The slack variables and are essential for the convergence guarantee of our proposed algorithm. Accordingly, the augmented Lagrange function is where and are the Lagrange variables and .
It is important to point out that introduces additional computational challenges since the step does not lead to a closed-form solution. To this end, we introduce a linearised counterpart of this trace function as After plugging in this linearised term, we denote the new Lagrange function as Given the initial solution , we solve iteratively for as follows: where h>0 is chosen to guarantee the convexity of the step and are chosen to guarantee algorithmic convergence (see Theorem 10). In what follows, we present the closed-form solution for each subproblem.
For the step, with the linearisation, can be solved by taking derivatives: Defining , so that , the derivatives can be obtained by the chain rule of matrix derivatives: which has a complicated closed form. For space considerations, the detailed calculation of the closed-form derivatives and their numerical approximation is deferred to the supplement. Based on , we can obtain the closed-form derivative. By solving the linearisation, we obtain the minimiser: For the step, it is equivalent to solving the subproblem This can be solved efficiently by using the univariate penalised least squares solution: where has been defined in Section 4.1. For the step, we solve the optimisation problem: It can be solved by projection onto the simplex in the Euclidean space; we presented exactly the same algorithm in Section 4.1, so we omit it here. The and steps can be solved straightforwardly: We now summarise the proposed nonconvex linearised ADMM in Algorithm 2:
5. Theoretical properties
In this section, we establish theoretical guarantees for both the sparse SDR and sparse KSDR estimators. For both SDR and KSDR, we first study the statistical guarantees of the -constrained estimator, showing its consistency while allowing p to diverge as . Then, we study how our folded concave penalised estimator approximates the -constrained estimator and prove the computational guarantees of our proposed nonconvex ADMM and its linearised version.
5.1. Asymptotic properties of sparse SDR
We study the statistical properties of the -constrained SDR, such as the consistency and convergence rate, as well as the computational guarantees of the folded concave penalised SDR in this subsection. The -constrained SDR achieves the desired theoretical properties under the high-dimensional setting, while the folded concave penalised SDR is computationally feasible and asymptotically converges to the -constrained SDR. First, we define the true sufficient dimension vectors. For the selected method, d is the smallest integer such that the top d eigenvectors of are for , with for . Then by definition, are the true sufficient vectors. We now introduce assumptions (A1)–(A3).
(A1)
is a linear function of , for any .
(A2)
is degenerate.
(A3)
The true sufficient dimension vectors satisfy for .
Note that (A1), (A2), and (A3) are also called the linearity condition, the constant variance condition, and the sparsity condition, respectively. (A1) and (A2) hold when follows an elliptical distribution. (A3) is a standard assumption in the sparse SDR literature. See Li (1991), Li and Wang (2007), Li (2007), and Luo et al. (2022) for justifications of these conditions.
Let . We first study the convergence rate of the first estimated sparse SDR direction . Theorem 1 derives this estimation bound based on the general estimation bound that .
Theorem 5.1
Suppose (A1)–(A3), , , , and the spectral decomposition for exists. Then we have:
In practice, has been derived for each specific SDR method. Corollary 1 discusses the explicit convergence rate, given the explicit convergence rate for different methods.
Corollary 5.2
Suppose (A1)–(A3), , , covariance matrix , and as , then for sparse SIR and sparse DR, we have:
Let correspond to the stability of the central curve . When , with a proper choice of the number of slices, and when (A3)–(A4) in Lin et al. (2018) are satisfied, for sparse SIR, we have: Further, if , we have the convergence rate:
Let , where . Under the high-dimensional setting that , assuming (A3)–(A6) and (S1) in Lin et al. (2018) and , then for the sparse DT-SIR, with a proper choice of , we have the convergence rate:
where the stability v of the central curve is defined in Definition 1 of Lin et al. (2019). We also provide its complete definition in the supplemental material.
It is worth mentioning that our framework is very flexible. By applying different SDR methods to estimate the matrix, we can derive the convergence rate under different settings. Let be the support set of . We further prove the variable selection consistency in Theorem 2.
Theorem 5.3
Under the assumptions of Theorem 1, further if for some constant , and , then with probability tending to 1 as .
Next, recall that in Section 2, for , the same optimisation procedure allows us to further estimate from . We study the estimation bound and consistency for () in Theorem 3.
Theorem 5.4
Suppose that the conditions in Theorems 1 and 2 are satisfied, , and for . Then for , we have: where and are defined in Section 2.
Now, we provide the computational guarantees of our proposed nonconvex ADMM. First, we show that the folded concave penalised estimation well approximates the penalisation. To this end, we use the rescaled folded concave penalty function defined in Section 2. It is easy to see for any . Thus, converges to in the penalisation. For a matrix , we define the penalisation , which is the sum of the univariate rescaled penalisation over all entries. Here, we reveal the connection between the penalised programming (1) and the scaled folded concave penalised programming (3) in the following theorem.
Theorem 5.5
Suppose we choose to be a sequence such that . For the programming (3), we choose the penalisation function as defined above. Let be the global minimiser of (3) when we choose , and assume is a limit point of . Then is also a global minimiser of (1).
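The pointwise convergence of the rescaled folded concave penalty to the cardinality penalty behind this theorem can be verified numerically. The sketch below uses the SCAD form with our own helper name: dividing the penalty by its saturation value λ²(a+1)/2 yields a function that equals 1 exactly once λ < |t|/a, and equals 0 at t = 0.

```python
def scad_rescaled(t, lam, a=3.7):
    # SCAD penalty divided by its saturation value lam^2*(a+1)/2.
    # As lam -> 0, this converges to the L0 indicator 1{t != 0}.
    t = abs(t)
    if t <= lam:
        p = lam * t
    elif t <= a * lam:
        p = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    else:
        p = lam ** 2 * (a + 1) / 2
    return p / (lam ** 2 * (a + 1) / 2)

# Any fixed t != 0 eventually falls in the saturated region as lam shrinks.
vals = [scad_rescaled(0.5, lam) for lam in (1.0, 0.1, 0.01)]
```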
Since the nonconvex problem (3) has widely varying curvature, it is nontrivial to prove the algorithmic convergence. By taking advantage of the structured nonconvexity, the next theorem shows that our proposed nonconvex ADMM converges to an ϵ-stationary solution with an explicit iteration complexity bound. In the following theorem, we use to represent the vectorisation of a matrix from to a vector .
Theorem 5.6
Let L be the Lipschitz bound of the loss function, and For any , with no more than iterations, our proposed algorithm obtains an ϵ-stationary solution , as defined in Jiang et al. (2019), for any matrix with finite Frobenius norm satisfying:
Remark: When , the constraint exactly implies the KKT condition. Thus, the solution will be a stationary solution. We will show the numerical performance of such a near-stationary solution in the next section.
5.2. Asymptotic properties of sparse KSDR
In this section, we establish the consistency of kernel sparse dimension reduction under milder assumptions. Based on Fukumizu et al. (2009), we make the following assumptions (A4)–(A7).
(A4)
For any bounded continuous function g on , the mapping is continuous on for k=1,2,…,d.
(A5)
For any , let , and be the probability distribution of the random variable on . The Hilbert space is dense in for any .
(A6)
There is a measurable function such that and the Lipschitz condition: holds for all and for any k=1,2,…,d, where the kernel is defined in Section 3. D is a distance compatible with the topology of , and we assume .
(A7)
The true sparse SDR directions satisfy .
Note that (A1) and (A2) are not required for the sparse KSDR, and we rewrite the sparsity assumption (A3) in our defined sparse orthogonal space as (A7). In (A4) to (A6), we introduce some mild regularity conditions in the RKHS, which are well justified in Fukumizu et al. (2009).
Theorem 5.7
Under assumptions (A4) to (A7), suppose further that the true variable set is correlated with only a finite number of variables, the chosen kernel function is continuous and bounded, and the regularisation parameter satisfies: Then the functions and are continuous on , and we have the convergence in probability:
Theorem 6 establishes the consistency of the first KDR vector. The growth rate assumption in the theorem says that the growth rate of p is negatively affected by the sparsity level t: when t is smaller, p can diverge faster as . Because of this uniform convergence, we establish the variable selection consistency in the following Theorem 7, allowing the dimension p to diverge.
Theorem 5.8
Under all conditions in Theorem 6, for , such that as , and , we have:
When variable selection consistency is achieved, we can further establish the estimation consistency of :
Theorem 5.9
If all conditions in Theorem 7 hold, and we assume , such that as , then we denote: and define the corresponding matrix . When is nonempty, for any arbitrary open set U in with the true direction , we have (4) For the following directions, let correspond to the true first k KDR vectors, and Denote , and the true space . When are all nonempty for , for any arbitrary open set U in with , we have (5)
Theorem 8 extends the KSDR consistency results of Fukumizu et al. (2009) to the diverging-p case. Making use of the -constrained parameter space, we obtain both variable selection and estimation consistency under certain conditions.
Following the statistical properties, Theorem 9 shows the asymptotic equivalence of the folded concave constraint and the constraint. Theorem 10 shows that our nonconvex linearised ADMM guarantees convergence to an ϵ-stationary solution.
Similar to Theorem 4, for Theorem 9 we consider , where is defined in Section 2. Taking d = 1, we can also solve the sufficient directions iteratively, as introduced in Section 2.
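The one-direction-at-a-time scheme relies on a deflation step in the spirit of Mackey (Citation2009): once a unit direction v is estimated, the target matrix is projected onto the orthogonal complement of v, so the next solve recovers a new direction orthogonal to it. A minimal sketch (the symmetric matrix M here is a generic stand-in for the paper's target matrix, an assumption for illustration):

```python
import numpy as np

def projection_deflate(M, v):
    """Projection deflation: M <- (I - vv') M (I - vv'), so that v lies
    in the null space of the deflated matrix and subsequent leading
    directions are orthogonal to v (v assumed unit-norm)."""
    P = np.eye(M.shape[0]) - np.outer(v, v)
    return P @ M @ P
```

After deflation, the leading eigenvector of the new matrix is orthogonal to v, which is what makes the iterative extraction of d > 1 directions well defined.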
Theorem 5.10
Suppose we choose to be a sequence such that . For the programme (4.12), we choose the penalisation function as we defined. Let be the global minimiser of (4.12) when we choose , and let be a limit point of . Then is also a global minimiser of (4.11).
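Theorem 9 concerns folded concave penalties, of which the SCAD penalty of Fan and Li (Citation2001) is the canonical example. For quick reference (using the conventional default a = 3.7, our assumption), its value is:

```python
def scad(t, lam, a=3.7):
    """SCAD folded concave penalty (Fan and Li, 2001): lasso-like
    linear part near zero, quadratic transition, then a constant tail
    that stops penalising large coefficients."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2
```

The flat tail is what makes the penalty a surrogate for the constraint: beyond a*lam, every coefficient pays the same fixed price, mimicking a count of non-zeros.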
Parallel to Theorem 5, we show the property for the ϵ-stationary solution of our proposed nonconvex ADMM in Theorem 10.
Theorem 5.11
Let L be the Lipschitz constant of the loss function. Let and . For any , within no more than iterations, our proposed algorithm obtains an ϵ-stationary solution , as defined in Jiang et al. (Citation2019), for any with finite Frobenius norm, satisfying that
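To make the ADMM template concrete, the sketch below solves a simpler convex analogue: the semidefinite relaxation of sparse PCA (d'Aspremont et al. Citation2005; Ma Citation2013), via the standard two-block ADMM alternating an eigenvalue projection onto the spectahedron with a soft-thresholding step. This is a convex stand-in for intuition only, not the paper's nonconvex linearised ADMM for KSDR; the parameter choices (rho, the iteration count) are illustrative assumptions.

```python
import numpy as np

def soft(A, tau):
    """Entrywise soft-thresholding (the l1 proximal operator)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def proj_spectahedron(A):
    """Project a symmetric matrix onto {M >= 0, trace(M) = 1} by
    projecting its eigenvalues onto the probability simplex."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(w) + 1)
    cond = u - (css - 1) / idx > 0
    theta = (css[cond][-1] - 1) / idx[cond][-1]
    return V @ np.diag(np.maximum(w - theta, 0.0)) @ V.T

def sparse_pca_admm(S, lam=0.2, rho=1.0, iters=300):
    """ADMM for  max tr(SM) - lam ||M||_1  over the spectahedron:
    M-step projects, Z-step soft-thresholds, U is the scaled dual."""
    n = S.shape[0]
    Z = np.zeros((n, n))
    U = np.zeros((n, n))
    for _ in range(iters):
        M = proj_spectahedron(Z - U + S / rho)  # linear term folded in
        Z = soft(M + U, lam / rho)              # sparsity step
        U = U + M - Z                           # dual ascent
    return Z
```

On a covariance with a sparse leading component, the returned Z concentrates on the true support; the nonconvex linearised variant in the paper replaces the exact M-step with a linearised update but keeps this alternating structure.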
6. Numerical properties
In this section, we evaluate the performance of the sparse SIR/DR/KSDR estimators in our semidefinite relaxation framework, the DT-SIR estimator proposed by Lin et al. (Citation2019), penalised linear regression with the SCAD penalty, and marginal correlation screening.
We compare the numerical performance of these methods in the following four data-generating models, each with sample size n = 100.
Model 1: , where , , and for Model 1a–1c.
Model 2: , where , , and for Model 2a–2c.
Model 3: , where , , and for Model 3a–3c.
Model 4: , where for Model 4a–4c, , and is uniformly distributed on .
To compare the numerical performance of these methods, we replicate the data generation procedure 100 times and apply every method to each of the 100 replications. In each replication, we compute the estimated sparse SDR vectors for each method. We then evaluate the performance of the different methods using the multiple , the True Positive Rate (TPR), and the True Negative Rate (TNR). The multiple is defined as . The TPR is defined as the fraction of variables correctly specified as non-zero in at least one of the SDR vectors, and the TNR is the fraction of variables correctly specified as zero in all SDR vectors.
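The two rates can be computed directly from the estimated directions. A small helper, assuming the estimated SDR vectors are stacked as columns of `B_hat` and the true support is a set of row indices (the names and tolerance are ours):

```python
import numpy as np

def tpr_tnr(B_hat, true_support, tol=1e-8):
    """TPR: fraction of truly active variables that are non-zero in at
    least one estimated SDR vector; TNR: fraction of truly inactive
    variables that are zero in all estimated SDR vectors."""
    p = B_hat.shape[0]
    selected = np.any(np.abs(B_hat) > tol, axis=1)  # active in >= 1 vector
    truth = np.zeros(p, dtype=bool)
    truth[list(true_support)] = True
    tpr = selected[truth].mean() if truth.any() else 1.0
    tnr = (~selected[~truth]).mean() if (~truth).any() else 1.0
    return tpr, tnr
```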
The simulation results are summarised in the following tables, where we report the mean and standard deviation of these three criteria for sparse SIR/DR/KSDR, DT-SIR, and SCAD penalised regression. The marginal correlation method does not yield a multiple , so we report only its true positive/negative rates. The tuning procedure is as follows. For the proposed methods, we choose the penalisation parameter λ via an oracle tuning procedure. Specifically, we construct two independent datasets with the same sample size n = 100 for training and tuning, respectively. For each λ, we estimate the parameter on the training dataset and evaluate the corresponding loss function on the tuning dataset. We then choose the penalisation parameter that minimises the loss on the tuning dataset. In practice, one may use cross-validation to select the penalisation parameter.
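The oracle tuning loop itself is straightforward. It is sketched below with ridge regression standing in for the actual sparse SDR fit (the estimator, loss, and grid are illustrative assumptions, not the paper's objective):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate, a stand-in for the sparse SDR solver."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def tune_loss(beta, X, y):
    """Loss evaluated on the independent tuning set."""
    return np.mean((y - X @ beta) ** 2)

def oracle_tune(X_tr, y_tr, X_tu, y_tu, grid):
    """Fit on the training set for each lambda; pick the lambda whose
    estimate minimises the loss on the independent tuning set."""
    losses = [tune_loss(ridge_fit(X_tr, y_tr, lam), X_tu, y_tu) for lam in grid]
    return grid[int(np.argmin(losses))]
```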
For implementation, sparse SIR/DR/KSDR use our proposed nonconvex ADMM, DT-SIR uses the R package 'LassoSIR', and SCAD penalised regression uses the R package 'ncvreg'. For the marginal correlation method, we compute the correlation between y and each variable , rank the absolute correlations, and select the first d variables, where d is the true dimension.
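The marginal correlation screener used as a baseline amounts to ranking absolute correlations; a sketch (the function name is ours):

```python
import numpy as np

def marginal_screen(X, y, d):
    """Rank variables by |corr(x_j, y)| and return the indices of the
    top d, where d is the assumed true structural dimension."""
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (xc.T @ yc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:d]
```

As the simulation results below illustrate, this screener can only detect variables with non-zero marginal correlation, which is why it fails on purely non-linear signals.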
We also compare the performance of the kernel dimension reduction of Fukumizu et al. (Citation2009) in Model 4(a, b, c) as a benchmark. The average are 0.642, 0.546, and 0.513 for Models 4(a), 4(b), and 4(c), respectively. These results are comparable with the sparse KSDR results; sparse KSDR performs slightly better because the penalisation eliminates some noise from irrelevant variables. Since kernel dimension reduction does not provide a sparse solution, and hence no meaningful TPR or TNR, we do not include it in the main tables.
We draw the following conclusions from the results: (1) sparse SIR and DT-SIR are comparable in most cases, while sparse SIR generally outperforms DT-SIR in . In Models 1 and 3, when the assumption for SIR holds, our sparse SIR is almost perfect. DT-SIR also performs well in these two settings, but sometimes includes many irrelevant variables, leading to unsatisfactory TNR and . (2) Sparse DR is also comparable with SIR in general. Although it may not outperform SIR in Models 1 and 3, when the function is symmetric over in Model 2, SIR fails to recover the true space, while sparse DR remains very stable. (3) Sparse KSDR performs best in Model 4, where the linearity conditions (A1) and (A2) do not hold and both SIR and DR lose consistency; it also performs fairly well compared to SIR and DR in Models 1–3, whether or not the function is symmetric over . (4) Penalised regression fails in Models 2 and 4, where the relationship between and y cannot be approximated by linear regression; moreover, regression cannot identify the linear combination , which is also of interest. (5) Similarly, the marginal correlation method fails in Model 2, where the relationship between and y is completely non-linear and the population correlation is zero; furthermore, correlation-based methods cannot identify the linear combination of interest either.
7. Application to microbiome data analysis
In this section, we apply our method to analyse host transcriptomic data and 16S rRNA gene sequencing data from paired biopsies of IPAA patients with UC and familial adenomatous polyposis (Morgan et al. Citation2015). Morgan et al. (Citation2015) extracted 5 principal components (PCs) of the microbiome 16S rRNA gene sequencing data and analysed the relationship between these PCs and the PCs of the host transcriptomic data. However, the result is hard to interpret, since the PCs are linear combinations of all host transcripts. We therefore apply our proposed sparse KSDR to explore more interpretable sparse linear combinations that are relevant to these microbiome PCs.
First, we select 20 representative genes of the host transcriptomic data from two groups. The first group contains the genes that differ most between the two biopsy locations (PPI and pouch) of a patient according to the Kolmogorov–Smirnov test. The second group contains genes highlighted in Morgan et al. (Citation2015) as important to the pathogenesis of inflammatory bowel disease (IBD). We treat each of the 9 principal components of the microbiome 16S rRNA gene sequencing data as y in our model separately, and apply our sparse kernel dimension reduction estimation to the host transcriptomic data of the PPI for each patient, with sample size n = 196 and p = 20. The selected genes and their coefficients are summarised in the following table.
In the results, the leading PCs (PC1–PC5) are mainly correlated with protein-related genes such as TLR1 and MMEL1, which suggests that these PCs may be mainly composed of an antibiotic-signature microbiome. Several genes widely studied as important to the pathogenesis of IBD are also selected. For example, IL10 is selected for PC1, MUC6 for PC2, 3, 4, and 5, and IL1RN for PC1, 3, and 4. Among them, defective IL10 signalling is known to define a subgroup of patients with IBD (Begue et al. Citation2011); MUC6 may play a role in epithelial wound healing after mucosal injury in IBD, in addition to mucosal protection (Buisine et al. Citation2001); and IL1RN has a well-established pathological role in IBD (Stokkers et al. Citation1998). Our results suggest that the effects of these genes may be actively related to the microbiome environment of IBD patients.
8. Conclusion
Motivated by the desirable properties of sparse SDR, we propose a nonconvex estimation scheme for sparse KSDR. On the statistical side, we prove the asymptotic consistency of both estimators. On the optimisation side, we show that both can be solved efficiently by nonconvex ADMM algorithms with provable convergence guarantees. We also demonstrate the practical usefulness of the proposed methods in various simulation settings and a real data analysis.
Supplemental Material
Acknowledgments
The authors are grateful for the insightful comments and suggestions from the editor and reviewers.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Amini, A.A., and Wainwright, M.J. (2008), ‘High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components’, in Information Theory, 2008. ISIT 2008. IEEE International Symposium on, IEEE, pp. 2454–2458.
- Baker, C.R. (1973), ‘Joint Measures and Cross-Covariance Operators’, Transactions of the American Mathematical Society, 186, 273–289.
- Begue, B., Verdier, J., Rieux-Laucat, F., Goulet, O., Morali, A., Canioni, D., Hugot, J.-P., Daussy, C., Verkarre, V., Pigneur, B., and Fischer, A. (2011), ‘Defective Il10 Signaling Defining a Subgroup of Patients with Inflammatory Bowel Disease’, Official Journal of the American College of Gastroenterology ACG, 106(8), 1544–1555.
- Bondell, H.D., and Li, L. (2009), ‘Shrinkage Inverse Regression Estimation for Model-free Variable Selection’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(1), 287–299.
- Buisine, M., Desreumaux, P., Leteurtre, E., Copin, M., Colombel, J., Porchet, N., and Aubert, J. (2001), ‘Mucin Gene Expression in Intestinal Epithelial Cells in Crohn's Disease’, Gut, 49(4), 544–551.
- Chen, X., Zou, C., and Cook, R.D. (2010), ‘Coordinate-independent Sparse Sufficient Dimension Reduction and Variable Selection’, The Annals of Statistics, 38(6), 3696–3723.
- Cook, R.D., and Weisberg, S. (1991), ‘Comment’, Journal of the American Statistical Association, 86(414), 328–332.
- d'Aspremont, A., Ghaoui, L.E., Jordan, M.I., and Lanckriet, G.R. (2005), ‘A Direct Formulation for Sparse PCA Using Semidefinite Programming’, in Advances in Neural Information Processing Systems, pp. 41–48.
- Fan, J., and Li, R. (2001), ‘Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties’, Journal of the American Statistical Association, 96(456), 1348–1360.
- Fan, J., Xue, L., and Yao, J. (2017), ‘Sufficient Forecasting Using Factor Models’, Journal of Econometrics, 201(2), 292–306.
- Fan, J., Xue, L., and Zou, H. (2014), ‘Strong Oracle Optimality of Folded Concave Penalized Estimation’, The Annals of Statistics, 42(3), 819–849.
- Frommlet, F., and Nuel, G. (2016), ‘An Adaptive Ridge Procedure for L0 Regularization’, PloS One, 11(2), e0148620.
- Fukumizu, K., Bach, F.R., and Jordan, M.I. (2009), ‘Kernel Dimension Reduction in Regression’, The Annals of Statistics, 37(4), 1871–1905.
- Jiang, B., Lin, T., Ma, S., and Zhang, S. (2019), ‘Structured Nonconvex and Nonsmooth Optimization: Algorithms and Iteration Complexity Analysis’, Computational Optimization and Applications, 72(1), 115–157.
- Li, K.-C. (1991), ‘Sliced Inverse Regression for Dimension Reduction’, Journal of the American Statistical Association, 86(414), 316–327.
- Li, L. (2007), ‘Sparse Sufficient Dimension Reduction’, Biometrika, 94(3), 603–613.
- Li, B., Chun, H., and Zhao, H. (2012), ‘Sparse Estimation of Conditional Graphical Models with Application to Gene Networks’, Journal of the American Statistical Association, 107(497), 152–167.
- Li, B., and Wang, S. (2007), ‘On Directional Regression for Dimension Reduction’, Journal of the American Statistical Association, 102(479), 997–1008.
- Lin, Q., Zhao, Z., and Liu, J.S. (2018), ‘On Consistency and Sparsity for Sliced Inverse Regression in High Dimensions’, The Annals of Statistics, 46(2), 580–610.
- Lin, Q., Zhao, Z., and Liu, J.S. (2019), ‘Sparse Sliced Inverse Regression Via Lasso’, Journal of the American Statistical Association, 114(528), 1726–1739.
- Liu, B., Zhang, Q., Xue, L., Song, P.X.-K., and Kang, J. (2024), ‘Robust High-dimensional Regression with Coefficient Thresholding and Its Application to Imaging Data Analysis’, Journal of the American Statistical Association, 119, 715–729.
- Luo, W., Xue, L., Yao, J., and Yu, X. (2022), ‘Inverse Moment Methods for Sufficient Forecasting Using High-dimensional Predictors’, Biometrika, 109(2), 473–487.
- Ma, S. (2013), ‘Alternating Direction Method of Multipliers for Sparse Principal Component Analysis’, Journal of the Operations Research Society of China, 1(2), 253–274.
- Mackey, L.W. (2009), ‘Deflation Methods for Sparse PCA’, in Advances in Neural Information Processing Systems, pp. 1017–1024.
- Morgan, X.C., Kabakchiev, B., Waldron, L., Tyler, A.D., Tickle, T.L., Milgrom, R., Stempak, J.M., Gevers, D., Xavier, R.J., and Silverberg, M.S. (2015), ‘Associations Between Host Gene Expression, the Mucosal Microbiome, and Clinical Outcome in the Pelvic Pouch of Patients with Inflammatory Bowel Disease’, Genome Biology, 16(1), 67.
- Neykov, M., Lin, Q., and Liu, J.S. (2016), ‘Signed Support Recovery for Single Index Models in High-dimensions’, Annals of Mathematical Sciences and Applications, 1(2), 379–426.
- Shi, L., Huang, X., Feng, Y., and Suykens, J. (2019), ‘Sparse Kernel Regression with Coefficient-based Lq-regularization’, Journal of Machine Learning Research, 20(116), 1–44.
- Stokkers, P., Van Aken, B., Basoski, N., Reitsma, P., Tytgat, G., and Van Deventer, S. (1998), ‘Five Genetic Markers in the Interleukin 1 Family in Relation to Inflammatory Bowel Disease’, Gut, 43(1), 33–39.
- Tan, K., Shi, L., and Yu, Z. (2020), ‘Sparse SIR: Optimal Rates and Adaptive Estimation’, The Annals of Statistics, 48(1), 64–85. https://doi.org/10.1214/18-AOS1791.
- Tan, K.M., Wang, Z., Zhang, T., Liu, H., and Cook, R.D. (2018), ‘A Convex Formulation for High-dimensional Sparse Sliced Inverse Regression’, Biometrika, 105(4), 769–782.
- Wold, S., Esbensen, K., and Geladi, P. (1987), ‘Principal Component Analysis’, Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52.
- Wu, C., Miller, J., Chang, Y., Sznaier, M., and Dy, J. (2019), ‘Solving Interpretable Kernel Dimensionality Reduction’, in Advances in Neural Information Processing Systems, pp. 7915–7925.
- Ying, C., and Yu, Z. (2022), ‘Fréchet Sufficient Dimension Reduction for Random Objects’, Biometrika, 109(4), 975–992.
- Yu, X., Yao, J., and Xue, L. (2022), ‘Nonparametric Estimation and Conformal Inference of the Sufficient Forecasting with a Diverging Number of Factors’, Journal of Business & Economic Statistics, 40(1), 342–354.
- Zhang, C.-H. (2010), ‘Nearly Unbiased Variable Selection Under Minimax Concave Penalty’, The Annals of Statistics, 38(2), 894–942.
- Zhang, Q., Li, B., and Xue, L. (2024), ‘Nonlinear Sufficient Dimension Reduction for Distribution-on-distribution Regression’, Journal of Multivariate Analysis, 202, 105302.
- Zhang, Q., Xue, L., and Li, B. (2024), ‘Dimension Reduction for Fréchet Regression’, Journal of the American Statistical Association, in press.
- Zou, H. (2006), ‘The Adaptive Lasso and Its Oracle Properties’, Journal of the American Statistical Association, 101(476), 1418–1429.
- Zou, H., Hastie, T., and Tibshirani, R. (2006), ‘Sparse Principal Component Analysis’, Journal of Computational and Graphical Statistics, 15(2), 265–286.
- Zou, H., and Xue, L. (2018), ‘A Selective Overview of Sparse Principal Component Analysis’, Proceedings of the IEEE, 106(8), 1311–1320.