
Sparse kernel sufficient dimension reduction

Received 01 Nov 2023, Accepted 13 May 2024, Published online: 06 Jun 2024

Abstract

The sufficient dimension reduction (SDR) with sparsity has received much attention for analysing high-dimensional data. We study a nonparametric sparse kernel sufficient dimension reduction (KSDR) based on the reproducing kernel Hilbert space, which extends the methodology of the sparse SDR based on inverse moment methods. We establish the statistical consistency and efficient estimation of the sparse KSDR under the high-dimensional setting where the dimension diverges as the sample size increases. Computationally, we introduce a new nonconvex alternating direction method of multipliers (ADMM) to solve the challenging sparse SDR and propose a nonconvex linearised ADMM to solve the more challenging sparse KSDR. We study the computational guarantees of the proposed ADMMs and show an explicit iteration complexity bound to reach a stationary solution. We demonstrate the finite-sample properties in simulation studies and a real application.

1. Introduction

Consider the sufficient dimension reduction (SDR) of a high-dimensional random vector $x\in\mathbb{R}^p$ in the presence of a response $y\in\mathbb{R}$. We aim to find the low-dimensional central subspace $\mathcal{S}_{y|x}\subseteq\mathbb{R}^p$ such that $\dim(\mathcal{S}_{y|x})=d\ll p$ and $y\perp\!\!\!\perp x\,|\,P_{\mathcal{S}_{y|x}}(x)$, where the dimension $p$ may diverge as the sample size increases, and $P_{\mathcal{S}_{y|x}}(\cdot)$ is the projection of a vector onto the subspace $\mathcal{S}_{y|x}$. The inverse moment methods are extensively studied under the fixed-dimension setting, including the sliced inverse regression (SIR) (Li 1991), sliced average variance estimation (SAVE) (Cook and Weisberg 1991), and directional regression (DR) (Li and Wang 2007). Fan, Xue, and Yao (2017), Luo, Xue, Yao, and Yu (2022), and Yu, Yao, and Xue (2022) extended these inverse moment methods and introduced sufficient forecasting using factor models for high-dimensional predictors when both $n,p\to\infty$. Ying and Yu (2022), Zhang, Xue, and Li (2024), and Zhang, Li, and Xue (2024) studied sufficient dimension reduction for Fréchet regression and distribution-on-distribution regression. Lin, Zhao, and Liu (2018, 2019) introduced the Lasso-SIR under the high-dimensional setting when $p/n\to0$, and the diagonal thresholding SIR (DT-SIR), which applies a new diagonal thresholding screening under the ultra-high-dimensional setting when $p/n\to\infty$. However, the restrictive linearity condition on the conditional expectation of $x$ given $P_{\mathcal{S}_{y|x}}(x)$ is required for inverse moment methods. To remove the linearity condition, the kernel sufficient dimension reduction (KSDR) was proposed by Fukumizu, Bach, and Jordan (2009). The nonconvex KSDR requires neither the linearity condition nor an elliptical distribution of $x$, but it poses a significant computational challenge. Wu, Miller, Chang, Sznaier, and Dy (2019) solved this problem using an iterative spectral method, but such a method is hard to extend when a sparsity assumption is imposed in high-dimensional settings. Shi, Huang, Feng, and Suykens (2019) studied nonconvex penalties in a kernel regression problem. Although neither their theoretical results nor their optimisation algorithm applies to the SDR setting, their work shows the promise of nonconvex penalised KSDR in high-dimensional settings.

The sparse estimation of the central subspace has received considerable attention in the past decade. The main focus of the literature is on sparse SDR based on inverse moment methods, which is equivalent to solving the generalised eigenvectors of an estimated kernel matrix (Li 2007). For example, the kernel matrix for SIR involves the covariance matrix of the conditional expectation of $x$ given $y$. In view of such equivalence, Li (2007) followed the spirit of sparse principal component analysis (SPCA) (Zou, Hastie, and Tibshirani 2006; Zou and Xue 2018) to propose sparse SDR using $\ell_1$ penalisation. Bondell and Li (2009) proposed a shrinkage inverse regression estimation and studied its asymptotic consistency and asymptotic normality. Chen, Zou, and Cook (2010) proposed a coordinate-independent sparse estimation method and established the oracle property when the dimension $p$ is fixed. Lin et al. (2019) established the consistency of the Lasso-SIR by assuming that the number of slices also diverges. Neykov, Lin, and Liu (2016) followed the convex semidefinite programming approach (d'Aspremont, Ghaoui, Jordan, and Lanckriet 2005; Amini and Wainwright 2008) to give a high-dimensional analysis of the sign consistency of the DT-SIR in a single-index model. Tan, Wang, Zhang, Liu, and Cook (2018) proposed a sparse SIR estimator based on a convex formulation and established the Frobenius-norm convergence of the projection subspace. Lin et al. (2018) established the consistency of DT-SIR by assuming a stability condition for $E(x|y)$. Tan, Shi, and Yu (2020) proved the optimality of the $\ell_0$-constrained sparse solution and provided a three-step adaptive estimation to approximate the $\ell_0$ solution with a near-optimal property. In contrast, the sparse estimation in the KSDR area has not been well studied.

In this paper, we propose a nonconvex estimation scheme for sparse KSDR, motivated by the nonconvex estimation of sparse SDR (Li 2007). The proposed methods target the ideal $\ell_0$-constrained estimator and, after approximation, can be solved efficiently with statistical and computational guarantees. We follow d'Aspremont et al. (2005) to reformulate sparse SDR and sparse KSDR as $\ell_0$-constrained trace maximisation problems. It is worth pointing out that KSDR is also equivalent to solving the eigenvectors of the nonconvex residual covariance estimator in the RKHS (Fukumizu et al. 2009). The residual covariance estimator in the RKHS plays a role similar to that of the kernel matrix in inverse moment methods. Thus, sparse SDR solves an $\ell_0$-constrained convex M-estimation problem, whereas sparse KSDR solves an $\ell_0$-constrained nonconvex M-estimation problem. We provide a high-dimensional analysis of the global solution of the $\ell_0$-constrained SDR and prove its asymptotic properties, including both variable selection consistency and the convergence rate. We demonstrate the power of this general theory by obtaining explicit convergence rates for the $\ell_0$-constrained SDR under the high-dimensional setting where the dimension $p$ diverges. Furthermore, we also provide a high-dimensional analysis of the global solution of the $\ell_0$-constrained KSDR and prove its asymptotic properties, such as consistency in variable selection and estimation, without assuming the linearity condition.

Computationally, the $\ell_0$-constrained trace maximisation is NP-hard. Some work has shown success in adjusting the weights of coefficients adaptively, using convex penalties such as the adaptive lasso (Zou 2006) and adaptive ridge regression (Frommlet and Nuel 2016). The performance of adaptive approaches heavily relies on the accuracy of the initial estimation. Seeking a more robust approximation of the $\ell_0$ penalisation without an adaptive procedure, we use the folded concave penalisation (Fan and Li 2001; Fan, Xue, and Zou 2014) as an alternative to the $\ell_1$ penalisation in Li (2007), Lin et al. (2019), and Lin et al. (2018). We propose a novel nonconvex alternating direction method of multipliers (ADMM) to solve the folded concave penalised SDR and a nonconvex linearised ADMM to solve the folded concave penalised KSDR. In particular, sparse KSDR has not only a nonconvex loss function but also a folded concave penalty and a positive semidefinite constraint. We provide the important convergence guarantees of the proposed nonconvex ADMMs for solving these two challenging nonconvex and nonsmooth problems. We also establish explicit iteration complexity bounds for the proposed nonconvex ADMM and its linearised variant to reach a stationary solution of sparse SDR and sparse KSDR, respectively. To the best of our knowledge, our work is the first in the literature to study the challenging nonconvex and nonsmooth optimisation problem for sparse KSDR.

The rest of this paper is organised as follows. We revisit the methods for sparse SDR in Section 2 and then present the proposed methods for sparse KSDR in Section 3. The proposed nonconvex ADMM and its linearised variant are presented in Section 4, and the statistical and computational guarantees are presented in Section 5. Section 6 evaluates the numerical performance in several different simulation settings. Section 7 applies sparse KSDR in a real data analysis. Section 8 includes a few concluding remarks.

Before proceeding, we define some useful notation. Let $\|\theta\|_0=\operatorname{card}\{\theta_j\neq0,\ j=1,\dots,d\}$, and let $|S|$ be the cardinality of a set $S$. For a matrix $A$, let $\|A\|_\infty$ be the maximum norm, $\|A\|_1$ the matrix $\ell_1$ norm, $\|A\|_2$ the spectral norm, $\|A\|_F$ the Frobenius norm, and $\|A\|_{\mathrm{tr}}$ the trace norm, defined as $\|A\|_{\mathrm{tr}}=\operatorname{Tr}[(A^TA)^{1/2}]$. Let $\|\cdot\|_{\mathcal{H}^d}$ denote the norm in the $d$-dimensional Hilbert space. To estimate the central subspace, it is equivalent to finding a low-rank matrix $\Theta\in\mathbb{R}^{p\times d}$ of full column rank such that $y\perp\!\!\!\perp x\,|\,\Theta^Tx$. We may also denote the central subspace by $\mathcal{S}_{y|x}=\operatorname{span}(\Theta)$, the column space of the matrix $\Theta$.

2. Preliminaries

This section first revisits the sparse SDR methods in Li (2007) and then follows d'Aspremont et al. (2005) to reformulate sparse SDR via a trace maximisation problem. Note that SDR aims to estimate the linear combination $\Theta$ such that $y\perp\!\!\!\perp x\,|\,\Theta^Tx$ given independently and identically distributed data $(x_1,y_1),\dots,(x_n,y_n)$. Li (2007) showed that SDR methods based on inverse moments, such as SIR, DR, SAVE, and their variants, essentially solve the generalised eigenvectors of a certain kernel matrix. When the product of the inverse covariance matrix and the kernel matrix is a normal matrix, the problem is also equivalent to solving the eigenvectors of a certain kernel matrix. More specifically, at the population level, the kernel matrix of SIR is
$$M_{\mathrm{sir}}=\Sigma_x^{-1}\operatorname{cov}[E\{z\,|\,y\}];$$
the kernel matrix of DR is
$$M_{\mathrm{dr}}=2\Sigma_x^{-1}\Big(E[E^2(zz^T|y)]+E^2[E(z|y)E(z^T|y)]+E[E(z^T|y)E(z|y)]\,E[E(z|y)E(z^T|y)]-I_p\Big),$$
where $z=x-E(x)$ and $\Sigma_x$ is the covariance matrix of all predictors $x$; the kernel matrix of DT-SIR (Lin et al. 2018) is
$$M_{\mathrm{DTsir}}=\Sigma_{x_{DT}}^{-1}\operatorname{cov}[E\{(x_{DT}-Ex_{DT})\,|\,y\}],$$
where $x_{DT}$ is the subset of random variables in $x$ after the diagonal thresholding.

Given a finite sample of size $n$, we have an empirical version $\hat M$ of such kernel matrices, obtained by replacing the population covariance matrix $\Sigma_x$ with the sample covariance matrix $\hat\Sigma_x$ and all expectations $E(\cdot)$ with empirical means $E_n(\cdot)$. For example, $\hat M_{\mathrm{sir}}=\hat\Sigma_x^{-1}\widehat{\operatorname{cov}}[E_n\{z\,|\,y\}]$. Li (2007) showed that these SDR methods are equivalent to solving the leading eigenvector of the empirical version of such kernel matrices.
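To fix ideas, the following is a minimal sketch (not the authors' code) of a slicing estimator of $\hat M_{\mathrm{sir}}$ described above; the number of slices and the small ridge added for numerical invertibility are our own illustrative choices.

```python
import numpy as np

def sir_kernel_matrix(X, y, n_slices=10, ridge=1e-6):
    """Slicing estimator of the SIR kernel matrix
    M_sir = Sigma_x^{-1} cov[E(z | y)], where z = x - E(x).

    X: (n, p) predictor matrix, y: (n,) response.
    """
    n, p = X.shape
    Z = X - X.mean(axis=0)                      # centred predictors z
    Sigma_x = Z.T @ Z / n + ridge * np.eye(p)   # sample covariance of x

    # Slice the response into (roughly) equal-sized slices.
    order = np.argsort(y)
    slice_ids = np.array_split(order, n_slices)

    # Weighted covariance of the slice means of z: estimates cov[E(z | y)].
    M_inner = np.zeros((p, p))
    for idx in slice_ids:
        w = len(idx) / n
        m = Z[idx].mean(axis=0)
        M_inner += w * np.outer(m, m)

    return np.linalg.solve(Sigma_x, M_inner)    # Sigma_x^{-1} cov[E(z | y)]
```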

After estimating the kernel matrix $\hat M$, we study the sparse estimation of the leading SDR direction, i.e. the leading eigenvector of $\hat M$. We use the well-known fact that solving for the leading eigenvector $\theta_1$ is equivalent to the trace optimisation (Wold, Esbensen, and Geladi 1987)
$$\max_\theta\ \operatorname{tr}(\theta^T\hat M\theta)\quad\text{s.t.}\quad\|\theta\|_2^2\le1.$$
To impose sparsity, we introduce an $\ell_0$ constraint in the trace maximisation to penalise the number of non-zero elements in the vector $\theta$:
$$\hat\theta_1=\arg\max_\theta\ \operatorname{tr}(\theta^T\hat M\theta)\quad\text{s.t.}\quad\|\theta\|_2^2\le1,\ \|\theta\|_0\le t.$$
Note that $\operatorname{span}(\theta^Tx)=\operatorname{span}(\theta\theta^Tx)$. With $\Theta=\theta\theta^T$, it is then equivalent to solving for the positive semidefinite estimate $\hat\Theta_1$ from
$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\hat M\Theta)\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ 1^T|\Theta|_01\le t^2;\ \Theta\succeq0;\ \operatorname{rank}(\Theta)=1,$$
where $|\Theta|_0$ is defined as the element-wise $\ell_0$ norm. When $\Theta\succeq0$ and $\operatorname{rank}(\Theta)=1$, $\operatorname{tr}(\Theta)=1$ implies $\|\theta\|_2^2=1$, and $1^T|\Theta|_01\le t^2$ is equivalent to $\|\theta\|_0\le t$. However, the rank-one constraint $\operatorname{rank}(\Theta)=1$ is hard to handle efficiently in practice. Therefore, we consider a relaxation that removes this constraint. Instead, we take the dominant eigenvector of $\hat\Theta_1$ as $\hat\theta_1$, which is the SDR direction of interest. Generally, this relaxation tends to have a solution very close to a rank-one matrix (d'Aspremont et al. 2005).

After removing the rank-one constraint, it is equivalent to solving for $\hat\Theta_1$ from the following $\ell_0$ penalised estimation problem for some sufficiently large $\gamma$:
$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\hat M\Theta)-\gamma 1^T|\Theta|_01\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0,\tag{1}$$
where, given $\hat\Theta_1$, we follow d'Aspremont et al. (2005) and truncate it to obtain the dominant (sparse) eigenvector $\hat\theta_1$, which is the leading sparse SDR direction of interest.
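The truncation step mentioned above can be implemented in several ways; the following is a simple sketch (our own illustration, with the hard-threshold level $t$ taken as given) that extracts the dominant eigenvector of $\hat\Theta_1$ and keeps its $t$ largest entries.

```python
import numpy as np

def truncated_leading_direction(Theta_hat, t):
    """Extract a sparse leading SDR direction from the relaxed solution.

    Theta_hat: (p, p) symmetric positive semidefinite matrix.
    t: number of non-zero coordinates to keep.
    """
    eigvals, eigvecs = np.linalg.eigh(Theta_hat)
    theta = eigvecs[:, -1]                       # dominant eigenvector
    keep = np.argsort(np.abs(theta))[-t:]        # indices of the t largest entries
    theta_sparse = np.zeros_like(theta)
    theta_sparse[keep] = theta[keep]
    return theta_sparse / np.linalg.norm(theta_sparse)  # renormalise
```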

In the sequel, given $\hat\theta_1,\dots,\hat\theta_k$, we solve for the next sparse SDR direction $\hat\theta_{k+1}$ in a sequential fashion. We denote $\hat M_1=\hat M$, and for $k\ge1$, let $\hat M_{k+1}$ be the projection of $\hat M_k$ onto the space orthogonal to the previous SDR directions. To remove the effect of the previous eigenvectors, $\hat\theta_{k+1}$ is estimated as the leading (sparse) eigenvector of the matrix $\hat M_{k+1}$, which is obtained by a Hotelling's deflation of the previous vectors:
$$\hat M_{k+1}=\hat M_k-(\hat\theta_k^T\hat M_k\hat\theta_k)\,\hat\theta_k\hat\theta_k^T.$$
Mackey (2009) studied the property of Hotelling's deflation in this type of sparse eigenvector estimation problem. In particular, when we estimate the sparse eigenvector on the true support, this deflation eliminates the influence of a given eigenvector by setting the associated eigenvalue to zero and achieves $\hat\theta_1\perp\hat\theta_2\perp\dots\perp\hat\theta_k$. For the sparse estimation of $\hat\theta_{k+1}$, we plug $\hat M_{k+1}$ into (1) and solve for $\theta_{k+1}$ from
$$\max_{\theta\in\mathbb{R}^p}\ \operatorname{tr}(\theta^T\hat M_{k+1}\theta)\quad\text{s.t.}\quad\|\theta\|_2^2\le1,\ \|\theta\|_0\le t,$$
which, by the same rank relaxation, is further equivalent to the constrained problem
$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\hat M_{k+1}\Theta)-\gamma 1^T|\Theta|_01\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0.\tag{2}$$
Problem (2) is still nonconvex since the $\ell_0$ penalty is included. To address this computational challenge, one existing relaxation uses a folded concave penalty to approximate the $\ell_0$ penalty. Let $P_\lambda(|\Theta|)$ be a folded concave penalty function such as the SCAD (Fan and Li 2001) or MCP (Zhang 2010), where $|\Theta|$ is the matrix obtained by taking the absolute value of $\Theta$ element-wise.
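As an illustration only (not the authors' implementation), the deflation update above can be coded in a few lines; the routine that returns a (sparse) leading eigenvector at each step is passed in as a callable, since any of the estimators discussed in this section could be used there.

```python
import numpy as np

def hotelling_deflation(M_hat, theta_hat):
    """One Hotelling's deflation step:
    M_{k+1} = M_k - (theta_k' M_k theta_k) theta_k theta_k'."""
    lam = float(theta_hat @ M_hat @ theta_hat)
    return M_hat - lam * np.outer(theta_hat, theta_hat)

def sequential_directions(M_hat, solve_leading, d):
    """Estimate d directions sequentially; `solve_leading` is any routine
    returning a (sparse) leading eigenvector of its matrix argument."""
    directions = []
    M_k = M_hat.copy()
    for _ in range(d):
        theta_k = solve_leading(M_k)
        directions.append(theta_k)
        M_k = hotelling_deflation(M_k, theta_k)
    return np.column_stack(directions)
```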

Specifically, in this paper we consider the element-wise rescaled SCAD penalty
$$P_\delta(t)=\begin{cases}\dfrac{2a|t|}{(a+1)\delta}, & |t|\le\dfrac{\delta}{a},\\[6pt] 1-\dfrac{(\delta-|t|)^2}{(1-1/a^2)\delta^2}, & \dfrac{\delta}{a}\le|t|\le\delta,\\[6pt] 1, & |t|\ge\delta,\end{cases}$$
where $a$ is a constant, usually chosen as 3.7 in the SCAD penalty. In Section 5, we discuss how such a penalisation approximates the $\ell_0$ penalisation. We use the SCAD penalty in this paper for simplicity. It is worth mentioning that other nonconvex penalties and approximations, such as the MCP (Zhang 2010) and coefficient thresholding (Liu, Zhang, Xue, Song, and Kang 2024), can also be applied here to achieve the same theoretical properties and similar numerical performance.
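A minimal sketch of the rescaled SCAD penalty written above (our own reconstruction of the piecewise form); it is easy to check numerically that $P_\delta(t)$ approaches the $\ell_0$ indicator $1\{t\neq0\}$ as $\delta\to0$.

```python
import numpy as np

def scad_rescaled(t, delta, a=3.7):
    """Rescaled SCAD penalty P_delta(t) as in Section 2.
    It is 0 at t = 0, increases linearly up to |t| = delta/a,
    bends quadratically on [delta/a, delta], and equals 1 beyond delta."""
    u = np.abs(np.asarray(t, dtype=float))
    out = np.ones_like(u)
    small = u <= delta / a
    mid = (u > delta / a) & (u < delta)
    out[small] = 2 * a * u[small] / ((a + 1) * delta)
    out[mid] = 1 - (delta - u[mid]) ** 2 / ((1 - 1 / a**2) * delta**2)
    return out

# As delta shrinks, P_delta(t) approaches the l0 indicator 1{t != 0}.
print(scad_rescaled([0.0, 0.3, 2.0], delta=0.5))
print(scad_rescaled([0.0, 0.3, 2.0], delta=0.01))
```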

With this approximation, we solve for $\hat\Theta_1$ from the following folded concave penalised estimation of sparse SDR:
$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\hat M\Theta)-\lambda P_\delta(|\Theta|)\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0,\tag{3}$$
and then, for $k=1,2,\dots,d-1$, sequentially solve for $\hat\Theta_{k+1}$, given $\hat M_{k+1}$, from
$$\max_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\hat M_{k+1}\Theta)-\lambda P_\delta(|\Theta|)\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0.$$
Without loss of generality, this idea of estimating the leading sparse eigenvectors can also be applied to the sparse principal component analysis (PCA) problem.

Theorem 4 will provide the computational guarantee for the above approximation to the $\ell_0$ penalisation. Section 6 will also present the numerical performance.

3. Methodology

The methodology and theory of sparse SDR depend on the linearity condition and the constant variance condition, which usually hold when the predictors follow elliptical distributions; however, the elliptical conditions can be violated in many cases. In this section, following the framework built by Fukumizu et al. (2009) and Li, Chun, and Zhao (2012), we extend the framework to achieve sparse KSDR, which requires neither the linearity nor the constant variance condition. The key idea is to construct a conditional covariance operator in a reproducing kernel Hilbert space (RKHS), which measures the amount of information in $y$ that cannot be explained by functions of linear combinations of $x$. We then derive the empirical estimator of this conditional covariance operator from finite samples and finally minimise the trace of the estimator subject to sparsity constraints.

First, we provide the necessary definitions in the RKHS to construct the conditional covariance operator. Let $k_x:\Omega_x\times\Omega_x\to\mathbb{R}$ and $k_y:\Omega_y\times\Omega_y\to\mathbb{R}$ be positive definite kernel functions, and let $\mathcal{H}_x$ and $\mathcal{H}_y$ be the RKHSs generated by $k_x$ and $k_y$, such that $\mathcal{H}_x$ satisfies the reproducing property
$$\forall x\in\Omega_x,\ k_x(\cdot,x)\in\mathcal{H}_x;\qquad\forall x\in\Omega_x,\ \forall f\in\mathcal{H}_x,\ \langle f,k_x(\cdot,x)\rangle_{\mathcal{H}_x}=f(x),$$
and $\mathcal{H}_y$ satisfies the reproducing property accordingly. Define $\Sigma_{xx}:\mathcal{H}_x\to\mathcal{H}_x$ to be the variance operator of $x$, which satisfies $\langle f,\Sigma_{xx}g\rangle_{\mathcal{H}_x}=\operatorname{cov}(f(x),g(x))$ for all $f,g\in\mathcal{H}_x$, and define $\Sigma_{yy}$ analogously. Define $\Sigma_{yx}:\mathcal{H}_x\to\mathcal{H}_y$ as the covariance operator satisfying $\langle g,\Sigma_{yx}f\rangle_{\mathcal{H}_y}=\operatorname{cov}(g(y),f(x))$ for all $g\in\mathcal{H}_y$ and $f\in\mathcal{H}_x$, and let $\Sigma_{xy}=\Sigma_{yx}^*$ be the adjoint operator of $\Sigma_{yx}$. Based on Baker (1973), there exist unique bounded operators $V_{yx}$ and $V_{xy}$ satisfying
$$\Sigma_{yx}=\Sigma_{yy}^{1/2}V_{yx}\Sigma_{xx}^{1/2};\qquad\Sigma_{xy}=\Sigma_{xx}^{1/2}V_{xy}\Sigma_{yy}^{1/2}.$$
Then the conditional covariance operator can be defined as
$$\Sigma_{yy|x}=\Sigma_{yy}-\Sigma_{yy}^{1/2}V_{yx}V_{xy}\Sigma_{yy}^{1/2}.$$
For convenience, we sometimes write it as $\Sigma_{yy|x}=\Sigma_{yy}-\Sigma_{yx}\Sigma_{xx}^{-}\Sigma_{xy}$, where $\Sigma_{xx}^{-}$ is the pseudo-inverse of $\Sigma_{xx}$. Specifically, $\Sigma_{xx}^{-}=(\Sigma_{xx}+\epsilon_sI)^{-1}$ is the Tychonoff regularised inverse of $\Sigma_{xx}$, given some sufficiently small constant $\epsilon_s$.

Also, define the sparse orthogonal space $\mathcal{S}_d^p(\mathbb{R})=\{\Theta\in\mathbb{M}(p\times d;\mathbb{R})\mid\Theta^T\Theta=I_d;\ \|\Theta_i\|_0\le k\ \text{for}\ i=1,\dots,d\}$, where $\Theta_i$ denotes the $i$th column of $\Theta$, representing the leading $d$ sparse SDR directions. When $d=1$, it represents the leading sparse SDR vector.

In what follows, we study the sparse estimation of $\Theta\in\mathcal{S}_d^p(\mathbb{R})$ such that $y\perp\!\!\!\perp x\,|\,\Theta^Tx$. To simplify notation, we introduce $S=\Theta^Tx$, and $k_s$, $\Sigma_{ss}$, $\Sigma_{ys}$ and $\Sigma_{sy}$ are defined in the same manner as $k_x$, $\Sigma_{xx}$, $\Sigma_{yx}$ and $\Sigma_{xy}$. Then we have
$$\Sigma_{yy|s}=\Sigma_{yy}-\Sigma_{ys}\Sigma_{ss}^{-}\Sigma_{sy}.$$
As shown in Theorem 4 of Fukumizu et al. (2009), if $\mathcal{H}_x$ and $\mathcal{H}_y$ are rich enough RKHSs, then $y\perp\!\!\!\perp x\,|\,\Theta^Tx\iff\Sigma_{yy|s}=\Sigma_{yy|x}$, and for any $\Theta\in\mathcal{S}_d^p(\mathbb{R})$ we have $\Sigma_{yy|\Theta^Tx}\succeq\Sigma_{yy|x}$. This conclusion holds for our sparse orthogonal space as well, under the sparsity assumption on the SDR vectors.

This result indicates that the conditional covariance operator is a good criterion for sufficient dimension reduction at the population level. However, the population-level operator is not available in practice. At the sample level, we need to derive the empirical version of the operator in coordinate representation and minimise it. To start with, we have the coordinate representation of the kernel matrix
$$K_x=\begin{pmatrix}k_x(x_1,x_1)&\cdots&k_x(x_1,x_n)\\\vdots&\ddots&\vdots\\k_x(x_n,x_1)&\cdots&k_x(x_n,x_n)\end{pmatrix}.$$
Let $G_x=QK_xQ$ be the centralised Gram matrix, where $Q=I_n-1_n1_n^T/n$. Similarly, we define $K_y$, $K_s$, $G_y$, $G_s$. Following Lemma 1 of Li et al. (2012), under the spanning systems
$$\hat{\mathcal{H}}_X=\operatorname{span}\{k_x(\cdot,x_i)-E_nk_x(\cdot,x):\ i=1,\dots,n\},\qquad\hat{\mathcal{H}}_Y=\operatorname{span}\{k_y(\cdot,y_i)-E_nk_y(\cdot,y):\ i=1,\dots,n\},$$
the variance and covariance operators can be written as
$$\hat\Sigma_{xx}=G_x,\quad\hat\Sigma_{yx}=G_x,\quad\hat\Sigma_{xy}=G_y,\quad\hat\Sigma_{yy}=G_y,\quad\hat\Sigma_{ss}=G_s,\quad\hat\Sigma_{sy}=G_y,\quad\hat\Sigma_{ys}=G_s.$$
Placing these coordinate representations into $\Sigma_{yy|s}$, we derive the coordinate representation of $\hat\Sigma_{yy|s}$ as
$$\hat\Sigma_{yy|s}=G_y-G_sG_s^{-}G_y,$$
where $G_s^{-}=(G_s+\epsilon_sI)^{-1}$ is the Tychonoff regularised inverse of $G_s$ given some sufficiently small constant $\epsilon_s$. Then we minimise the trace $\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})$ to estimate the sparse SDR directions over $\Theta\in\mathcal{S}_d^p(\mathbb{R})$.
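As an illustration under our reading of the coordinate representation above (the Gaussian kernels, bandwidths and regularisation constant are our own choices, not prescribed by the paper), the empirical objective $\operatorname{tr}(\hat\Sigma_{yy|\theta^Tx})$ can be evaluated as follows.

```python
import numpy as np

def gaussian_gram(U, bandwidth):
    """Gram matrix of the Gaussian kernel k(u, u') = exp(-||u - u'||^2 / (2 h^2))."""
    U = np.atleast_2d(U)
    sq = np.sum(U**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * U @ U.T
    return np.exp(-D2 / (2 * bandwidth**2))

def centred(K):
    """Centralised Gram matrix G = Q K Q with Q = I - 11'/n."""
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q

def ksdr_objective(theta, X, y, h_s=1.0, h_y=1.0, eps=1e-3):
    """tr(Sigma_hat_{yy|theta'x}) = tr(G_y - G_s (G_s + eps I)^{-1} G_y),
    following the coordinate representation in Section 3."""
    n = X.shape[0]
    s = X @ theta                       # reduced predictor theta'x
    G_s = centred(gaussian_gram(s.reshape(-1, 1), h_s))
    G_y = centred(gaussian_gram(np.asarray(y).reshape(-1, 1), h_y))
    G_s_reg_inv = np.linalg.inv(G_s + eps * np.eye(n))
    return np.trace(G_y - G_s @ G_s_reg_inv @ G_y)
```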

To deal with the orthogonality constraint in $\mathcal{S}_d^p(\mathbb{R})$, we use a sequential procedure similar to that in Section 2. To this end, we solve for $\hat\theta_1$ from the following trace minimisation, which minimises the sum of the conditional variances:
$$\min_{\theta\in\mathcal{SP}_1^p(\mathbb{R})}\ \operatorname{tr}(\hat\Sigma_{yy|\theta^Tx}),$$
where $\mathcal{SP}_1^p(\mathbb{R})=\{\theta\mid\theta\in\mathbb{R}^p,\ \|\theta\|_2=1;\ \|\theta\|_0\le t\}$. It is worth pointing out the equivalence between the above trace minimisation and the following regularised nonparametric regression problem:
$$\min_{\theta\in\mathcal{SP}_1^p(\mathbb{R})}\ \min_{f\in\mathcal{H}_X^S}\ \frac{1}{n}\sum_{i=1}^n\sum_{a=1}^{\infty}\Big|\Big(\xi_a(Y_i)-\frac{1}{n}\sum_{j=1}^n\xi_a(Y_j)\Big)-\Big(f(X_i)-\frac{1}{n}\sum_{j=1}^n f(X_j)\Big)\Big|^2+\frac{\epsilon_n}{n}\|f\|_{\mathcal{H}_X^S}^2.$$
This equivalence provides insight into sparse KSDR, and it is also useful for establishing the theoretical properties in Section 5.

In sparse KSDR, we assume that $y$ is a function of $\theta_1^Tx,\dots,\theta_d^Tx$. In this case, $\theta_1,\dots,\theta_d$ are the true sparse SDR vectors of interest. We have introduced above how to estimate $\theta_1$; here, we introduce how to solve for the non-leading SDR directions sequentially.

Given that we have already estimated the orthogonal sufficient directions $\hat\theta_1,\dots,\hat\theta_k$ for $k\ge1$, we sequentially solve for the next KSDR direction $\theta_{k+1}$ from the minimisation of $\operatorname{tr}(\Sigma_{yy|\hat\theta_1^Tx,\dots,\hat\theta_k^Tx,\theta^Tx})$. By the orthogonality of all sufficient directions, $\operatorname{tr}(\Sigma_{yy|\hat\theta_1^Tx,\dots,\hat\theta_k^Tx,\theta^Tx})=\operatorname{tr}(\Sigma_{yy|(\hat\theta_1\hat\theta_1^T+\dots+\hat\theta_k\hat\theta_k^T+\theta\theta^T)x})$. Hence, we only need to solve
$$\hat\theta_{k+1}^{\mathrm{init}}=\arg\min_{\theta\in\mathcal{SP}_1^p(\mathbb{R})}\ \operatorname{tr}\big(\Sigma_{yy|(\hat\theta_1\hat\theta_1^T+\dots+\hat\theta_k\hat\theta_k^T+\theta\theta^T)x}\big).$$
Then we project $\hat\theta_{k+1}^{\mathrm{init}}$ onto the space orthogonal to all previous KSDR vectors, and denote the result as the final estimator $\hat\theta_{k+1}$.

Here, we justify why this projection procedure is correct. Define $\theta_p=P_{\hat\theta_1,\dots,\hat\theta_k}^{\perp}\theta$ as the projection of $\theta$ onto the direction orthogonal to all previous $k$ KSDR vectors, such that $\operatorname{span}(\hat\theta_1,\dots,\hat\theta_k,\theta)=\operatorname{span}(\hat\theta_1,\dots,\hat\theta_k,\theta_p)$. By the definition of the conditional covariance operator, we have
$$\operatorname{tr}(\Sigma_{yy|\hat\theta_1^Tx,\dots,\hat\theta_k^Tx,\theta^Tx})=\operatorname{tr}(\Sigma_{yy|\hat\theta_1^Tx,\dots,\hat\theta_k^Tx,\theta_p^Tx}).$$
Since the component that is not orthogonal to the previous KSDR directions has no effect in reducing the trace, we only need to consider the projection $\theta_p=P_{\hat\theta_1,\dots,\hat\theta_k}^{\perp}\theta$ for the sparse estimation of the next KSDR direction $\theta_{k+1}$. By the projection procedure, we guarantee that $\hat\theta_1\perp\hat\theta_2\perp\dots\perp\hat\theta_{k+1}$.
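A minimal sketch of the projection step described above (our own illustration): project the initial estimate onto the orthogonal complement of the previously estimated directions and renormalise.

```python
import numpy as np

def project_orthogonal(theta_init, prev_directions):
    """Project theta_init onto the complement of span(prev_directions).

    prev_directions: (p, k) matrix whose columns are the previously
    estimated (orthonormal) KSDR directions.
    """
    V = np.atleast_2d(prev_directions)
    if V.shape[0] != theta_init.shape[0]:
        V = V.T                                   # ensure columns are directions
    theta_p = theta_init - V @ (V.T @ theta_init) # remove components along V
    return theta_p / np.linalg.norm(theta_p)
```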

Finally, we discuss how to handle the $\ell_0$ constraint and the rank-one constraint. Recall that $P_\delta(|\Theta|)$ is the folded concave penalty defined in Section 2. In practice, we solve the following folded concave relaxation of sparse KSDR to approximate the $\ell_0$ constraint inside $\mathcal{SP}_1^p(\mathbb{R})$, and solve for $\hat\Theta_1$ from
$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\Sigma_{yy|\Theta^Tx})\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ P_\delta(|\Theta|)\le t;\ \Theta\succeq0,$$
which can be rewritten as the folded concave penalised estimation problem
$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}(\Sigma_{yy|\Theta^Tx})+\lambda P_\delta(|\Theta|)\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0.$$
Given $\hat\Theta_1$, we follow Section 2 and truncate it to obtain the first KSDR direction $\hat\theta_1$. Next, in the sequential fashion, given $\hat\theta_1,\dots,\hat\theta_k$ for $k\ge1$, we solve for $\hat\Theta_{k+1}$ from
$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ \operatorname{tr}\big(\Sigma_{yy|(\hat\theta_1\hat\theta_1^T+\dots+\hat\theta_k\hat\theta_k^T+\Theta)x}\big)+\lambda P_\delta(|\Theta|)\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0,$$
where we used the fact that $\operatorname{tr}(\Sigma_{yy|\hat\theta_1^Tx,\dots,\hat\theta_k^Tx,\theta^Tx})=\operatorname{tr}(\Sigma_{yy|\hat\theta_1^Tx,\dots,\hat\theta_k^Tx,\theta_p^Tx})$. Finally, we compute the projection $\Theta_{k+1}^{\mathrm{proj}}=P_{\theta_1\dots\theta_k}^{\perp}\hat\Theta_{k+1}$ and take its dominant (sparse) eigenvector as the desired sparse KSDR direction $\hat\theta_{k+1}$.

4. Nonconvex optimisation

Section 4 first proposes an efficient nonconvex ADMM to solve the nonconvex optimisation of sparse SDR and then presents an efficient nonconvex linearised ADMM to solve the even more challenging nonconvex optimisation of sparse KSDR. Both nonconvex ADMMs enjoy the important computational guarantees, which will be presented in Section 5.

4.1. Nonconvex ADMM for sparse SDR

In this section, we propose a novel nonconvex ADMM to solve the challenging nonconvex and nonsmooth optimisation problem in sparse SDR. The proposed nonconvex ADMM enjoys convergence guarantees to an '$\epsilon$-stationary solution' (see the definition in Theorem 5) with an explicit iteration complexity bound. The notion of convergence guarantee for nonconvex algorithms varies from algorithm to algorithm. The $\epsilon$-stationary solution introduced in Jiang, Lin, Ma, and Zhang (2019) is an applicable one, which characterises convergence in terms of the augmented Lagrangian function. More details about the $\epsilon$-stationary solution and the computational guarantees can be found in Theorem 5 of Section 5.1.

To begin with, we introduce a new variable $\Omega$ and a slack variable $\Phi$. By adding the equality constraint $\Theta-\Omega-\Phi=0$, (3) is equivalent to
$$\min_{\Theta\in\mathbb{R}^{p\times p}}\ -\operatorname{tr}(\hat M\Theta)+P_\lambda(|\Theta|)+\frac{\mu}{2}\|\Phi\|_F^2\quad\text{s.t.}\quad\operatorname{tr}(\Omega)=1;\ \Omega\succeq0,\ \Theta-\Omega-\Phi=0.$$
The slack variable $\Phi$ plays an essential role in the convergence guarantee of our proposed nonconvex ADMM. Accordingly, the augmented Lagrangian function is
$$L_\rho(\Theta,\Omega,\Phi,\Lambda)=-\operatorname{tr}(\hat M\Theta)+P_\lambda(|\Theta|)+\frac{\mu}{2}\|\Phi\|_F^2+I\{\operatorname{tr}(\Omega)=1;\ \Omega\succeq0\}+\langle\Lambda,\Theta-\Omega-\Phi\rangle+\frac{\rho}{2}\|\Theta-\Omega-\Phi\|_F^2,$$
where $\Lambda$ is the Lagrange multiplier and $\rho>0$. Given the initial value $(\Omega^0,\Phi^0,\Lambda^0)$, we solve $(\Theta^{q+1},\Omega^{q+1},\Phi^{q+1},\Lambda^{q+1})$ iteratively for $q=0,1,2,\dots$ as follows:
$$\begin{aligned}\Theta^{q+1}&=\arg\min_\Theta\ L_\rho(\Theta,\Omega^q,\Phi^q,\Lambda^q)+\frac{h}{2}\|\Theta^q-\Theta\|_F^2,\\ \Omega^{q+1}&=\arg\min_\Omega\ L_\rho(\Theta^{q+1},\Omega,\Phi^q,\Lambda^q),\\ \Phi^{q+1}&=\Phi^q-\gamma\nabla_{\Phi}L_\rho(\Theta^{q+1},\Omega^{q+1},\Phi^q,\Lambda^q),\\ \Lambda^{q+1}&=\Lambda^q-\rho(\Theta^{q+1}-\Omega^{q+1}),\end{aligned}$$
where $h>0$ is chosen to guarantee the convexity of the $\Theta$ step and $\gamma$ and $\rho$ are chosen to guarantee the convergence; their choices are discussed in Theorem 5. Note that the $\Phi$ step and the $\Lambda$ step are straightforward. Fortunately, we also have closed-form solutions for both the $\Theta$ step and the $\Omega$ step. In the $\Theta$ step, we know that
$$\min_\Theta\ L_\rho(\Theta,\Omega^q,\Phi^q,\Lambda^q)+\frac{h}{2}\|\Theta^q-\Theta\|_F^2=\min_\Theta\ \frac{\rho+h}{2}\Big\|\Theta-\frac{1}{\rho+h}\big(\rho\Omega^q+\rho\Phi^q+h\Theta^q-\Lambda^q+\hat M\big)\Big\|_F^2+P_\lambda(|\Theta|),$$
which can be solved analytically by using the univariate folded concave thresholding rule $S_\lambda(b,c)=\arg\min_u\frac{1}{2}\|u-b\|^2+cP_\lambda(|u|)$. More specifically, we have
$$\Theta^{q+1}=S_\lambda\Big(\frac{1}{\rho+h}\big(\rho\Omega^q+\rho\Phi^q+h\Theta^q-\Lambda^q+\hat M\big),\ \frac{\rho+h}{2}\Big).$$
In the $\Omega$ step, it is easy to see that
$$\min_\Omega\ L_\rho(\Theta^{q+1},\Omega,\Phi^q,\Lambda^q)=\min_{\Omega\succeq0:\ \operatorname{tr}(\Omega)=1}\ \Big\|\Omega+\Phi^q-\Theta^{q+1}-\frac{1}{\rho}\Lambda^q\Big\|_F^2,$$
which can be solved by projection onto the simplex in the Euclidean space (see Algorithm 1 of Ma 2013). To solve for $\Omega^{q+1}$, we take the spectral decomposition of $W^q=\Theta^{q+1}-\Phi^q+\frac{1}{\rho}\Lambda^q$ as $W^q=U\operatorname{diag}(\sigma_1,\dots,\sigma_p)U^T$. We then analytically solve the projection of $\sigma$ onto the simplex in the Euclidean space, denoting the solution by $\hat\xi=(\hat\xi_1,\dots,\hat\xi_p)$:
$$\hat\xi=\arg\min_\xi\ \frac{1}{2}\|\xi-\sigma\|_2^2\quad\text{s.t.}\quad\sum_{j=1}^p\xi_j=1,\ \xi_j\ge0.$$
Then we obtain the closed-form solution $\Omega^{q+1}=U\operatorname{diag}(\hat\xi_1,\dots,\hat\xi_p)U^T$.

In the $\Phi$ step, the gradient is
$$\nabla_\Phi L_\rho(\Theta^{q+1},\Omega^{q+1},\Phi,\Lambda^q)=\mu\Phi-\Lambda^q+\rho(\Phi+\Omega^{q+1}-\Theta^{q+1}).$$
Now, we summarise the proposed nonconvex ADMM in Algorithm 1.
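For concreteness, the following is a self-contained sketch of one iteration of the scheme above, under our reading of the updates. The SCAD thresholding rule below is the standard univariate rule of Fan and Li (2001), used here as a stand-in for $S_\lambda$, and the simplex projection follows the usual sort-based algorithm; both are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def scad_threshold(Z, lam, a=3.7):
    """Univariate SCAD thresholding rule of Fan and Li (2001), applied
    element-wise; used as a stand-in for the rule S_lambda."""
    absZ, sgn = np.abs(Z), np.sign(Z)
    out = np.where(absZ <= 2 * lam, sgn * np.maximum(absZ - lam, 0.0), Z)
    mid = (absZ > 2 * lam) & (absZ <= a * lam)
    out = np.where(mid, ((a - 1) * Z - sgn * a * lam) / (a - 2), out)
    return out

def project_simplex(sigma):
    """Euclidean projection of a vector onto {xi >= 0, sum(xi) = 1}."""
    u = np.sort(sigma)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(u) + 1) > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(sigma - tau, 0.0)

def admm_iteration(M_hat, Theta, Omega, Phi, Lam, lam, rho, h, mu, gamma):
    """One sweep of the Theta / Omega / Phi / Lambda updates of Section 4.1."""
    # Theta step: folded concave thresholding of a weighted average.
    B = (rho * Omega + rho * Phi + h * Theta - Lam + M_hat) / (rho + h)
    Theta_new = scad_threshold(B, lam)

    # Omega step: spectral decomposition + projection onto the simplex.
    W = Theta_new - Phi + Lam / rho
    W = (W + W.T) / 2                       # symmetrise before decomposition
    sig, U = np.linalg.eigh(W)
    Omega_new = U @ np.diag(project_simplex(sig)) @ U.T

    # Phi step: one gradient step on the augmented Lagrangian.
    grad_Phi = mu * Phi - Lam + rho * (Phi + Omega_new - Theta_new)
    Phi_new = Phi - gamma * grad_Phi

    # Dual update.
    Lam_new = Lam - rho * (Theta_new - Omega_new)
    return Theta_new, Omega_new, Phi_new, Lam_new
```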

4.2. Nonconvex linearised ADMM for sparse KSDR

In this section, we provide a detailed derivation of the nonconvex linearised ADMM algorithm for sparse KSDR. We present the algorithm for solving the leading KSDR vector; the algorithm for the subsequent vectors follows similarly. To begin with, we introduce new variables $\Omega_1$ and $\Omega_2$ and slack variables $\Phi_1$ and $\Phi_2$. By adding the two equality constraints $\Theta-\Omega_1-\Phi_1=0$ and $\Theta-\Omega_2-\Phi_2=0$, the folded concave penalised KSDR problem in Section 3 is equivalent to
$$\min_{(\Theta,\Omega_1,\Omega_2,\Phi_1,\Phi_2)}\ \operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})+P_\lambda(|\Omega_1|)+\frac{\mu}{2}\big(\|\Phi_1\|_F^2+\|\Phi_2\|_F^2\big)\quad\text{s.t.}\quad\Theta-\Omega_1-\Phi_1=0,\ \Theta-\Omega_2-\Phi_2=0,\ \operatorname{tr}(\Omega_2)=1,\ \Omega_2\succeq0.$$
The slack variables $\Phi_1$ and $\Phi_2$ are essential for the convergence guarantee of our proposed algorithm. Accordingly, the augmented Lagrangian function is
$$\begin{aligned}L_\rho(\Theta,\Omega_1,\Omega_2,\Phi_1,\Phi_2,\Lambda_1,\Lambda_2)={}&\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})+I\{\operatorname{tr}(\Omega_2)=1;\ \Omega_2\succeq0\}+\frac{\mu}{2}\big(\|\Phi_1\|_F^2+\|\Phi_2\|_F^2\big)+P_\lambda(|\Omega_1|)\\&+\langle\Lambda_1,\Theta-\Omega_1-\Phi_1\rangle+\langle\Lambda_2,\Theta-\Omega_2-\Phi_2\rangle+\frac{\rho}{2}\|\Theta-\Omega_1-\Phi_1\|_F^2+\frac{\rho}{2}\|\Theta-\Omega_2-\Phi_2\|_F^2,\end{aligned}$$
where $\Lambda_1$ and $\Lambda_2$ are the Lagrange multipliers and $\rho>0$.

It is important to point out that $\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})$ introduces additional computational challenges, since the $\Theta$ step no longer has a closed-form solution. To this end, we introduce a linearised counterpart of this trace function,
$$\operatorname{tr}(\hat\Sigma_{yy|\Theta^{qT}x})+\big\langle\Theta-\Theta^q,\ \nabla_\Theta\operatorname{tr}(\hat\Sigma_{yy|\Theta^{qT}x})\big\rangle.$$
After plugging in this linearised term, we denote the new Lagrangian function by $L_\rho(\Theta,\Theta^q,\Omega_1,\Omega_2,\Phi_1,\Phi_2,\Lambda_1,\Lambda_2)$. Given the initial solution $(\Omega_1^0,\Omega_2^0,\Phi_1^0,\Phi_2^0,\Lambda_1^0,\Lambda_2^0)$, we solve $(\Theta^{q+1},\Omega_1^{q+1},\Omega_2^{q+1},\Phi_1^{q+1},\Phi_2^{q+1},\Lambda_1^{q+1},\Lambda_2^{q+1})$ iteratively for $q=0,1,2,\dots$ as follows:
$$\begin{aligned}\Theta^{q+1}&=\arg\min_\Theta\ L_\rho(\Theta,\Omega_1^q,\Omega_2^q,\Phi_1^q,\Phi_2^q,\Lambda_1^q,\Lambda_2^q)+\frac{h}{2}\|\Theta-\Theta^q\|_F^2,\\ \Omega_1^{q+1}&=\arg\min_{\Omega_1}\ L_\rho(\Theta^{q+1},\Omega_1,\Omega_2^q,\Phi_1^q,\Phi_2^q,\Lambda_1^q,\Lambda_2^q),\\ \Omega_2^{q+1}&=\arg\min_{\Omega_2}\ L_\rho(\Theta^{q+1},\Omega_1^{q+1},\Omega_2,\Phi_1^q,\Phi_2^q,\Lambda_1^q,\Lambda_2^q),\\ \Phi_1^{q+1}&=\Phi_1^q-\gamma\nabla_{\Phi_1}L_\rho(\Theta^{q+1},\Omega_1^{q+1},\Omega_2^{q+1},\Phi_1^q,\Phi_2^q,\Lambda_1^q,\Lambda_2^q),\\ \Phi_2^{q+1}&=\Phi_2^q-\gamma\nabla_{\Phi_2}L_\rho(\Theta^{q+1},\Omega_1^{q+1},\Omega_2^{q+1},\Phi_1^q,\Phi_2^q,\Lambda_1^q,\Lambda_2^q),\\ \Lambda_1^{q+1}&=\Lambda_1^q-\rho(\Theta^{q+1}-\Omega_1^{q+1}),\\ \Lambda_2^{q+1}&=\Lambda_2^q-\rho(\Theta^{q+1}-\Omega_2^{q+1}),\end{aligned}$$
where $h>0$ is chosen to guarantee the convexity of the $\Theta$ step and $\gamma,\rho$ are chosen to guarantee the algorithmic convergence (see Theorem 10). In what follows, we present the closed-form solution for each subproblem.

For the $\Theta$ step, with the linearisation, $\Theta^{q+1}$ can be found by setting the derivative to zero:
$$\nabla_\Theta L_\rho(\Theta,\Theta^q,\Omega_1^q,\Omega_2^q,\Phi_1^q,\Phi_2^q,\Lambda_1^q,\Lambda_2^q)+h(\Theta-\Theta^q)=0.$$
Defining $g(\Theta)=G_s(G_s+\epsilon_nI)^{-1}$, the objective $\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})$ can be written in terms of $\operatorname{tr}(g(\Theta)G_y)$, and the derivative can be obtained by the chain rule of matrix derivatives:
$$\nabla_\Theta\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})=\operatorname{vec}(\nabla_\Theta g(\Theta))^T\operatorname{vec}\big(\nabla_{g(\Theta)}\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})\big),$$
which has a complicated closed form. For space considerations, the detailed calculation of the closed-form derivative and its numerical approximation is deferred to the supplement. Based on $\nabla_\Theta\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})$, solving the linearised problem gives the minimiser
$$\Theta^{q+1}=\frac{\rho(\Omega_1^q+\Phi_1^q+\Omega_2^q+\Phi_2^q)-\Lambda_1^q-\Lambda_2^q-\nabla_\Theta\operatorname{tr}(\hat\Sigma_{yy|\Theta^{qT}x})+h\Theta^q}{2\rho+h}.$$
For the $\Omega_1$ step, it is equivalent to solving the subproblem
$$\min_{\Omega_1}\ P_\lambda(|\Omega_1|)+\langle\Lambda_1,\Theta-\Omega_1-\Phi_1\rangle+\frac{\rho}{2}\|\Theta-\Omega_1-\Phi_1\|_F^2=\min_{\Omega_1}\ \Big\|\Omega_1-\Big(\Theta-\Phi_1+\frac{\Lambda_1}{\rho}\Big)\Big\|_F^2+P_\lambda(|\Omega_1|),$$
which can be solved efficiently by the univariate penalised least squares solution
$$\Omega_1^{q+1}=S_\lambda\Big(\Theta-\Phi_1+\frac{\Lambda_1}{\rho},\ 2\Big),$$
where $S_\lambda$ was defined in Section 4.1. For the $\Omega_2$ step, we solve the optimisation problem
$$\Omega_2=\arg\min_{\operatorname{tr}(\Omega_2)=1}\ -\langle\Lambda_2,\Omega_2\rangle+\frac{\rho}{2}\|\Theta-\Omega_2-\Phi_2\|_F^2=\arg\min_{\operatorname{tr}(\Omega_2)=1}\ \frac{\rho}{2}\Big\|\Omega_2+\Phi_2-\Theta-\frac{\Lambda_2}{\rho}\Big\|_F^2,$$
which can be solved by projection onto the simplex in the Euclidean space, exactly as in Section 4.1; thus, we omit the details here. For the $\Phi_1$ and $\Phi_2$ steps, we solve them straightforwardly:
$$\begin{aligned}\Phi_1^{q+1}&=\Phi_1^q-\gamma\big(\mu\Phi_1^q-\Lambda_1^q+\rho(\Phi_1^q+\Omega_1^{q+1}-\Theta^{q+1})\big),\\ \Phi_2^{q+1}&=\Phi_2^q-\gamma\big(\mu\Phi_2^q-\Lambda_2^q+\rho(\Phi_2^q+\Omega_2^{q+1}-\Theta^{q+1})\big).\end{aligned}$$
Now we summarise the proposed nonconvex linearised ADMM in Algorithm 2.
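Since the closed-form derivative is deferred to the supplement, the sketch below (our own illustration) approximates $\nabla_\Theta\operatorname{tr}(\hat\Sigma_{yy|\Theta^Tx})$ by finite differences on the empirical objective, taken here as a generic callable (for instance, the `ksdr_objective` sketch shown in Section 3), and then applies the linearised $\Theta$ update.

```python
import numpy as np

def numerical_gradient(objective, Theta, eps=1e-6):
    """Central finite-difference approximation of grad_Theta objective(Theta);
    a simple stand-in for the closed-form derivative in the supplement."""
    grad = np.zeros_like(Theta)
    for idx in np.ndindex(*Theta.shape):
        E = np.zeros_like(Theta)
        E[idx] = eps
        grad[idx] = (objective(Theta + E) - objective(Theta - E)) / (2 * eps)
    return grad

def linearised_theta_step(objective, Theta_q, Om1, Phi1, Om2, Phi2,
                          Lam1, Lam2, rho, h):
    """Theta update of the linearised ADMM:
    Theta^{q+1} = [rho(Om1 + Phi1 + Om2 + Phi2) - Lam1 - Lam2
                   - grad_Theta tr(Sigma_hat) + h Theta^q] / (2 rho + h)."""
    grad = numerical_gradient(objective, Theta_q)
    return (rho * (Om1 + Phi1 + Om2 + Phi2) - Lam1 - Lam2 - grad
            + h * Theta_q) / (2 * rho + h)
```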

5. Theoretical properties

In this section, we establish theoretical guarantees for both the sparse SDR and the sparse KSDR estimators. For both SDR and KSDR, we first study the statistical guarantees of the $\ell_0$-constrained estimators, showing their consistency while allowing $p$ to diverge as $n\to\infty$. We then study how our folded concave penalised estimators approximate the $\ell_0$-constrained estimators and prove the computational guarantees of our proposed nonconvex ADMM and its linearised variant.

5.1. Asymptotic properties of sparse SDR

In this subsection, we study the statistical properties of the $\ell_0$-constrained SDR, such as the consistency and the convergence rate, together with the computational guarantees of the folded concave penalised SDR. The $\ell_0$-constrained SDR
$$\hat\theta_1=\arg\max_\theta\ \operatorname{tr}(\theta^T\hat M\theta)\quad\text{s.t.}\quad\|\theta\|_2^2\le1,\ \|\theta\|_0\le t$$
achieves the desired theoretical properties under the high-dimensional setting, while the folded concave penalised SDR is computationally feasible and asymptotically converges to the $\ell_0$-constrained SDR. First, we define the true sufficient dimension vectors. For the selected method, $d$ is the smallest integer such that the top $d$ eigenvectors $\theta_1,\dots,\theta_d$ of $M$ satisfy $y\perp\!\!\!\perp x\,|\,\theta_1^Tx,\dots,\theta_d^Tx$. Then, by definition, $\theta_1,\dots,\theta_d$ are the true sufficient vectors. We now introduce assumptions (A1)–(A3).

(A1)

$E(\xi^Tx\,|\,\theta_1^Tx,\dots,\theta_d^Tx)$ is a linear function of $\theta_1^Tx,\dots,\theta_d^Tx$ for any $\xi\in\mathbb{R}^p$.

(A2)

$\operatorname{cov}(x\,|\,\theta_1^Tx,\dots,\theta_d^Tx)$ is degenerate.

(A3)

The true sufficient dimension vectors $\theta_i$ satisfy $\|\theta_i\|_0\le t$ for $i=1,\dots,d$.

Note that (A1), (A2), and (A3) are also called the linearity condition, the constant variance condition, and the sparsity condition, respectively. (A1) and (A2) hold when $x$ follows an elliptical distribution. (A3) is a standard assumption in the sparse SDR literature. See Li (1991), Li and Wang (2007), Li (2007) and Luo et al. (2022) for justifications of these conditions.

Let $\sin(\theta_a,\theta_b)=\sqrt{1-(\theta_a^T\theta_b)^2}$. We first study the convergence rate of the first estimated sparse SDR direction $\hat\theta_1$. Theorem 1 derives this estimation bound based on the general estimation bound $\|\hat M-M\|_2=O_p(\tau(n,p))$.

Theorem 1

Suppose (A1)–(A3), $\|\hat M-M\|_2=O_p(\tau(n,p))$, $\lambda_1>\lambda_2$, $\|\theta_1\|_0\le t$, and that the spectral decomposition of $M$ exists. Then we have $\sin(\hat\theta_1,\theta_1)=O_p(\tau(n,p))$.

In practice, $O_p(\tau(n,p))$ has been derived for each specific SDR method. Corollary 1 presents the explicit convergence rates, given the explicit rate $O_p(\tau(n,p))$ of each method.

Corollary 1

  1. Suppose (A1)–(A3), $\lambda_1>\lambda_2$, $\|\theta_1\|_0\le t$, the covariance matrix $\Sigma_x=I_{p\times p}$, and $p=o(n^{1/2})$ as $n\to\infty$; then for sparse SIR and sparse DR, we have $\sin(\hat\theta_1,\theta_1)=O_p\big(p\,n^{-1/2}\big)$.

  2. Let $v\in[0,1]$ correspond to the stability of the central curve $E[x|y]$. When $p=o(n)$, with a proper choice of the number of slices $H=(n/p)^{\frac{1}{2v+2}}$ and when (A3)–(A4) in Lin et al. (2018) are satisfied, for sparse SIR we have $\sin(\hat\theta_1,\theta_1)\to0$ as $n\to\infty$. Further, if $\Sigma_x=I_{p\times p}$, we have the convergence rate $\sin(\hat\theta_1,\theta_1)=O_p\big((p/n)^{\frac{v}{2v+2}}\big)$.

  3. Let $s=|S|\ll p$, where $S=\{i\,|\,\theta_j(i)\neq0\ \text{for some}\ j,\ 1\le j\le d\}$. Under the high-dimensional setting that $p\gg n$, assuming (A3)–(A6) and (S1) in Lin et al. (2018), and $\Sigma_x=I_{p\times p}$, then for the sparse DT-SIR, with a proper choice of $H=(n/p)^{\frac{1}{2v+2}}$, we have the convergence rate $\sin(\hat\theta_1,\theta_1)=O_p\big((s/n)^{\frac{v}{2v+2}}\big)$.

Here the stability $v$ of the central curve $E[x|y]$ is defined in Definition 1 of Lin et al. (2019); we also provide its complete definition in the supplemental material.

It is worth mentioning that our framework is very flexible: by applying different SDR methods to estimate the $M$ matrix, we can derive the convergence rate under different settings. Let $J=\{j:\theta_{1j}\neq0\}$ be the support set of $\theta_1$. We further prove the variable selection consistency in Theorem 2.

Theorem 2

Under the assumptions of Theorem 1, if furthermore $\min_{j\in J}|\theta_{1j}|\ge C_0\,\tau(n,p)$ for some constant $C_0>0$ and $\|\theta_1\|_0=t$, then $\operatorname{supp}(\hat\theta_1)=\operatorname{supp}(\theta_1)$ with probability tending to 1 as $n\to\infty$.

Next, recall that in Section 2, for $k\ge2$, the same optimisation procedure allows us to estimate $\hat\theta_k$ from $\hat M_k$. We study the estimation bound and consistency for $\hat\theta_k$ ($k\ge2$) in Theorem 3.

Theorem 3

Suppose that the conditions in Theorems 1 and 2 are satisfied, $\|\theta_k\|_0\le t$, $\lambda_k>\lambda_{k+1}$ and $\lambda_k=O(1)$ for $1\le k\le d-1$. Then for $1\le k\le d-1$, we have
$$\|\hat M_{k+1}-M_{k+1}\|_2=O_p(\tau(n,p)),\qquad\sin(\hat\theta_{k+1},\theta_{k+1})=O_p(\tau(n,p)),$$
where $\hat M_{k+1}$ and $M_{k+1}$ are defined in Section 2.

Now, we provide the computational guarantees of our proposed nonconvex ADMM. First, we show that the folded concave penalised estimation approximates the $\ell_0$ penalisation well. To this end, we use the rescaled folded concave penalty function $P_\delta$ defined in Section 2. It is easy to see that $\lim_{\delta\to0}P_\delta(t)=|t|_0$ for any $t\in\mathbb{R}$; that is, $P_\delta(t)$ converges to the $\ell_0$ penalty. For a matrix $\Theta$, we define the penalisation $P_\delta(|\Theta|)=\sum_{i=1}^p\sum_{j=1}^dP_\delta(\Theta_{ij})$, the sum of the univariate rescaled penalties over all entries. Here, we reveal the connection between the $\ell_0$ penalised problem (1) and the rescaled folded concave penalised problem (3) in the following theorem.

Theorem 4

Suppose we choose $\{\delta_u\}$ to be a sequence such that $\lim_{u\to\infty}\delta_u=0^+$. For problem (3), we choose the penalisation function $P_\delta(t)$ as defined above. Let $\Theta_u$ be the global maximiser of (3) when $\delta=\delta_u$, and assume $\Theta$ is a limit point of $\{\Theta_u\}$. Then $\Theta$ is also a global maximiser of (1).

Since the nonconvex problem (3) has widely varying curvature, it is nontrivial to prove the algorithmic convergence. By taking advantage of the structured nonconvexity, the next theorem shows that our proposed nonconvex ADMM converges to an $\epsilon$-stationary solution with an explicit iteration complexity bound. In the following theorem, we use $\operatorname{vec}(\Theta)$ to represent the vectorisation of a matrix in $\mathbb{R}^{p\times d}$ into a vector in $\mathbb{R}^{pd}$.

Theorem 5

Let $L$ be the Lipschitz constant of the loss function, and let
$$\rho>\max\Big(\frac{18\sqrt{3}+6}{13}L,\ 6hL\Big),\qquad\gamma\in\begin{cases}\Big(\dfrac{\sqrt{13\rho^2-12\rho L-72L^2}-2\rho}{9\rho^2-12\rho L-72L^2},\ +\infty\Big), & \rho\in\Big(\dfrac{2+2\sqrt{19}}{3}L,\ +\infty\Big),\\[10pt]\Big(\dfrac{2\rho-\sqrt{13\rho^2-12\rho L-72L^2}}{72L^2+12\rho L-9\rho^2},\ \dfrac{2\rho+\sqrt{13\rho^2-12\rho L-72L^2}}{72L^2+12\rho L-9\rho^2}\Big), & \rho\in\Big(\dfrac{18\sqrt{3}+6}{13}L,\ \dfrac{2+2\sqrt{19}}{3}L\Big].\end{cases}$$
For any $0<\epsilon<\min\{\frac{1}{L},\frac{1}{3h}\}$, with no more than $O(\frac{1}{\epsilon^4})$ iterations, our proposed algorithm obtains an $\epsilon$-stationary solution $(\hat\Theta,\hat\Omega,\hat\Lambda)$, as defined in Jiang et al. (2019): for any matrix $\Theta\in\mathbb{R}^{p\times p}$ with finite Frobenius norm,
$$\big(\operatorname{vec}(\Theta-\hat\Theta)\big)^T\Big[\operatorname{vec}\big(-\hat M+\partial P_\lambda(|\hat\Theta|)-\hat\Lambda I\big)\Big]\ge-\epsilon,\qquad\|\hat\Theta-\hat\Omega\|_\infty\le\epsilon,\qquad\operatorname{tr}(\hat\Omega)=1,\ \hat\Omega\succeq0.$$

Remark: When $\epsilon=0$, the conditions above reduce exactly to the KKT conditions, so the solution is a stationary solution. We will show the numerical performance of such near-stationary solutions in Section 6.

5.2. Asymptotic properties of sparse KSDR

In this section, we establish the consistency of the sparse kernel dimension reduction under milder assumptions. Based on Fukumizu et al. (2009), we impose the following assumptions (A4)–(A7).

(A4)

For any bounded continuous function $g$ on $\Omega_y$, the mapping $\Theta\mapsto E_x\big[E_{y|\Theta^TX}[g(y)\,|\,\Theta^TX]^2\big]$ is continuous on $\Theta\in\mathcal{S}_k^p(\mathbb{R})$ for $k=1,2,\dots,d$.

(A5)

For any $k=1,2,\dots,d$, let $\Theta\in\mathcal{S}_k^p(\mathbb{R})$, and let $P_\Theta$ be the probability distribution of the random variable $\Theta^Tx$. The Hilbert space $\mathcal{H}_X^S+\mathbb{R}$ is dense in $L^2(P_\Theta)$ for any $\Theta\in\mathcal{S}_k^p(\mathbb{R})$.

(A6)

There is a measurable function $\phi:\mathcal{X}\to\mathbb{R}$ such that $E|\phi(x)|^2<\infty$ and the Lipschitz condition $\|k_s(\Theta^Tx,\cdot)-k_s(\tilde\Theta^Tx,\cdot)\|_{\mathcal{H}^d}\le\phi(x)D(\Theta,\tilde\Theta)$ holds for all $\Theta,\tilde\Theta\in\mathcal{SP}_k^p(\mathbb{R})$ and $x\in\mathcal{X}$, for any $k=1,2,\dots,d$, where the kernel $k_s(x,\cdot)$ is defined in Section 3 and $D$ is a distance compatible with the topology of $\mathcal{SP}_k^p(\mathbb{R})$. We also assume $\|\Sigma_{yy}\|_{\mathrm{tr}}<\infty$ and $\|\Sigma_{yx}\|_{\mathrm{HS}}<\infty$.

(A7)

The true sparse SDR directions $\Theta=(\theta_1,\dots,\theta_d)$ satisfy $\Theta\in\mathcal{SP}_d^p(\mathbb{R})$.

Note that (A1) and (A2) are not required for the sparse KSDR, and we restate the sparsity assumption (A3) on our defined sparse orthogonal space as (A7). In (A4) to (A6), we introduce some mild regularity conditions in the RKHS, which are well justified in Fukumizu et al. (2009).

Theorem 6

Under assumptions (A4) to (A7), suppose in addition that the variables in the true support $\operatorname{supp}(\theta_i)$ are correlated with only a finite number of variables, that the chosen kernel function is continuous and bounded, and that the regularisation parameter $\epsilon_n$ satisfies
$$\epsilon_n\to0;\qquad\Big(\frac{n}{p^t}\Big)^{\frac{1}{t+2}}\epsilon_n\to\infty;\qquad p\asymp n^\alpha\ \text{with}\ \alpha<1/t,\quad\text{as}\ n\to\infty.$$
Then the functions $\operatorname{tr}[\hat\Sigma_{yy|\theta^Tx}]$ and $\operatorname{tr}[\Sigma_{yy|\theta^Tx}]$ are continuous on $\mathcal{SP}_1^p(\mathbb{R})$, and we have the convergence in probability
$$\sup_{\theta\in\mathcal{SP}_1^p(\mathbb{R})}\big|\operatorname{tr}[\hat\Sigma^{(n)}_{yy|\theta^Tx}]-\operatorname{tr}[\Sigma_{yy|\theta^Tx}]\big|\to0.$$

Theorem 6 establishes the consistency of the first KSDR vector. The growth rate assumption in the theorem says that the admissible growth rate of $p$ is negatively affected by the sparsity level $t$: when $t$ is smaller, $p$ can diverge faster as $n\to\infty$. Because of this uniform convergence, we establish the variable selection consistency in Theorem 7, allowing a diverging dimension $p$.

Theorem 7

Under all conditions in Theorem 6, for $p\asymp n^\alpha$ with $\alpha<1/t$ as $n\to\infty$, and $k=1,\dots,d$, we have $\Pr\big(\operatorname{supp}(\hat\theta_k)=\operatorname{supp}(\theta_k)\big)\to1$.

When variable selection consistency is achieved, we can further establish the estimation consistency of $\hat\theta$:

Theorem 8

Suppose all conditions in Theorem 7 hold and $p\asymp n^\alpha$ with $\alpha<1/t$ as $n\to\infty$. Denote
$$\hat\theta_1=\arg\min_\theta\ \operatorname{tr}(\Sigma_{yy|\theta^Tx})\quad\text{s.t.}\quad\|\theta\|_2=1,\ \|\theta\|_0\le t,$$
and define the corresponding matrix $\tilde\Theta_1=\hat\theta_1\hat\theta_1^T$. When $\tilde\Theta_1$ is nonempty, for any arbitrary open set $U$ in $\mathcal{S}_1^p(\mathbb{R})$ with the true direction $\theta\theta^T\in U$, we have
$$\lim_{n\to\infty}\Pr(\tilde\Theta_1\in U)=1.\tag{4}$$
For the following directions, let $\theta_1,\dots,\theta_k$ correspond to the true first $k$ KSDR vectors, and let
$$\hat\theta_{k+1}=\arg\min_\theta\ \operatorname{tr}\Big(\Sigma_{yy|(\sum_{i=1}^k\hat\theta_i\hat\theta_i^T+\theta\theta^T)x}\Big)\quad\text{s.t.}\quad\|\theta\|_2=1,\ \|\theta\|_0\le t.$$
Denote $\tilde\Theta_{k+1}=\sum_{i=1}^k\hat\theta_i\hat\theta_i^T+\hat\theta_{k+1}\hat\theta_{k+1}^T$ and the true space $\Theta_{k+1}=\sum_{i=1}^{k+1}\theta_i\theta_i^T$. When $\hat\theta_i$ are all nonempty for $i=1,\dots,k$, for any arbitrary open set $U$ in $\mathcal{S}_{k+1}^p(\mathbb{R})$ with $\Theta_{k+1}\in U$, we have
$$\lim_{n\to\infty}\Pr(\tilde\Theta_{k+1}\in U)=1.\tag{5}$$

Theorem 8 extends the KSDR consistency results of Fukumizu et al. (2009) to the diverging-$p$ case. Making use of the $\ell_0$-constrained parameter space, we obtain both variable selection and estimation consistency under the stated conditions.

Following the statistical properties, Theorem 9 shows the asymptotic equivalence of the folded concave constraint and the $\ell_0$ constraint, and Theorem 10 shows that our nonconvex linearised ADMM guarantees convergence to an $\epsilon$-stationary solution.

Similar to Theorem 4, for Theorem 9 we consider
$$\min_{\Theta\in\mathbb{R}^{p\times d}}\ \operatorname{tr}(\Sigma_{yy|\Theta^Tx})+\gamma1^T|\Theta|_01\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0,\tag{6}$$
$$\min_{\Theta\in\mathbb{R}^{p\times d}}\ \operatorname{tr}(\Sigma_{yy|\Theta^Tx})+\gamma P_\delta(|\Theta|)\quad\text{s.t.}\quad\operatorname{tr}(\Theta)=1;\ \Theta\succeq0,\tag{7}$$
where $P_\delta(|\Theta|)$ is defined in Section 2. Taking $d=1$, we can also solve for the sufficient directions iteratively, as introduced in Section 2.

Theorem 9

Suppose we choose $\{\delta_u\}$ to be a sequence such that $\lim_{u\to\infty}\delta_u=0^+$. For problem (7), we choose the penalisation function $P_\delta(t)$ as defined above. Let $\Theta_u$ be the global minimiser of (7) when $\delta=\delta_u$, and let $\Theta$ be a limit point of $\{\Theta_u\}$. Then $\Theta$ is also a global minimiser of (6).

Parallel to Theorem 5, we show the property of the $\epsilon$-stationary solution of our proposed nonconvex linearised ADMM in Theorem 10.

Theorem 10

Let $L$ be the Lipschitz constant of the loss function. Let
$$\rho>\max\Big(\frac{18\sqrt{3}+6}{13}L,\ 6hL\Big),\qquad\gamma\in\begin{cases}\Big(\dfrac{\sqrt{13\rho^2-12\rho L-72L^2}-2\rho}{9\rho^2-12\rho L-72L^2},\ +\infty\Big), & \rho\in\Big(\dfrac{2+2\sqrt{19}}{3}L,\ +\infty\Big),\\[10pt]\Big(\dfrac{2\rho-\sqrt{13\rho^2-12\rho L-72L^2}}{72L^2+12\rho L-9\rho^2},\ \dfrac{2\rho+\sqrt{13\rho^2-12\rho L-72L^2}}{72L^2+12\rho L-9\rho^2}\Big), & \rho\in\Big(\dfrac{18\sqrt{3}+6}{13}L,\ \dfrac{2+2\sqrt{19}}{3}L\Big].\end{cases}$$
For any $0<\epsilon<\min\{\frac{1}{L},\frac{1}{3h}\}$, with no more than $O(\frac{1}{\epsilon^4})$ iterations, our proposed algorithm obtains an $\epsilon$-stationary solution $(\hat\Theta,\hat\Omega_1,\hat\Omega_2,\hat\Lambda_1,\hat\Lambda_2)$, as defined in Jiang et al. (2019): for any $\Theta,\Omega_1\in\mathbb{R}^{p\times p}$ with finite Frobenius norm,
$$\big(\operatorname{vec}(\Theta-\hat\Theta)\big)^T\Big[\operatorname{vec}\big(\nabla_\Theta\operatorname{tr}(\hat\Sigma_{yy|\hat\Theta^Tx})-(\hat\Lambda_1+\hat\Lambda_2)I\big)\Big]\ge-\epsilon,$$
$$\big(\operatorname{vec}(\Omega_1-\hat\Omega_1)\big)^T\Big[\operatorname{vec}\big(\partial P_\lambda(|\hat\Omega_1|)-(\hat\Lambda_1+\hat\Lambda_2)I\big)\Big]\ge-\epsilon,$$
$$\|\hat\Theta-\hat\Omega_1\|_\infty\le\epsilon,\qquad\|\hat\Theta-\hat\Omega_2\|_\infty\le\epsilon,\qquad\operatorname{tr}(\hat\Omega_2)=1,\ \hat\Omega_2\succeq0.$$

6. Numerical properties

In this section, we evaluate the performance of the sparse SIR/DR/KSDR estimators in our semidefinite relaxation framework, the DT-SIR estimator proposed by Lin et al. (2019), the variable selection of penalised linear regression with the SCAD penalty, and marginal correlation screening.

Here we compare the numerical performance of these methods in four data generation models as follows with sample size n = 100.

  • Model 1: $y=\dfrac{0.5(x_1+x_2)}{0.5+(0.5(x_1-x_2)+1.5)^2}+(1+0.5(x_1-x_2))^2+\sigma^2e$, where $e\sim N(0,1)$, $x\sim N(0,I_{10})$, and $\sigma=0.2,0.4,0.8$ for Models 1a–1c.

  • Model 2: $y=\frac{1}{2}(x_1^2+x_2^2)+\sigma^2e$, where $e\sim N(0,1)$, $x\sim N(0,I_{20})$, and $\sigma=0.1,0.2,0.3$ for Models 2a–2c (see the data-generating sketch after this list).

  • Model 3: $y=\frac{1}{2}(x_1+1)^2+\sigma^2e$, where $e\sim N(0,1)$, $x\sim N(0,I_{40})$, and $\sigma=0.2,0.4,0.8$ for Models 3a–3c.

  • Model 4: $y=\sin^2(\pi x_1+1)+\sigma^2e$, where $\sigma=0.2,0.4,0.8$ for Models 4a–4c, $e\sim N(0,1)$, and $x$ is uniformly distributed on $\{x\in\mathbb{R}^{10}\,|\,x_i>0.7,\ \text{for}\ i=1,2,\dots,10\}$.
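As an illustration only, the following sketch generates data from Model 2 as reconstructed above (the random seed is an arbitrary choice).

```python
import numpy as np

def generate_model2(n=100, p=20, sigma=0.1, seed=0):
    """Model 2: y = (x1^2 + x2^2) / 2 + sigma^2 * e, with x ~ N(0, I_p)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    e = rng.standard_normal(n)
    y = 0.5 * (X[:, 0] ** 2 + X[:, 1] ** 2) + sigma ** 2 * e
    return X, y

X, y = generate_model2()   # n = 100 observations, p = 20 predictors
```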

To compare the numerical performance of these methods, we replicate the data generation procedure 100 times and apply all the methods to these 100 replications. We compute the estimated sparse SDR vectors $\hat\theta_k$, $k=1,2,\dots,d$, for each method in each replication. Then, we evaluate the performance of the different methods using the multiple $R^2$, the true positive rate (TPR) and the true negative rate (TNR). The multiple $R^2$ is defined as $R^2=\frac{1}{d}\sum_{k=1}^d\|P_{\operatorname{span}(\Theta)}\hat\theta_k\|_2^2$. The true positive rate (TPR) is defined as the fraction of variables correctly specified as non-zero in at least one of the SDR vectors, and the true negative rate (TNR) is the fraction of variables correctly specified as zero in all SDR vectors.
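A minimal sketch (our own illustration) of the three criteria described above, given the true direction matrix and the estimated sparse directions.

```python
import numpy as np

def evaluate(Theta_true, Theta_hat, tol=1e-8):
    """Multiple R^2, TPR and TNR for estimated sparse SDR directions.

    Theta_true: (p, d) matrix whose columns span the true central subspace.
    Theta_hat:  (p, d) matrix of estimated (unit-norm) directions.
    """
    # Multiple R^2: average squared norm of the projection of each
    # estimated direction onto span(Theta_true).
    Q, _ = np.linalg.qr(Theta_true)              # orthonormal basis of the truth
    proj = Q @ (Q.T @ Theta_hat)
    r2 = np.mean(np.sum(proj ** 2, axis=0))

    # Support recovery: a variable counts as selected if it is non-zero
    # in at least one estimated direction.
    true_active = np.any(np.abs(Theta_true) > tol, axis=1)
    est_active = np.any(np.abs(Theta_hat) > tol, axis=1)
    tpr = np.mean(est_active[true_active])
    tnr = np.mean(~est_active[~true_active])
    return r2, tpr, tnr
```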

The simulation results are summarised in Tables 1–3. In Tables 1 and 2, we report the mean and standard deviation of the three criteria for sparse SIR/DR/KSDR, DT-SIR, and the SCAD penalised regression. The marginal correlation method does not produce an $R^2$, so we report its true positive/negative rates in Table 3. The tuning procedure is as follows. For the proposed methods, we choose the penalisation parameter $\lambda$ via an oracle tuning procedure. Specifically, we construct two independent datasets with the same sample size $n=100$ for training and tuning, respectively. For each $\lambda$, we estimate the parameter $\Theta$ on the training dataset and evaluate the corresponding loss function on the tuning dataset. We choose the penalisation parameter that minimises the loss function on the tuning dataset. In practice, one may use cross-validation to find a proper penalisation parameter.
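A sketch of the oracle tuning loop described above (our own illustration); the estimator `fit` and the loss function `loss` are passed in as callables, since the same loop applies to any of the penalised methods being compared.

```python
import numpy as np

def oracle_tune(fit, loss, train_data, tune_data, lambda_grid):
    """Pick the penalisation parameter minimising the loss on the tuning set.

    fit(data, lam)  -> fitted parameter (e.g. Theta) on the training data.
    loss(data, est) -> scalar loss of an estimate on the tuning data.
    """
    losses = [loss(tune_data, fit(train_data, lam)) for lam in lambda_grid]
    best = int(np.argmin(losses))
    return lambda_grid[best], losses[best]
```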

Table 1. Summary of $R^2$, TPR and TNR for Models 1 and 2.

Table 2. Summary of $R^2$, TPR and TNR for Models 3 and 4.

Table 3. Summary of TPR and TNR for the marginal correlation method.

For implementation, sparse SIR/DR/KSDR are implemented using our proposed nonconvex ADMMs, DT-SIR is implemented using the R package 'LassoSIR', and the SCAD penalised regression is implemented using the R package 'ncvreg'. For the marginal correlation method, we compute the correlation between $y$ and each variable $x_i$, rank the absolute values of the correlations, and select the first $d$ variables, where $d$ is the true dimension.

We also compare the $R^2$ performance of the kernel dimension reduction of Fukumizu et al. (2009) in Models 4(a, b, c) as a benchmark. The average $R^2$ values are 0.642, 0.546 and 0.513 for Models 4(a), 4(b) and 4(c), respectively. These results are comparable with the sparse KSDR results; sparse KSDR performs slightly better because the penalisation eliminates some noise from irrelevant variables. Since kernel dimension reduction does not provide a sparse solution, and hence no meaningful TPR or TNR, we do not include it in the main tables.

We draw the following conclusions from the results. (1) Sparse SIR and DT-SIR are comparable in most cases, while sparse SIR outperforms DT-SIR in $R^2$ in general. In Models 1 and 3, where the assumptions for SIR hold, our sparse SIR is almost perfect. DT-SIR also performs well in these two settings but sometimes includes many irrelevant variables, which leads to unsatisfactory TNR and $R^2$. (2) Sparse DR is also comparable with sparse SIR in general. Although it may not outperform SIR in Models 1 and 3, when the function is symmetric in $x$ as in Model 2, SIR fails to recover the true space, while sparse DR is very stable. (3) Sparse KSDR performs best in Model 4, where the linearity and constant variance conditions (A1) and (A2) do not hold and both SIR and DR lose consistency. Sparse KSDR also performs fairly well compared with SIR and DR in Models 1–3, where the linearity condition holds, regardless of whether the function is symmetric in $x$. (4) The penalised regression fails in Models 2 and 4, where the relationship between $x$ and $y$ cannot be approximated by a linear regression. Moreover, regression cannot identify the linear combination $\theta$, which is also of interest. (5) Similarly, the marginal correlation method fails in Model 2, where the dependence between $x$ and $y$ is completely nonlinear and the population correlation is zero. Moreover, correlation-based methods also cannot identify the linear combination $\theta$ of interest.

7. Application to microbiome data analysis

In this section, we apply our method to analyse the host transcriptomic data and 16S rRNA gene sequencing data from paired biopsies of IPAA patients with ulcerative colitis and familial adenomatous polyposis (Morgan et al. 2015). Morgan et al. (2015) extracted 5 principal components (PCs) of the microbiome 16S rRNA gene sequencing data and analysed the relationship between these PCs and the PCs of the host transcriptomic data. However, the results are hard to interpret since the PCs are linear combinations of all host transcripts. Thus, we apply the proposed sparse KSDR to explore more interpretable sparse linear combinations that are relevant to these microbiome PCs.

First, we select 20 representative genes from the host transcriptomic data in two groups. The first group contains the genes that differ most between the two biopsy locations (PPI and pouch) of a patient according to the Kolmogorov–Smirnov test. The second group contains important genes pointed out in Morgan et al. (2015), which are relevant to the pathogenesis of inflammatory bowel disease (IBD). We treat each of the 9 principal components of the microbiome 16S rRNA gene sequencing data as $y$ in our model separately, and apply our kernel sparse dimension reduction estimation to the host transcriptomic data of the PPI for each patient, with sample size $n=196$ and $p=20$. The selected genes and their coefficients are summarised in Table 4.

Table 4. Selected genes and corresponding coefficients using KSDR.

In the results, the leading PCs (PC1–PC5) are mainly correlated with protein-related genes such as TLR1 and MMEL1, which suggests that these PCs may be mainly composed of an antibiotic-signature microbiome. Several genes that are widely studied as important to the pathogenesis of IBD are also selected. For example, IL10 is selected for PC1, MUC6 is selected for PC2, 3, 4 and 5, and IL1RN is selected for PC1, 3 and 4. Among them, defective IL10 signalling is well studied as defining a subgroup of patients with IBD (Begue et al. 2011). MUC6 may have a role in epithelial wound healing after mucosal injury in IBD in addition to mucosal protection (Buisine et al. 2001). IL1RN has a well-established pathological role in IBD (Stokkers et al. 1998). Our results emphasise that the effects of these genes may be actively related to the microbiome RNA environment of IBD patients.

8. Conclusion

Motivated by the desired properties of sparse SDR, we propose a nonconvex estimation scheme for sparse KSDR. In terms of statistical properties, we prove the asymptotic consistency of both estimators. In terms of optimisation, we show that both can be solved efficiently by nonconvex ADMM algorithms with provable convergence guarantees. We also show the practical usefulness of the proposed methods in various simulation settings and a real data analysis.

Supplemental material

Supplemental material for this article is available online.

Acknowledgments

The authors are grateful for the insightful comments and suggestions from the editor and the reviewers.

Disclosure statement

No potential conflict of interest was reported by the author(s).


Funding

The authors were partially supported by the National Institutes of Health (NIH) grant 1R01GM152812 and National Science Foundation (NSF) grants DMS-1811552, DMS-1953189, CCF-2007823, and DMS-2210775.

References

  • Amini, A.A., and Wainwright, M.J. (2008), ‘High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components’, in Information Theory, 2008. ISIT 2008. IEEE International Symposium on, IEEE, pp. 2454–2458.
  • Baker, C.R. (1973), ‘Joint Measures and Cross-Covariance Operators’, Transactions of the American Mathematical Society, 186, 273–289.
  • Begue, B., Verdier, J., Rieux-Laucat, F., Goulet, O., Morali, A., Canioni, D., Hugot, J.-P., Daussy, C., Verkarre, V., Pigneur, B., and Fischer, A. (2011), ‘Defective Il10 Signaling Defining a Subgroup of Patients with Inflammatory Bowel Disease’, Official Journal of the American College of Gastroenterology ACG, 106(8), 1544–1555.
  • Bondell, H.D., and Li, L. (2009), ‘Shrinkage Inverse Regression Estimation for Model-free Variable Selection’, Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(1), 287–299.
  • Buisine, M., Desreumaux, P., Leteurtre, E., Copin, M., Colombel, J., Porchet, N., and Aubert, J. (2001), ‘Mucin Gene Expression in Intestinal Epithelial Cells in Crohn's Disease’, Gut, 49(4), 544–551.
  • Chen, X., Zou, C., and Cook, R.D. (2010), ‘Coordinate-independent Sparse Sufficient Dimension Reduction and Variable Selection’, The Annals of Statistics, 38(6), 3696–3723.
  • Cook, R.D., and Weisberg, S. (1991), ‘Comment’, Journal of the American Statistical Association, 86(414), 328–332.
  • d'Aspremont, A., Ghaoui, L.E., Jordan, M.I., and Lanckriet, G.R. (2005), ‘A Direct Formulation for Sparse PCA Using Semidefinite Programming’, in Advances in Neural Information Processing Systems, pp. 41–48.
  • Fan, J., and Li, R. (2001), ‘Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties’, Journal of the American Statistical Association, 96(456), 1348–1360.
  • Fan, J., Xue, L., and Yao, J. (2017), ‘Sufficient Forecasting Using Factor Models’, Journal of Econometrics, 201(2), 292–306.
  • Fan, J., Xue, L., and Zou, H. (2014), ‘Strong Oracle Optimality of Folded Concave Penalized Estimation’, The Annals of Statistics, 42(3), 819–849.
  • Frommlet, F., and Nuel, G. (2016), ‘An Adaptive Ridge Procedure for L0 Regularization’, PloS One, 11(2), e0148620.
  • Fukumizu, K., Bach, F.R., and Jordan, M.I. (2009), ‘Kernel Dimension Reduction in Regression’, The Annals of Statistics, 37(4), 1871–1905.
  • Jiang, B., Lin, T., Ma, S., and Zhang, S. (2019), ‘Structured Nonconvex and Nonsmooth Optimization: Algorithms and Iteration Complexity Analysis’, Computational Optimization and Applications, 72(1), 115–157.
  • Li, K.-C. (1991), ‘Sliced Inverse Regression for Dimension Reduction’, Journal of the American Statistical Association, 86(414), 316–327.
  • Li, L. (2007), ‘Sparse Sufficient Dimension Reduction’, Biometrika, 94(3), 603–613.
  • Li, B., Chun, H., and Zhao, H. (2012), ‘Sparse Estimation of Conditional Graphical Models with Application to Gene Networks’, Journal of the American Statistical Association, 107(497), 152–167.
  • Li, B., and Wang, S. (2007), ‘On Directional Regression for Dimension Reduction’, Journal of the American Statistical Association, 102(479), 997–1008.
  • Lin, Q., Zhao, Z., and Liu, J.S. (2018), ‘On Consistency and Sparsity for Sliced Inverse Regression in High Dimensions’, The Annals of Statistics, 46(2), 580–610.
  • Lin, Q., Zhao, Z., and Liu, J.S. (2019), ‘Sparse Sliced Inverse Regression Via Lasso’, Journal of the American Statistical Association, 114(528), 1726–1739.
  • Liu, B., Zhang, Q., Xue, L., Song, P.X.-K., and Kang, J. (2024), ‘Robust High-dimensional Regression with Coefficient Thresholding and Its Application to Imaging Data Analysis’, Journal of the American Statistical Association, 119, 715–729.
  • Luo, W., Xue, L., Yao, J., and Yu, X. (2022), ‘Inverse Moment Methods for Sufficient Forecasting Using High-dimensional Predictors’, Biometrika, 109(2), 473–487.
  • Ma, S. (2013), ‘Alternating Direction Method of Multipliers for Sparse Principal Component Analysis’, Journal of the Operations Research Society of China, 1(2), 253–274.
  • Mackey, L.W. (2009), ‘Deflation Methods for Sparse PCA’, in Advances in Neural Information Processing Systems, pp. 1017–1024.
  • Morgan, X.C., Kabakchiev, B., Waldron, L., Tyler, A.D., Tickle, T.L., Milgrom, R., Stempak, J.M., Gevers, D., Xavier, R.J., and Silverberg, M.S. (2015), ‘Associations Between Host Gene Expression, the Mucosal Microbiome, and Clinical Outcome in the Pelvic Pouch of Patients with Inflammatory Bowel Disease’, Genome Biology, 16(1), 67.
  • Neykov, M., Lin, Q., and Liu, J.S. (2016), ‘Signed Support Recovery for Single Index Models in High-dimensions’, Annals of Mathematical Sciences and Applications, 1(2), 379–426.
  • Shi, L., Huang, X., Feng, Y., and Suykens, J. (2019), ‘Sparse Kernel Regression with Coefficient-based Lq-regularization’, Journal of Machine Learning Research, 20(116), 1–44.
  • Stokkers, P., Van Aken, B., Basoski, N., Reitsma, P., Tytgat, G., and Van Deventer, S. (1998), ‘Five Genetic Markers in the Interleukin 1 Family in Relation to Inflammatory Bowel Disease’, Gut, 43(1), 33–39.
  • Tan, K., Shi, L., and Yu, Z. (2020), ‘Sparse SIR: Optimal Rates and Adaptive Estimation’, The Annals of Statistics, 48(1), 64–85. https://doi.org/10.1214/18-AOS1791.
  • Tan, K.M., Wang, Z., Zhang, T., Liu, H., and Cook, R.D. (2018), ‘A Convex Formulation for High-dimensional Sparse Sliced Inverse Regression’, Biometrika, 105(4), 769–782.
  • Wold, S., Esbensen, K., and Geladi, P. (1987), ‘Principal Component Analysis’, Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52.
  • Wu, C., Miller, J., Chang, Y., Sznaier, M., and Dy, J. (2019), ‘Solving Interpretable Kernel Dimensionality Reduction’, in Advances in Neural Information Processing Systems, pp. 7915–7925.
  • Ying, C., and Yu, Z. (2022), ‘Fréchet Sufficient Dimension Reduction for Random Objects’, Biometrika, 109(4), 975–992.
  • Yu, X., Yao, J., and Xue, L. (2022), ‘Nonparametric Estimation and Conformal Inference of the Sufficient Forecasting with a Diverging Number of Factors’, Journal of Business & Economic Statistics, 40(1), 342–354.
  • Zhang, C.-H. (2010), ‘Nearly Unbiased Variable Selection Under Minimax Concave Penalty’, The Annals of Statistics, 38(2), 894–942.
  • Zhang, Q., Li, B., and Xue, L. (2024), ‘Nonlinear Sufficient Dimension Reduction for Distribution-on-distribution Regression’, Journal of Multivariate Analysis, 202, 105302.
  • Zhang, Q., Xue, L., and Li, B. (2024), ‘Dimension Reduction for Fréchet Regression’, Journal of the American Statistical Association, in press.
  • Zou, H. (2006), ‘The Adaptive Lasso and Its Oracle Properties’, Journal of the American Statistical Association, 101(476), 1418–1429.
  • Zou, H., Hastie, T., and Tibshirani, R. (2006), ‘Sparse Principal Component Analysis’, Journal of Computational and Graphical Statistics, 15(2), 265–286.
  • Zou, H., and Xue, L. (2018), ‘A Selective Overview of Sparse Principal Component Analysis’, Proceedings of the IEEE, 106(8), 1311–1320.