Dimension reduction with expectation of conditional difference measure

Pages 188-201 | Received 26 Sep 2022, Accepted 14 Feb 2023, Published online: 13 Mar 2023

Abstract

In this article, we introduce a flexible model-free approach to sufficient dimension reduction using the expectation of conditional difference measure. Without strict conditions, such as the linearity condition or the constant covariance condition, the method estimates the central subspace exhaustively and efficiently under both linear and nonlinear relationships between the response and the predictors. The method is especially meaningful when the response is categorical. We also establish the $\sqrt{n}$-consistency and asymptotic normality of the estimator. The efficacy of our method is demonstrated through both simulations and a real data analysis.

1. Introduction

With the increase of dimensionality, the volume of the space increases so fast that the available data become sparse (Bellman, 1961). This sparsity is a problem for many statistical methods, since not enough data are available for model fitting or inference. As a result, many classical models derived from oversimplified assumptions, as well as nonparametric methods, are no longer reliable. Therefore, dimension reduction, which reduces the data dimension while retaining (sufficient) important information, can play a critical role in high-dimensional data analysis. With dimension reduction as a pre-processing step, the number of reduced dimensions is often small, so parametric and nonparametric modelling methods can then be readily applied to the reduced data.

Sufficient dimension reduction is one approach to dimension reduction. It focuses on finding a linear transformation of the predictors such that, given the transformed predictors, the response and the predictors are independent (Cook, 1994, 1996; Li, 1991). For the past 25 years, sufficient dimension reduction has been an active research topic, and many methods have been developed to estimate the central subspace (Cook, 1996). These methods can be classified into three groups: inverse, forward and joint regression methods. Inverse regression methods use the regression of $X$ on $Y$ and require certain conditions on $X$, such as the linearity condition and/or the constant covariance condition. Specific methods include sliced inverse regression (SIR; Li, 1991), sliced average variance estimation (SAVE; Cook & Weisberg, 1991) and directional regression (DR; Li & Wang, 2007); see also Cook and Ni (2005), Cook and Zhang (2014), Dong and Li (2010), Fung et al. (2002) and Zhu and Fang (1996). Forward regression methods include the minimum average variance estimation (MAVE; Xia et al., 2002) and its variants (Xia, 2007; Wang & Xia, 2008), the average derivative estimate (Härdle & Stoker, 1989; Powell et al., 1989), and the structure adaptive method (Hristache et al., 2001; Ma & Zhu, 2013). The forward methods require nonparametric approaches such as kernel smoothing. Joint regression methods use the joint distribution of $(Y,X)$; they include the principal Hessian directions (PHD; Cook, 1998; Li, 1992) and the Fourier method (Zeng & Zhu, 2010; Zhu & Zeng, 2006), and they require either smoothing techniques or stronger conditions.

In this article, we develop a new sufficient dimension reduction method based on the measure proposed in Yin and Yuan (2020) to estimate the central subspace. It involves slicing the range of $Y$ into several intervals, similar to classical inverse approaches such as SIR and SAVE, but it does not require any linearity or constant covariance condition and can exhaustively recover the central subspace without smoothing. On the other hand, compared with other sufficient dimension reduction methods based on distance measures, such as Sheng and Yin (2016), our method is more natural when the response $Y$ is categorical with no numerical meaning, because the measure used in this article is properly defined for categorical variables.

This article is organized as follows: Section 2 introduces the new sufficient dimension reduction method, the algorithm, its theoretical properties and the method for estimating the structural dimension $d$. Section 3 presents simulation studies, Section 4 presents a real data analysis, and a brief discussion follows in Section 5.

2. Methodology

2.1. A measure of divergence

Yin and Yuan (2020) proposed a new measure of divergence for testing independence between two random vectors. Let $X\in\mathbb R^p$ and $Y\in\mathbb R^q$, where $p$ and $q$ are positive integers. Then the measure between $X$ and $Y$ with finite first moments is a nonnegative number, $C(X|Y)$, defined by
$$C^2(X|Y)=\int_{\mathbb R^p}|f_{X|Y}(t)-f_X(t)|^2 w(t)\,dt,\tag{1}$$
where $f_{X|Y}$ and $f_X$ stand for the characteristic functions of $X|Y$ and $X$, respectively, and $|f|^2=f\bar f$ for a complex-valued function $f$, with $\bar f$ the conjugate of $f$. The weight function $w(t)$ is a specially chosen positive function; more details on $w(t)$ can be found in Yin and Yuan (2020). They also give an equivalent formula,
$$C^2(X|Y)=E|X-X'_{Y'}|-E|X_Y-X'_Y|=E|X-X'|-E|X_Y-X'_Y|,\tag{2}$$
where the expectations are over all the random vectors involved; for instance, the last expectation first takes the conditional expectation given $Y$ and then averages over $Y$. Here $(X',Y')$ is an independent and identically distributed copy of $(X,Y)$, $X_Y$ denotes a random variable distributed as $X|Y$, $X'_{Y'}$ denotes a random variable distributed as $X'|Y'$, and $X'_Y$ denotes a random variable distributed as $X'|Y'$ with $Y'=Y$.

One property of $C^2(X|Y)$ is that it equals 0 if and only if the two random vectors are independent (Yin & Yuan, 2020). This property makes it possible to use $C^2(X|Y)$ as a sufficient dimension reduction tool. Moreover, the measure works well for both continuous and categorical $Y$. Because $C^2(X|Y)$ is well defined for categorical $Y$, our method is particularly meaningful when the class labels of a dataset have no numerical meaning, a setting in which other measures do not enjoy a similar advantage.

2.2. Review of sufficient dimension reduction

Let $\gamma$ be a $p\times q$ matrix with $q\le p$, and let $\perp\!\!\!\perp$ denote independence. The following conditional independence statement leads to the definition of sufficient dimension reduction:
$$Y\perp\!\!\!\perp X\mid \gamma^\top X,\tag{3}$$
where (3) indicates that the regression information of $Y$ given $X$ is completely contained in the linear combinations $\gamma^\top X$. The column space of $\gamma$ in (3), denoted by $\mathcal S(\gamma)$, is called a dimension reduction subspace.

If the intersection of all dimension reduction subspaces is itself a dimension reduction subspace, it is called the central subspace (CS) and is denoted by $\mathcal S_{Y|X}$ (Cook, 1994, 1996; Li, 1991). Under mild conditions, the CS exists (Cook, 1998; Yin et al., 2008). Throughout the article, we assume the CS exists; it is then unique. Furthermore, let $d$ denote the structural dimension of the CS, and let $\Sigma_X$ be the covariance matrix of $X$, which is assumed to be nonsingular. Our primary goal is to identify the CS by estimating $d$ and a $p\times d$ basis matrix of the CS.

Here we introduce some notation needed in the following sections. Let $\beta$ be a matrix and let $\mathcal S(\beta)$ be the subspace spanned by the column vectors of $\beta$; $\dim(\mathcal S(\beta))$ is the dimension of $\mathcal S(\beta)$. $P_\beta(\Sigma_X)$ denotes the projection operator onto $\mathcal S(\beta)$ with respect to the inner product $\langle a,b\rangle=a^\top\Sigma_X b$, that is, $P_\beta(\Sigma_X)=\beta(\beta^\top\Sigma_X\beta)^{-1}\beta^\top\Sigma_X$. Let $Q_\beta(\Sigma_X)=I-P_\beta(\Sigma_X)$, where $I$ is the identity matrix.
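For concreteness, these two operators can be written in a few lines of numpy; this is only a small sketch, and the function names P and Q below are ours, not part of any package.

```python
import numpy as np


def P(beta, Sigma):
    """P_beta(Sigma) = beta (beta' Sigma beta)^{-1} beta' Sigma, the projection
    onto S(beta) with respect to the inner product <a, b> = a' Sigma b."""
    return beta @ np.linalg.solve(beta.T @ Sigma @ beta, beta.T @ Sigma)


def Q(beta, Sigma):
    """Q_beta(Sigma) = I - P_beta(Sigma)."""
    return np.eye(beta.shape[0]) - P(beta, Sigma)
```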

2.3. The new sufficient dimension reduction method

Let $\beta$ be an arbitrary $p\times d_0$ matrix, where $1\le d_0\le p$. Under mild conditions, it can be proved that solving (4) yields a basis of the central subspace:
$$\max_{\beta:\,\beta^\top\Sigma_X\beta=I_{d_0},\,1\le d_0\le p}\ C^2(\beta^\top X|Y).\tag{4}$$
Here the squared divergence between $\beta^\top X$ and $Y$ is defined as $C^2(\beta^\top X|Y)=E_Y\big[\int_{\mathbb R^{d_0}}|f_{\beta^\top X|Y}(t)-f_{\beta^\top X}(t)|^2w(t)\,dt\big]$. The conditions $E|X|<\infty$, $E|Y|<\infty$ and $E|XY|<\infty$ in Yin and Yuan (2020) guarantee that $C^2(\beta^\top X|Y)$ is finite; thus, throughout the article, we assume they hold. The constraint $\beta^\top\Sigma_X\beta=I_{d_0}$ in (4) is needed because of the property $C^2(c\beta^\top X|Y)=|c|\,C^2(\beta^\top X|Y)$ for any constant $c$ (Yin & Yuan, 2020).

The following propositions justify our estimator. They ensure that if we maximize $C^2(\beta^\top X|Y)$ with respect to $\beta$ under the constraint and some mild conditions, the solution indeed spans the CS.

Proposition 2.1

Let $\eta$ be a $p\times d$ basis matrix of the CS, and let $\beta$ be a $p\times d_1$ matrix with $d_1\le d$, $\dim(\mathcal S(\beta))=d_1$, $\eta^\top\Sigma_X\eta=I_d$ and $\beta^\top\Sigma_X\beta=I_{d_1}$. If $\mathcal S(\beta)\subseteq\mathcal S(\eta)$, then $C^2(\beta^\top X|Y)\le C^2(\eta^\top X|Y)$. The equality holds if and only if $\mathcal S(\beta)=\mathcal S(\eta)$.

Proposition 2.2

Let $\eta$ be a $p\times d$ basis matrix of the CS, and let $\beta$ be a $p\times d_2$ matrix with $\eta^\top\Sigma_X\eta=I_d$ and $\beta^\top\Sigma_X\beta=I_{d_2}$. Here $d_2$ can be greater than, less than or equal to $d$. Suppose $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$ and $\mathcal S(\beta)\not\subseteq\mathcal S(\eta)$. Then $C^2(\beta^\top X|Y)<C^2(\eta^\top X|Y)$.

Proposition 2.1 indicates that if $\mathcal S(\beta)$ is a subspace of the CS, then $C^2(\beta^\top X|Y)\le C^2(\eta^\top X|Y)$, with equality if and only if $\beta$ is a basis matrix of the CS, i.e., $\mathcal S(\beta)=\mathcal S(\eta)$. Proposition 2.2 implies that if $\mathcal S(\beta)$ is not a subspace of the CS, then $C^2(\beta^\top X|Y)<C^2(\eta^\top X|Y)$ under a mild condition. Together, the two propositions show that we can identify the CS by maximizing $C^2(\beta^\top X|Y)$ with respect to $\beta$ under the quadratic constraint. The condition in Proposition 2.2, $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, was discussed in Sheng and Yin (2013), who showed that it is not very strict and is satisfied asymptotically when $p$ is reasonably large. Proofs of Propositions 2.1 and 2.2 are given in Appendix A.

2.4. Estimating the CS when d is specified

In this section, we develop an algorithm for estimating the CS when the structural dimension $d$ is known. Let $(\mathbf X,\mathbf Y)=\{(X_i,Y_i),\ i=1,\ldots,n\}$ be a random sample from $(X,Y)$ and let $\beta$ be a $p\times d$ matrix. For the purpose of slicing, the $n$ observations can be equivalently written as $(X_{y,k_y},Y_{y,k_y})$, $y=1,\ldots,H$, $k_y=1,\ldots,n_y$, where $n_y$ is the number of observations in slice $y$. The empirical version of $C^2(\beta^\top X|Y)$, denoted by $C_n^2(\beta^\top X|Y)$, is defined through
$$C_n^2(X|Y)=\frac{1}{n^2}\sum_{k,l=1}^{n}|X_k-X_l|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|X_{y,k_y}-X_{y,l_y}|.\tag{5}$$
Here $|\cdot|$ is the Euclidean norm in the respective dimension. Let $\hat\Sigma_X$ be the estimate of $\Sigma_X$. Then an estimated basis matrix of the CS, say $\eta_n$, is
$$\eta_n=\arg\max_{\beta:\,\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y).\tag{6}$$
An outline of the algorithm is as follows.

  1. Obtain the initial value $\eta^{(0)}$: any existing sufficient dimension reduction method, such as SIR (Li, 1991) or SAVE (Cook & Weisberg, 1991), can be used to obtain the initial value.

  2. Iterations: let $\eta^{(k)}$ be the estimate of $\eta$ in the $k$th iteration. To search for $\eta^{(k+1)}$, the interior-point approach is applied: the original optimization problem in (6) is replaced by a sequence of barrier subproblems, which are solved approximately using two powerful tools, sequential quadratic programming and trust-region techniques. In this process, one of two main types of steps is used at each iteration: a direct step or a conjugate gradient step. By default, the algorithm tries a direct step first; if a direct step fails, it attempts a conjugate gradient step. More extensive descriptions of the interior-point approach are given in Byrd et al. (1999, 2000) and Waltz et al. (2006).

  3. Check convergence: if the difference between $\eta^{(k)}$ and $\eta^{(k+1)}$ is smaller than a preset tolerance, such as $10^{-6}$, stop the iteration and set $\eta_n=\eta^{(k+1)}$; otherwise, set $k:=k+1$ and go to step 2.

In the above algorithm, we assume the structural dimension $d$ is known, which is not the case in practice. We propose an approach for estimating $d$ in Section 2.6.
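The following Python sketch illustrates the computation of the sample objective $C_n^2(\beta^\top X|Y)$ in (5) and the constrained maximization in (6) under simplifying assumptions: it uses a generic trust-region constrained solver from scipy in place of the interior-point routine described above, a random starting value rather than a SIR or SAVE estimate, and quantile-based slicing of a continuous response. The function names ecd_objective and estimate_cs are ours, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.spatial.distance import pdist, squareform


def ecd_objective(B, X, slice_ids):
    """Sample objective C_n^2(B'X | Y) of (5): the overall mean pairwise distance
    of the reduced predictors minus the averaged within-slice mean pairwise distance."""
    Z = X @ B                                  # n x d reduced predictors
    D = squareform(pdist(Z))                   # n x n Euclidean distance matrix
    n = X.shape[0]
    total = D.sum() / n ** 2
    within = 0.0
    for y in np.unique(slice_ids):
        idx = np.where(slice_ids == y)[0]
        within += D[np.ix_(idx, idx)].sum() / len(idx)
    return total - within / n


def estimate_cs(X, Y, d, n_slices=10, slice_ids=None, seed=0):
    """Estimate a p x d basis of the CS by maximizing C_n^2(B'X | Y)
    subject to B' Sigma_hat B = I_d, as in (6) (a sketch)."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    if slice_ids is None:
        # slice a continuous response into n_slices groups;
        # for a categorical response, pass its labels via slice_ids instead
        cuts = np.quantile(Y, np.linspace(0, 1, n_slices + 1)[1:-1])
        slice_ids = np.digitize(Y, cuts)
    rng = np.random.default_rng(seed)
    B0 = rng.standard_normal((p, d))           # random start (the paper uses SIR/SAVE)

    def neg_obj(b):
        return -ecd_objective(b.reshape(p, d), X, slice_ids)

    def constraint(b):
        B = b.reshape(p, d)
        return (B.T @ Sigma @ B - np.eye(d)).ravel()

    res = minimize(neg_obj, B0.ravel(), method="trust-constr",
                   constraints=NonlinearConstraint(constraint, 0.0, 0.0))
    return res.x.reshape(p, d)
```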

2.5. Theoretical properties

Proposition 2.3

Let $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$, and let $\eta$ be a basis matrix of the CS with $\eta^\top\Sigma_X\eta=I_d$. Under the condition $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, $\eta_n$ is a consistent estimator of a basis of the CS; that is, there exists a rotation matrix $Q$ with $Q^\top Q=I_d$ such that $\eta_n\xrightarrow{P}\eta Q$.

Furthermore, we can prove the $\sqrt n$-consistency and asymptotic normality of the estimator, as stated below.

Proposition 2.4

Let $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$, and let $\eta$ be a basis matrix of the CS with $\eta^\top\Sigma_X\eta=I_d$. Under the regularity conditions in the supplementary file, there exists a rotation matrix $Q$ with $Q^\top Q=I_d$ such that $\sqrt n[\mathrm{vec}(\eta_n)-\mathrm{vec}(\eta Q)]\xrightarrow{D}N(0,V_{11}(\eta Q))$, where $V_{11}(\eta Q)$ is the covariance matrix given in the supplementary file.

Proofs of Propositions 2.3 and 2.4 are in Appendices B and C, respectively.

2.6. Estimating structural dimension d

There is a rich literature on determining $d$ in sufficient dimension reduction, including nonparametric methods such as Wang and Xia (2008), Ye and Weiss (2003) and Luo and Li (2016), and eigen-decomposition-based methods such as Luo et al. (2009) and Wang et al. (2015). Here we apply the kNN method proposed in Wang et al. (2015).

Given a sample $\{(X_i,Y_i),\ 1\le i\le n\}$, $d$ can be estimated by the following kNN procedure.

  1. Find the $k$-nearest neighbours of each data point $(X_i,Y_i)$ using the Euclidean distance. Denote the $k$-nearest neighbours of $(X_i,Y_i)$ by $(X_i^{(j)},Y_i^{(j)})$, $1\le j\le k$.

  2. For each data point $(X_i,Y_i)$, apply the method proposed in this article to its $k$-nearest neighbours and estimate $\hat\beta_i$. Here the dimension of $\hat\beta_i$ is set to 1.

  3. Calculate the eigenvalues of the matrix $\sum_{i=1}^{n}\hat\beta_i\hat\beta_i^\top$ and order them as $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$.

  4. Calculate the ratios $r_i=\lambda_i/\lambda_{i+1}$, $1\le i\le p-1$. The dimension $d$ is estimated as the index $i$ at which the largest ratio $r_i$ occurs.

In the last step, this maximal eigenvalue ratio criterion was suggested by Luo et al. (2009) and was also used by Li and Yin (2009) and Sheng and Yuan (2020).
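A compact sketch of this procedure is given below. It reuses the hypothetical estimate_cs routine from Section 2.4 with the local dimension fixed at one; the neighbourhood size k and any slicing options passed through fit_kwargs are tuning choices of ours, not prescriptions from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree


def estimate_d(X, Y, k=30, **fit_kwargs):
    """Estimate the structural dimension d by the kNN / maximal eigenvalue
    ratio procedure of Section 2.6 (a sketch)."""
    n, p = X.shape
    XY = np.column_stack([X, np.asarray(Y, dtype=float).reshape(n, -1)])
    _, nbrs = cKDTree(XY).query(XY, k=k)       # k nearest neighbours (point itself included)
    M = np.zeros((p, p))
    for i in range(n):
        idx = nbrs[i]
        b_i = estimate_cs(X[idx], Y[idx], d=1, **fit_kwargs)   # one-dimensional local fit
        M += b_i @ b_i.T
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]                 # lambda_1 >= ... >= lambda_p
    ratios = lam[:-1] / lam[1:]                                # r_i = lambda_i / lambda_{i+1}
    return int(np.argmax(ratios)) + 1
```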

3. Simulation studies

Estimation accuracy is measured by the distance $\Delta_m(\hat{\mathcal S},\mathcal S)=\|P_{\hat{\mathcal S}}-P_{\mathcal S}\|$ (Li et al., 2005), where $\mathcal S$ is the true $d$-dimensional CS of $\mathbb R^p$, $\hat{\mathcal S}$ is its estimate, $P_{\mathcal S}$ and $P_{\hat{\mathcal S}}$ are the orthogonal projections onto $\mathcal S$ and $\hat{\mathcal S}$, respectively, and $\|\cdot\|$ is the maximum singular value of a matrix. The smaller $\Delta_m$ is, the better the estimate; a method also works better if it has a smaller standard error of $\Delta_m$. In the following, the first three examples show the good performance of the proposed method for both continuous and categorical responses, assuming the dimension $d$ is known. The last example illustrates the performance of estimating the dimension $d$ using the kNN procedure of Section 2.6.
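This criterion is easy to compute from two basis matrices. A minimal sketch follows: the orthogonal projections are formed through QR factorizations, and delta_m is our name for the function.

```python
import numpy as np


def delta_m(B_hat, B):
    """Largest singular value of P_hat - P, where P_hat and P are the
    orthogonal projections onto the column spaces of B_hat and B."""
    def orth_proj(A):
        Q, _ = np.linalg.qr(A)
        return Q @ Q.T
    return np.linalg.norm(orth_proj(B_hat) - orth_proj(B), ord=2)
```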

Example 3.1

Consider Model 1: $Y=(\beta_1^\top X)^2+\beta_2^\top X+0.1\epsilon$, where $X\sim N(0,I_p)$, $\epsilon\sim N(0,1)$ and $\epsilon$ is independent of $X$; $\beta_1=(1,0,\ldots,0)^\top$ and $\beta_2=(0,1,\ldots,0)^\top$. We compare DCOV (Sheng & Yin, 2016), SIR (Li, 1991), SAVE (Cook & Weisberg, 1991) and LAD (Cook & Forzani, 2009) with our method, ECD, using 10 slices.
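For reference, one replication under Model 1 can be generated and evaluated as follows. This is only a sketch that strings together the hypothetical estimate_cs and delta_m routines above; it is not the simulation code behind Table 1.

```python
import numpy as np


def generate_model1(n, p, seed=None):
    """Model 1: Y = (beta1'X)^2 + beta2'X + 0.1*eps with X ~ N(0, I_p),
    eps ~ N(0, 1) independent of X, beta1 = e_1 and beta2 = e_2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)
    Y = X[:, 0] ** 2 + X[:, 1] + 0.1 * eps
    return X, Y


X, Y = generate_model1(n=200, p=6, seed=1)
B_true = np.eye(6)[:, :2]                      # true basis of the CS
B_hat = estimate_cs(X, Y, d=2)                 # sketch from Section 2.4
print(delta_m(B_hat, B_true))                  # smaller is better
```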

Table 1 shows the average estimation accuracy ($\bar\Delta_m$) and its standard error (SE) under different $(n,p)$ combinations over 500 replications. Note that ECD performs consistently better than the other methods under all $(n,p)$ combinations.

Table 1. Estimation accuracy report for Model 1.

Example 3.2

This model was studied by Cui et al. (2011). It has binary responses 1 and 0, which have no numerical meaning. Model 2 is
$$P(Y=1|X)=\frac{\exp\{g(\beta_3^\top X)\}}{1+\exp\{g(\beta_3^\top X)\}},$$
where $g(\beta_3^\top X)=\exp(5\beta_3^\top X-2)/\{1+\exp(5\beta_3^\top X-3)\}-1.5$, $X\sim N(0,I_p)$ and $\beta_3=(2,1,0,\ldots,0)^\top/\sqrt5$. The simulation results are reported in Table 2.

Table 2. Estimation accuracy report for Model 2.

Example 3.3

Consider another binary-response model, Model 3: $Y=\mathrm{sign}\{\sin(\beta_4^\top X)\,\beta_5^\top X+0.2\epsilon\}$, where $X$ follows the multivariate uniform distribution $\mathrm{unif}(-2,2)^p$, $\epsilon\sim N(0,1)$, $\epsilon$ is independent of $X$, $\beta_4=(1,0,\ldots,0)^\top$ and $\beta_5=(0,1,0,\ldots,0)^\top$. The simulation results are reported in Table 3.

Table 3. Estimation accuracy report for Model 3.

From the simulation results, we find that the ECD method outperforms the other methods when the response is continuous. When the response is categorical, it also performs better than SIR, SAVE and LAD, and its performance is comparable to that of DCOV. More specifically, the accuracy of ECD and DCOV becomes very close as the sample size $n$ gets large when the response $Y$ is categorical. On the other hand, the computation of ECD is faster than that of DCOV because of the slicing technique used in calculating $C_n^2(\beta^\top X|Y)$. For example, when $(n,p)=(200,6)$, ECD is about 2.7 times faster than DCOV under Models 1 and 2, and about 3.6 times faster under Model 3. Overall, ECD is superior to the other methods.

Example 3.4

Estimating $d$. We test the performance of the kNN procedure in Section 2.6 on Model 1 and Model 2. Table 4 shows that the kNN procedure estimates the dimension $d$ very precisely, whether the response is continuous or categorical.

Table 4. Accuracy of estimating d with kNN procedure.

4. Real data analysis

To further investigate the performance of our method, we apply it to the Pen Digit database from the UCI machine-learning repository. The data contain 10,992 samples of hand-written digits $\{0,1,\ldots,9\}$, collected from 44 writers, each of whom was asked to write 250 random digits. Every digit is represented by a 16-dimensional feature vector. The 44 writers are divided into two groups: 30 are used for training and the remaining 14 for testing. The data set and more details are available at archive.ics.uci.edu/ml/machine-learning-databases/pendigits/.

We choose the 0's, 6's and 9's, three digits that are hard to classify, as an illustration. In this subset of the database, there are 2,219 cases in the training data and 1,035 cases in the test data. We apply the dimension reduction methods to the 16-dimensional predictor vector for the training set, which serves as a preparatory step for the three-group classification problem. Because the response has three slices, SIR estimates only two directions in the dimension reduction subspace. The other methods, SAVE, DCOV and ECD, all estimate three directions. Figure 1 presents the two-dimensional plot of (SIR1, SIR2) and Figure 2 shows the three-dimensional plots of (SAVE1, SAVE2, SAVE3), (DCOV1, DCOV2, DCOV3) and (ECD1, ECD2, ECD3). SIR provides only location separation of the three groups. SAVE suggests covariance differences among the three groups, but no clear location separation. Both DCOV and ECD capture the location separation and the covariance differences, but ECD presents a clearer separation among the three groups. The three-dimensional plot of (ECD1, ECD2, ECD3) gives a comprehensive demonstration of the different features of the three groups.
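The sketch below indicates how the subset could be prepared and projected onto three estimated directions for a plot in the spirit of Figure 2. The file name pendigits.tra and the comma-separated layout with the digit label in the last column are assumptions about the repository files, and estimate_cs is the hypothetical sketch from Section 2.4, not the implementation used in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)

# Assumed layout: 16 integer features followed by the digit label, comma separated.
train = np.loadtxt("pendigits.tra", delimiter=",")
keep = np.isin(train[:, -1], [0, 6, 9])
X, Y = train[keep, :-1], train[keep, -1].astype(int)

B_hat = estimate_cs(X, Y, d=3, slice_ids=Y)    # the three classes serve as slices
Z = X @ B_hat

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for digit in (0, 6, 9):
    m = Y == digit
    ax.scatter(Z[m, 0], Z[m, 1], Z[m, 2], s=5, label=str(digit))
ax.set_xlabel("ECD1"); ax.set_ylabel("ECD2"); ax.set_zlabel("ECD3")
ax.legend()
plt.show()
```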

Figure 1. 2D-plot for the two predictors estimated by SIR.


Figure 2. 3D-plots for the three predictors estimated by SAVE, DCOV and ECD.


5. Discussion

In this article, we proposed a new sufficient dimension reduction method. We studied its asymptotic properties and introduced a kNN procedure to estimate the structural dimension $d$. The numerical studies show that our method can estimate the CS accurately and efficiently. In the future, we plan to develop a variable selection method by combining our method with penalization methods such as the LASSO (Tibshirani, 1996). Furthermore, the method can be extended to large-$p$-small-$n$ problems using the framework of Yin and Hilafu (2015).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Bellman, R. (1961). Adaptive control processes. Princeton University Press.
  • Byrd, R. H., Gilbert, J. C., & Nocedal, J. (2000). A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming, 89(1), 149–185. https://doi.org/10.1007/PL00011391
  • Byrd, R. H., Mary, E. H., & Nocedal, J. (1999). An interior point algorithm for large-scale nonlinear programming. SIAM Journal on Optimization, 9(4), 877–900. https://doi.org/10.1137/S1052623497325107
  • Cook, R. D. (1994). Using dimension-reduction subspaces to identify important inputs in models of physical systems. Proc. Phys. Eng. Sci. Sect. (pp. 18–25).
  • Cook, R. D. (1996). Graphics for regressions with a binary response. Journal of the American Statistical Association, 91(435), 983–992. https://doi.org/10.1080/01621459.1996.10476968
  • Cook, R. D. (1998). Regression graphics: ideas for studying regressions through graphics. Wiley.
  • Cook, R. D., & Forzani, L. (2009). Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104(485), 197–208. https://doi.org/10.1198/jasa.2009.0106
  • Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. Journal of the American Statistical Association, 100(470), 410–428. https://doi.org/10.1198/016214504000001501
  • Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: comment. Journal of the American Statistical Association, 86(414), 328–332.
  • Cook, R. D., & Zhang, X. (2014). Fused estimators of the central subspace in sufficient dimension reduction. Journal of the American Statistical Association, 109(506), 815–827. https://doi.org/10.1080/01621459.2013.866563
  • Cui, X., Härdle, W., & Zhu, L. (2011). The EFM approach for single-index models. The Annals of Statistics, 12(3), 793–815.
  • Dong, Y., & Li, B. (2010). Dimension reduction for non-elliptically distributed predictors: second-order methods. Biometrika, 97(2), 279–294. https://doi.org/10.1093/biomet/asq016
  • Fung, W., He, X., Liu, L., & Shi, P. (2002). Dimension reduction based on canonical correlation. Statistica Sinica, 12(4), 1093–1113.
  • Härdle, W., & Stoker, T. (1989). Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association, 84(408), 986–995.
  • Hristache, M., Juditsky, A., Polzehl, J., & Spokoiny, V. (2001). Structure adaptive approach for dimension reduction. The Annals of Statistics, 29(6), 1537–1811. https://doi.org/10.1214/aos/1015345954
  • Lehmann, E. L. (1999). Elements of large-sample theory. Springer-Verlag.
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
  • Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420), 1025–1039. https://doi.org/10.1080/01621459.1992.10476258
  • Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
  • Li, L., & Yin, X. (2009). Longitudinal data analysis using sufficient dimension reduction method. Computational Statistics and Data Analysis, 53(12), 4106–4115. https://doi.org/10.1016/j.csda.2009.04.018
  • Li, B., Zha, H., & Chiaromonte, F. (2005). Contour regression: a general approach to dimension reduction. The Annals of Statistics, 33(4), 1580–1616. https://doi.org/10.1214/009053605000000192
  • Luo, W., & Li, B. (2016). Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 103(4), 875–887. https://doi.org/10.1093/biomet/asw051
  • Luo, R., Wang, H., & Tsai, C. L. (2009). Contour projected dimension reduction. The Annals of Statistics, 37(6B), 3743–3778. https://doi.org/10.1214/08-AOS679
  • Ma, Y., & Zhu, L. (2013). Efficient estimation in sufficient dimension reduction. The Annals of Statistics, 41(1), 250–268. https://doi.org/10.1214/12-AOS1072
  • Powell, J., Stock, J., & Stoker, T. (1989). Semiparametric estimation of index coefficients. Econometrica: Journal of the Econometric Society, 57(6), 1403–1430. https://doi.org/10.2307/1913713
  • Serfling, R. J. (1980). Approximation theorems of mathematical statistics. Wiley.
  • Sheng, W., & Yin, X. (2013). Direction estimation in single-index models via distance covariance. Journal of Multivariate Analysis, 122, 148–161. https://doi.org/10.1016/j.jmva.2013.07.003
  • Sheng, W., & Yin, X. (2016). Sufficient dimension reduction via distance covariance. Journal of Computational and Graphical Statistics, 25(1), 91–104. https://doi.org/10.1080/10618600.2015.1026601
  • Sheng, W., & Yuan, Q. (2020). Sufficient dimension folding in regression via distance covariance for matrix-valued predictors. Statistical Analysis and Data Mining, 13(1), 71–82. https://doi.org/10.1002/sam.v13.1
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/rssb.1996.58.issue-1
  • Waltz, R. A., Morales, J. L., & Orban, D. (2006). An interior algorithm for nonlinear optimization that combines line search and trust region steps. Mathematical Programming, 107(3), 391–408. https://doi.org/10.1007/s10107-004-0560-5
  • Wang, H., & Xia, Y. (2008). Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482), 811–821. https://doi.org/10.1198/016214508000000418
  • Wang, Q., Yin, X., & Critchley, F. (2015). Dimension reduction based on the Hellinger integral. Biometrika, 102(1), 95–106. https://doi.org/10.1093/biomet/asu062
  • Xia, Y. (2007). A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6), 2654–2690. https://doi.org/10.1214/009053607000000352
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/rssb.2002.64.issue-3
  • Ye, Z., & Weiss, R. E. (2003). Using the bootstrap to select one of a new class of dimension reduction methods. Journal of the American Statistical Association, 98(464), 968–979. https://doi.org/10.1198/016214503000000927
  • Yin, X., & Hilafu, H. (2015). Sequential sufficient dimension reduction for large p, small n problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4), 879–892. https://doi.org/10.1111/rssb.2015.77.issue-4
  • Yin, X., Li, B., & Cook, R. D. (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99(8), 1733–1757. https://doi.org/10.1016/j.jmva.2008.01.006
  • Yin, X., & Yuan, Q. (2020). A new class of measures for testing independence. Statistica Sinica, 30(4), 2131–2154.
  • Zeng, P., & Zhu, Y. (2010). An integral transform method for estimating the central mean and central subspace. Journal of Multivariate Analysis, 101(1), 271–290. https://doi.org/10.1016/j.jmva.2009.08.004
  • Zhu, L., & Fang, K. (1996). Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3), 1053–1068. https://doi.org/10.1214/aos/1032526955
  • Zhu, Y., & Zeng, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. Journal of the American Statistical Association, 101(476), 1638–1651. https://doi.org/10.1198/016214506000000140

Appendix A

Proofs of Propositions 2.1 and 2.2

In order to prove Propositions 2.1 and 2.2 in Section 2.3 of the article, we first provide and prove the following Lemma A.1.

Lemma A.1

Suppose $\eta$ is a basis of the central subspace. Let $(\eta_1,\eta_2)$ be any partition of $\eta$, where $\eta^\top\Sigma_X\eta=I_d$. We have $C^2(\eta_i^\top X|Y)<C^2(\eta^\top X|Y)$, $i=1,2$.

Proof

Let $\tilde X_1=\eta_1^\top X$, $\tilde X_2=\eta_2^\top X$ and $F(a,b)=C^2\big((a\tilde X_1^\top,b\tilde X_2^\top)^\top\mid Y\big)$ for $a\in\mathbb R$ and $b\in\mathbb R$, and let $G_1(a,b)=\partial F(a,b)/\partial a$ and $G_2(a,b)=\partial F(a,b)/\partial b$. A simple calculation shows that $aG_1(a,b)+bG_2(a,b)=F(a,b)$.

If neither $\eta_1^\top X$ nor $\eta_2^\top X$ is independent of $Y$, then $F(0,1)>0$ and $F(1,0)>0$; otherwise, the conclusion automatically holds.

Claim: if $0\le\lambda<1$, then $F(1,\lambda)<F(1,1)$ and $F(\lambda,1)<F(1,1)$.

If not, then there exists a $\lambda_0$ with $0\le\lambda_0<1$ such that $F(1,\lambda_0)\ge F(1,1)$ or $F(\lambda_0,1)\ge F(1,1)$. Without loss of generality, we assume there exists a $\lambda_0$ with $0\le\lambda_0<1$ such that $F(1,\lambda_0)\ge F(1,1)$.

But $F(1,\lambda)=\lambda F(\tfrac1\lambda,1)$, and as $\lambda\to\infty$, $F(\tfrac1\lambda,1)\to F(0,1)>0$. Thus $F(1,\lambda)\to\infty$ as $\lambda\to\infty$. That means there exists a $\lambda_1\in(\lambda_0,\infty)$ such that $F(1,\lambda_1)$ achieves a minimum over $(\lambda_0,\infty)$. Hence $G_2(1,\lambda_1)=0$. Note that $F(a,b)$ is a 'ray' function, i.e., $F(ca,cb)=cF(a,b)$. Thus, using the fact that $F(1,\lambda)=\lambda F(\tfrac1\lambda,1)$, we obtain $G_1(\tfrac{1}{\lambda_1},1)=0$, and it is easy to calculate that $G_1(1,\lambda_1)=G_1(\tfrac{1}{\lambda_1},1)=0$.

But $0=1\cdot G_1(1,\lambda_1)+\lambda_1 G_2(1,\lambda_1)=F(1,\lambda_1)$, and $F(1,\lambda_1)=0$ means that $(\tilde X_1^\top,\lambda_1\tilde X_2^\top)^\top\perp\!\!\!\perp Y$, which conflicts with our assumption.

Proof of Proposition 2.1

Since $\mathcal S(\beta)\subseteq\mathcal S(\eta)=\mathcal S_{Y|X}$ and $d_1\le d$, there exists a $d\times d_1$ matrix $A$ satisfying $\beta=\eta A$. Therefore, $C^2(\beta^\top X|Y)=C^2(A^\top\eta^\top X|Y)$.

Assume the singular value decomposition of $A$ is $U\Sigma V^\top$, where $U$ is a $d\times d$ orthogonal matrix, $V$ is a $d_1\times d_1$ orthogonal matrix, and $\Sigma$ is a $d\times d_1$ diagonal matrix with nonnegative entries on the diagonal; it is easy to prove that all the diagonal entries of $\Sigma$ equal 1. Based on Theorem 3, part (2), of Yin and Yuan (2020), $C^2(\beta^\top X|Y)=C^2(V\Sigma^\top U^\top\eta^\top X|Y)=C^2(\Sigma^\top U^\top\eta^\top X|Y)$.

Let $U^\top\eta^\top X=(\tilde X_1,\ldots,\tilde X_d)^\top$. Since all the diagonal entries of $\Sigma$ equal 1 and $\Sigma^\top U^\top\eta^\top X=(\tilde X_1,\ldots,\tilde X_{d_1})^\top$, Lemma A.1 gives $C^2(\Sigma^\top U^\top\eta^\top X|Y)\le C^2(U^\top\eta^\top X|Y)$, with equality if and only if $d=d_1$. Again by Theorem 3, part (2), of Yin and Yuan (2020), $C^2(U^\top\eta^\top X|Y)=C^2(\eta^\top X|Y)$. Thus $C^2(\beta^\top X|Y)\le C^2(\eta^\top X|Y)$, and equality holds if and only if $\mathcal S(\beta)=\mathcal S(\eta)$.

Proof of Proposition 2.2

For the $\beta$ and $\eta$ described in Proposition 2.2, there exists a rotation matrix $Q$ such that $\beta Q=(\eta_a,\eta_b)$ with $\mathcal S(\eta_a)\subseteq\mathcal S(\eta)$ and $\mathcal S(\eta_b)\subseteq\mathcal S^\perp(\eta)$, where $\mathcal S^\perp(\eta)$ is the orthogonal complement of $\mathcal S(\eta)$.

Since $Y\perp\!\!\!\perp\eta_b^\top X\mid\eta^\top X$ and $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, we have $(Y,\eta^\top X)\perp\!\!\!\perp\eta_b^\top X$, and according to Proposition 4.3 of Cook (1998), $(Y,\eta_a^\top X)\perp\!\!\!\perp\eta_b^\top X$. Let $W_1=\big((\eta_a^\top X)^\top,0\big)^\top$, $V_1=Y$, $W_2=\big(0,(\eta_b^\top X)^\top\big)^\top$ and $V_2=0$. Then $(W_1,V_1)\perp\!\!\!\perp(W_2,V_2)$. According to Theorem 1, part (2), of Yin and Yuan (2020), $C(W_1+W_2|V_1+V_2)<C(W_1|V_1)+C(W_2|V_2)$; that is, $C^2(Q^\top\beta^\top X|Y)=C^2(\beta^\top X|Y)<C^2(\eta_a^\top X|Y)\le C^2(\eta^\top X|Y)$.

Appendix B

Proof of Proposition 2.3

In order to prove Proposition 2.3 in Section 2.5 of this article, we provide and prove the following Lemma B.1 first.

Lemma B.1

If the support of $X$, say $S$, is compact and, furthermore, $\eta_n\xrightarrow{P}\eta$, then $C_n^2(\eta_n^\top X|Y)-C_n^2(\eta^\top X|Y)\xrightarrow{P}0$.

Proof

Based on Corollary 1 in Yin and Yuan (2020), we have
$$C_n^2(\eta_n^\top X|Y)=\frac{1}{n^2}\sum_{k,l=1}^{n}|\eta_n^\top X_k-\eta_n^\top X_l|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|\eta_n^\top X_{y,k_y}-\eta_n^\top X_{y,l_y}|,$$
$$C_n^2(\eta^\top X|Y)=\frac{1}{n^2}\sum_{k,l=1}^{n}|\eta^\top X_k-\eta^\top X_l|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|\eta^\top X_{y,k_y}-\eta^\top X_{y,l_y}|.$$
Because $\eta_n\to\eta$ in probability, write $\eta_n=\eta+\epsilon_n$. Then for any $\epsilon>0$, $\|\epsilon_n\|<\epsilon$ as $n\to\infty$, where $\|\cdot\|$ is the Frobenius norm. Hence, by the condition on $X$, there is a positive constant $c_x$ such that, for large $n$, $|C_n^2(\eta_n^\top X|Y)-C_n^2(\eta^\top X|Y)|\le\epsilon c_x$. Hence the conclusion follows.

Proof of Proposition 2.3

To simplify the proof, we restrict the support of $X$ to a compact set $S$; it can be shown that $\mathcal S_{Y|X}=\mathcal S_{Y|X_S}$ (Yin et al., 2008, Proposition 10), where $X_S$ is $X$ restricted to $S$. Without loss of generality, we assume $Q=I_d$. Suppose $\eta_n$ is not a consistent estimator of a basis of $\mathcal S_{Y|X}$. Then there exist a subsequence, still indexed by $n$, and an $\eta^*$ satisfying $\eta^{*\top}\Sigma_X\eta^*=I_d$ such that $\eta_n\xrightarrow{P}\eta^*$ but $\mathrm{Span}(\eta^*)\ne\mathrm{Span}(\eta)$.

By Lemma B.1, we have $C_n^2(\eta_n^\top X|Y)-C_n^2(\eta^{*\top}X|Y)\xrightarrow{P}0$, and by Lemma 3 in Yin and Yuan (2020), $C_n^2(\eta^{*\top}X|Y)\xrightarrow{a.s.}C^2(\eta^{*\top}X|Y)$. Therefore, $C_n^2(\eta_n^\top X|Y)\xrightarrow{P}C^2(\eta^{*\top}X|Y)$.

On the other hand, because $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$, we have $C_n^2(\eta_n^\top X|Y)\ge C_n^2(\eta^\top X|Y)$. Taking limits on both sides of this inequality gives $C^2(\eta^{*\top}X|Y)\ge C^2(\eta^\top X|Y)$. However, we have proved that, under the assumption $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, $\eta=\arg\max_{\beta^\top\Sigma_X\beta=I_d}C^2(\beta^\top X|Y)$, and we also assume that the central subspace is unique. Therefore $C^2(\eta^{*\top}X|Y)\ge C^2(\eta^\top X|Y)$ contradicts this, so $\eta_n$ is a consistent estimator of a basis of the central subspace.

Appendix C

Proof of Proposition 2.4

The Lagrange multiplier technique is used to prove the $\sqrt n$-consistency of $\mathrm{vec}(\eta_n)$ in Proposition 2.4 in Section 2.5 of the article. First, we introduce the following notation and conditions, and we also give a new definition.

For a random sample $(\mathbf X,\mathbf Y)=\{(X_k,Y_k): k=1,\ldots,n\}$ from the joint distribution of random vectors $X$ in $\mathbb R^p$ and $Y$ in $\mathbb R$, let
$$L(\zeta)=C^2(\beta^\top X|Y)+\lambda^\top\{\mathrm{vec}(\beta^\top\Sigma_X\beta)-\mathrm{vec}(I_d)\}\quad\text{and}\quad L_n(\zeta)=C_n^2(\beta^\top X|Y)+\lambda^\top\{\mathrm{vec}(\beta^\top\hat\Sigma_X\beta)-\mathrm{vec}(I_d)\}.$$
Here $\zeta=(\mathrm{vec}(\beta)^\top,\lambda^\top)^\top\in\mathbb R^{pd+d^2}$, $\beta\in\mathbb R^{p\times d}$, $\lambda\in\mathbb R^{d^2}$, $\Sigma_X$ is the covariance matrix of $X$, and $\hat\Sigma_X$ is its sample estimate. Let $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$. Then there exists a $\lambda_n$ such that $(\mathrm{vec}(\eta_n)^\top,\lambda_n^\top)^\top$ is a stationary point of $L_n(\zeta)$. Let $\theta_n=(\mathrm{vec}(\eta_n)^\top,\lambda_n^\top)^\top$; then $L_n'(\theta_n)=0$. Let $\eta$ be a basis of the CS. Then, under the assumption $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, there exists a rotation matrix $Q$ with $Q^\top Q=I_d$ such that $\eta Q=\arg\max_{\beta^\top\Sigma_X\beta=I_d}C^2(\beta^\top X|Y)$. Without loss of generality, we assume $Q=I_d$ here. Therefore there exists a $\lambda_0$ such that $(\mathrm{vec}(\eta)^\top,\lambda_0^\top)^\top$ is a stationary point of $L(\zeta)$. Let $\theta=(\mathrm{vec}(\eta)^\top,\lambda_0^\top)^\top$.

In the proof, we need to take derivatives of $C^2(\eta^\top X|Y)$ and $C_n^2(\eta^\top X|Y)$ with respect to $\mathrm{vec}(\eta)$. For simplicity of notation, when we consider these derivatives we write $C(\eta)$ and $C_n(\eta)$ for $C^2(\eta^\top X|Y)$ and $C_n^2(\eta^\top X|Y)$, respectively.

Here is some additional notation used in the proof. $I_{(d,d)}$ is the vec-permutation matrix. $I_m$ is the identity matrix of order $m$, and $I_m(:,i)$ denotes the $i$th column of $I_m$. $A\otimes B$ denotes the Kronecker product of matrices $A$ and $B$, and $\mathrm{vec}(\cdot)$ is the vec operator.

Furthermore, we give the following definition and assumptions.

Definition C.1

Let $\Delta(\eta)=\{\alpha:\|\alpha-\eta\|\le c\}$, where $\alpha$ is a $p\times d$ matrix with $\alpha^\top\Sigma_X\alpha=I_d$, $c$ is a fixed small constant, and $\|\cdot\|$ is the Frobenius norm. We define the indicator function
$$\rho(X,X')=\begin{cases}0, & \text{if } |\alpha^\top(X-X')|\le\epsilon_0 \text{ for some }\alpha\in\Delta(\eta),\\ 1, & \text{if } |\alpha^\top(X-X')|>\epsilon_0 \text{ for all }\alpha\in\Delta(\eta),\end{cases}$$
where $X'$ is an i.i.d. copy of $X$ and $\epsilon_0$ is a small number. We define the second and third derivatives of $C(\eta)$ with respect to $\mathrm{vec}(\eta)$ as $C''(\eta)\rho(X,X')$ and $C'''(\eta)\rho(X,X')$, and for simplicity of notation we still write $C''(\eta)$ and $C'''(\eta)$ for $C''(\eta)\rho(X,X')$ and $C'''(\eta)\rho(X,X')$, respectively.

The reason we use this definition is that, under Definition C.1, the second and third derivatives of $C(\eta)$ and $C_n(\eta)$ are bounded in a neighbourhood of the central subspace.

Assumption C.1

$\mathrm{Var}[\phi^{(1)}(X,X')]$, $\mathrm{Var}[\phi^{(2y)}(X_y,X_y')]$ for $y=1,\ldots,H$, $\mathrm{Var}[\phi^{(3)}(X)]$, $\mathrm{Var}[\phi^{(4)}(X,X')]$, $\mathrm{Var}[\phi^{(5)}(X)]$, $\mathrm{Var}[\phi^{(6)}(X,X')]$ and $\mathrm{Var}[\phi^{(7)}(X)]$ are all finite. Here
$$\phi^{(1)}(X,X')=\frac{(I_d\otimes(X-X'))(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)|},\qquad
\phi^{(2y)}(X_y,X_y')=\frac{(I_d\otimes(X_y-X_y'))(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)|},\ y=1,\ldots,H,$$
$$\phi^{(3)}(X)=(I_d\otimes XX^\top\eta)(I_{d^2}+I_{(d,d)})\lambda_0,\qquad
\phi^{(4)}(X,X')=\tfrac12\big(I_d\otimes(XX'^\top+X'X^\top)\eta\big)(I_{d^2}+I_{(d,d)})\lambda_0,$$
$$\phi^{(5)}(X)=\mathrm{vec}(\eta^\top XX^\top\eta),\qquad
\phi^{(6)}(X,X')=\tfrac12\mathrm{vec}\big(\eta^\top(XX'^\top+X'X^\top)\eta\big),\qquad
\phi^{(7)}(X)=\mathrm{vec}\big((X-EX)(X-EX)^\top\big).$$
Here $(X,Y)$ and $(X',Y')$ are i.i.d. copies, and $(X_y,Y_y)$ and $(X_y',Y_y')$ are i.i.d. copies in the $y$th slice.

Assumption C.2

The matrix
$$\begin{pmatrix} C''(\eta)+L & (I_d\otimes\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\Sigma_X) & 0\end{pmatrix}$$
is nonsingular.

Assumption C.1 is needed for Proposition 2.4 in the main article and for Lemma C.1 below; it is similar to the conditions assumed in Theorem 6.1.6 of Lehmann (1999, Ch. 6). This assumption is required by the asymptotic properties of U-statistics.

Assumption C.2 is in the spirit of the von Mises proposition (Serfling, 1980, Section 6.1), which states that if the first nonvanishing term of the Taylor expansion is the linear term, then $\sqrt n$-consistency of the differentiable statistical function can be achieved. In our case, we assume the corresponding matrix is nonsingular, which guarantees the $\sqrt n$-consistency. If the matrix is singular, then $n$- or higher-order consistency of some parts of our estimates can be proved.

In order to prove Proposition 2.4 in Section 2.5 of the paper, we provide and prove the following Lemma C.1 first.

Lemma C.1

Under Assumptions C.1 and C.2 and the assumptions in Proposition 2.4, $\sqrt n(\theta_n-\theta)\xrightarrow{D}N(0,V)$. The explicit expression for $V$ is given in the proof.

Proof

The Taylor expansion of $L_n'(\theta_n)$ at $\theta$ is
$$0=L_n'(\theta_n)=L_n'(\theta)+L_n''(\theta)(\theta_n-\theta)+R_1(\tilde\theta_n),$$
where $\|\tilde\theta_n-\theta\|\le\|\theta_n-\theta\|$, $\|\cdot\|$ is the Frobenius norm and $\tilde\theta_n=(\mathrm{vec}(\tilde\eta_n)^\top,\tilde\lambda_n^\top)^\top$. Next, we give explicit expressions for $L_n'(\theta)$, $L_n''(\theta)$ and $R_1(\tilde\theta_n)$. A simple calculation gives
$$L_n'(\theta)=\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix},\qquad
L_n''(\theta)=\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix},$$
where $\hat L=\big(\mathrm{vec}(\hat L_{11}),\mathrm{vec}(\hat L_{21}),\ldots,\mathrm{vec}(\hat L_{p1}),\ldots,\mathrm{vec}(\hat L_{1d}),\mathrm{vec}(\hat L_{2d}),\ldots,\mathrm{vec}(\hat L_{pd})\big)$ and $\hat L_{ij}=\hat\Sigma_X I_p(:,i)\lambda_0^\top(I_{d^2}+I_{(d,d)})(I_d(:,j)\otimes I_d)$. It is obvious that $\hat L\xrightarrow{a.s.}L$, where $L=\big(\mathrm{vec}(L_{11}),\mathrm{vec}(L_{21}),\ldots,\mathrm{vec}(L_{p1}),\ldots,\mathrm{vec}(L_{1d}),\mathrm{vec}(L_{2d}),\ldots,\mathrm{vec}(L_{pd})\big)$ and $L_{ij}=\Sigma_X I_p(:,i)\lambda_0^\top(I_{d^2}+I_{(d,d)})(I_d(:,j)\otimes I_d)$, for $i=1,\ldots,p$ and $j=1,\ldots,d$.

The remainder term $R_1(\tilde\theta_n)$ involves the third derivative of $L_n(\zeta)$ at $\tilde\theta_n$. Let $T_n=L_n'''(\tilde\theta_n)$, where $T_n$ is a $(pd+d^2)\times(pd+d^2)\times(pd+d^2)$ array and each $T_n(j,:,:)$, $j=1,\ldots,pd+d^2$, is a $(pd+d^2)\times(pd+d^2)$ matrix. Therefore $R_1(\tilde\theta_n)$ can be written as
$$R_1(\tilde\theta_n)=\frac12\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)(\theta_n-\theta)\\ (\theta_n-\theta)^\top T_n(2,:,:)(\theta_n-\theta)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)(\theta_n-\theta)\end{pmatrix}.$$
Based on the explicit expressions of $L_n'(\theta)$, $L_n''(\theta)$ and $R_1(\tilde\theta_n)$ above, the Taylor expansion of $L_n'(\theta_n)$ at $\theta$ can be written as
$$0=\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}+\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}(\theta_n-\theta)+\frac12\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)(\theta_n-\theta)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)(\theta_n-\theta)\end{pmatrix}.$$
Rearranging the above Taylor expansion, we get
$$-\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\sqrt n\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}$$
$$=\left[I_{pd+d^2}+\frac12\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)\\ (\theta_n-\theta)^\top T_n(2,:,:)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)\end{pmatrix}\right]\sqrt n(\theta_n-\theta).$$
Next, we prove two parts.

Part 1:
$$-\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\sqrt n\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}\xrightarrow{D}N(0,V).$$

Part 2:
$$\sqrt n(\theta_n-\theta)\stackrel{D}{=}\left[I_{pd+d^2}+\frac12\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)\\ (\theta_n-\theta)^\top T_n(2,:,:)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)\end{pmatrix}\right]\sqrt n(\theta_n-\theta).$$

Proof of part 1

We show that both $C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0$ and $\mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)$ are linear combinations of U-statistics, so that the asymptotic distribution follows from the asymptotic properties of U-statistics.

Based on Corollary 1 in Yin and Yuan (2020),
$$C_n(\eta)=\frac{1}{n^2}\sum_{k,l=1}^{n}|\eta^\top(X_k-X_l)|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|\eta^\top(X_{y,k_y}-X_{y,l_y})|.$$
With some calculation, we can get
$$C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0=\frac{n-1}{n}U_{1n}-\frac{1}{n}\sum_{y=1}^{H}(n_y-1)U_{2y}+\frac{n-1}{n}U_{3n}-\frac{n-1}{n}U_{4n},$$
where
$$U_{1n}=\binom{n}{2}^{-1}\sum_{1\le k<l\le n}\frac{(I_d\otimes(X_k-X_l))(I_d\otimes(X_k-X_l))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_k-X_l))^\top\mathrm{vec}(\eta)|},$$
$$U_{2y}=\binom{n_y}{2}^{-1}\sum_{1\le k_y<l_y\le n_y}\frac{(I_d\otimes(X_{y,k_y}-X_{y,l_y}))(I_d\otimes(X_{y,k_y}-X_{y,l_y}))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_{y,k_y}-X_{y,l_y}))^\top\mathrm{vec}(\eta)|},\quad y=1,\ldots,H,$$
$$U_{3n}=\frac1n\sum_{i=1}^{n}(I_d\otimes X_iX_i^\top\eta)(I_{d^2}+I_{(d,d)})\lambda_0,\qquad
U_{4n}=\binom{n}{2}^{-1}\sum_{i<j}\frac12\big(I_d\otimes(X_iX_j^\top+X_jX_i^\top)\eta\big)(I_{d^2}+I_{(d,d)})\lambda_0.$$
Here $U_{1n}$, $U_{2y}$ $(y=1,\ldots,H)$, $U_{3n}$ and $U_{4n}$ are U-statistics. In the notation $U_{2y}$, $y=1,\ldots,H$ and $k_y,l_y=1,\ldots,n_y$, where $H$ denotes the number of slices and $n_y$ is the number of observations in the $y$th slice.

The term $\mathrm{vec}(\eta^\top\hat\Sigma_X\eta)$ is also a linear combination of U-statistics: let
$$U_{5n}=\frac1n\sum_{i=1}^{n}\mathrm{vec}(\eta^\top X_iX_i^\top\eta),\qquad U_{6n}=\binom{n}{2}^{-1}\sum_{i<j}\frac12\mathrm{vec}\big(\eta^\top(X_iX_j^\top+X_jX_i^\top)\eta\big),$$
and then $\mathrm{vec}(\eta^\top\hat\Sigma_X\eta)=\frac{n-1}{n}U_{5n}-\frac{n-1}{n}U_{6n}$.

Let
$$\mu_1=E\left[\frac{(I_d\otimes(X-X'))(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)|}\right],\qquad
\mu_{2y}=E\left[\frac{(I_d\otimes(X_y-X_y'))(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)|}\right],\ y=1,\ldots,H,$$
$$\mu_3=E\big[(I_d\otimes XX^\top\eta)\big](I_{d^2}+I_{(d,d)})\lambda_0,\qquad
\mu_4=\big(I_d\otimes(EX)(EX)^\top\eta\big)(I_{d^2}+I_{(d,d)})\lambda_0,$$
$$\mu_5=\mathrm{vec}\big(\eta^\top(EXX^\top)\eta\big),\qquad
\mu_6=\mathrm{vec}\big(\eta^\top(EX)(EX)^\top\eta\big).$$
Here $(X,Y)$ and $(X',Y')$ are i.i.d. copies, and $(X_y,Y_y)$ and $(X_y',Y_y')$ are i.i.d. copies in the $y$th slice.

According to Theorem 6.1.6 of Lehmann (1999, Ch. 6),
$$\sqrt n\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}\xrightarrow{D}N(0,\Sigma),\qquad\text{where}\quad \Sigma=\begin{pmatrix}\Sigma_{11}&\cdots&\Sigma_{1(H+5)}\\ \vdots&\ddots&\vdots\\ \Sigma_{(H+5)1}&\cdots&\Sigma_{(H+5)(H+5)}\end{pmatrix}.$$

Let
$$B=\begin{pmatrix}I_{pd} & -\frac1H I_{pd} & \cdots & -\frac1H I_{pd} & I_{pd} & -I_{pd} & 0 & 0\\ 0 & 0 & \cdots & 0 & 0 & 0 & I_{d^2} & -I_{d^2}\end{pmatrix},$$
where the $0$ blocks in the first row are $pd\times d^2$ zero matrices and those in the second row are $d^2\times pd$ zero matrices. Then
$$\sqrt n\,B\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}=\sqrt n\begin{pmatrix}U_{1n}-\frac1H\sum_{y=1}^{H}U_{2y}+U_{3n}-U_{4n}\\ U_{5n}-U_{6n}-\mathrm{vec}(I_d)\end{pmatrix}.$$

Note that
$$\sqrt n\begin{pmatrix}C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}=\sqrt n\begin{pmatrix}\frac{n-1}{n}U_{1n}-\frac1n\sum_{y=1}^{H}(n_y-1)U_{2y}+\frac{n-1}{n}U_{3n}-\frac{n-1}{n}U_{4n}\\ \frac{n-1}{n}U_{5n}-\frac{n-1}{n}U_{6n}-\mathrm{vec}(I_d)\end{pmatrix},$$
and, under Assumption C.1,
$$\sqrt n\begin{pmatrix}\frac{n-1}{n}U_{1n}-\frac1n\sum_{y=1}^{H}(n_y-1)U_{2y}+\frac{n-1}{n}U_{3n}-\frac{n-1}{n}U_{4n}\\ \frac{n-1}{n}U_{5n}-\frac{n-1}{n}U_{6n}-\mathrm{vec}(I_d)\end{pmatrix}-\sqrt n\begin{pmatrix}U_{1n}-\frac1H\sum_{y=1}^{H}U_{2y}+U_{3n}-U_{4n}\\ U_{5n}-U_{6n}-\mathrm{vec}(I_d)\end{pmatrix}\xrightarrow{P}0.$$
Therefore, according to Slutsky's theorem,
$$\sqrt n\begin{pmatrix}C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}\stackrel{D}{=}\sqrt n\,B\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}.$$

Let
$$A_n=\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1},\qquad
A=\begin{pmatrix} C''(\eta)+L & (I_d\otimes\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\Sigma_X) & 0\end{pmatrix}^{-1}.$$

Under Assumption C.2 and our definition of the second derivative of $C_n(\eta)$, by the SLLN for U-statistics, $A_n\xrightarrow{a.s.}A$. Therefore,
$$-A_n\sqrt n\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}\stackrel{D}{=}-\sqrt n\,AB\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}\xrightarrow{D}N(0,V),$$
where $V=AB\Sigma B^\top A^\top$.

Proof of part 2

Under Assumption C.2 and Definition C.1,
$$I_{pd+d^2}+\frac12\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)\\ (\theta_n-\theta)^\top T_n(2,:,:)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)\end{pmatrix}\xrightarrow{P}I_{pd+d^2}.$$
Therefore, by Slutsky's theorem, $\sqrt n(\theta_n-\theta)$ has the same limiting distribution as the right-hand side of the identity in Part 2. Combining Parts 1 and 2 gives $\sqrt n(\theta_n-\theta)\xrightarrow{D}N(0,V)$; in other words, $\theta_n$ is a $\sqrt n$-consistent estimator of $\theta$.

In the above proof, we assumed without loss of generality that $Q=I_d$. Note that for an orthogonal matrix $Q$, $C_n^2(Q^\top\beta^\top X|Y)=C_n^2(\beta^\top X|Y)$ and $C^2(Q^\top\beta^\top X|Y)=C^2(\beta^\top X|Y)$ (Yin & Yuan, 2020). If we define $\eta_Q=\eta Q$, then without assuming $Q=I_d$, Lemma C.1 holds with $C(\eta_Q)$, obtained by replacing every $\eta$ in $C(\eta)$ with $\eta_Q$. (Of course, $C(\eta_{I_d})=C(\eta)$ in the proof above.)

Proof of Proposition 2.4

Let $G=(I_{pd},0)$ be a $pd\times(pd+d^2)$ matrix, where $I_{pd}$ is the $pd\times pd$ identity matrix. Then $\mathrm{vec}(\eta_n)=G\theta_n$ and $\mathrm{vec}(\eta_Q)=G\theta$. By Lemma C.1, we have $\sqrt n\big(\mathrm{vec}(\eta_n)-\mathrm{vec}(\eta_Q)\big)=\sqrt n\,G(\theta_n-\theta)\xrightarrow{D}N(0,V_{11}(\eta_Q))$; in other words, $\sqrt n[\mathrm{vec}(\eta_n)-\mathrm{vec}(\eta_Q)]\xrightarrow{D}N(0,V_{11}(\eta_Q))$, where $V_{11}(\eta_Q)=GV(\eta_Q)G^\top$.