Dimension reduction with expectation of conditional difference measure

Pages 188-201 | Received 26 Sep 2022, Accepted 14 Feb 2023, Published online: 13 Mar 2023

Abstract

In this article, we introduce a flexible model-free approach to sufficient dimension reduction using the expectation of conditional difference measure. Without strict conditions, such as the linearity condition or the constant covariance condition, the method estimates the central subspace exhaustively and efficiently under both linear and nonlinear relationships between the response and the predictors. The method is especially meaningful when the response is categorical. We also establish the $\sqrt{n}$-consistency and asymptotic normality of the estimator. The efficacy of our method is demonstrated through both simulations and a real data analysis.

1. Introduction

With the increase of dimensionality, the volume of the space increases so fast that the available data become sparse (Bellman, 1961). This sparsity is a problem for many statistical methods, since not enough data are available for model fitting or inference. As a result, many classical models derived from oversimplified assumptions, as well as nonparametric methods, are no longer reliable. Therefore, dimension reduction, which reduces the data dimension while retaining (sufficient) important information, can play a critical role in high-dimensional data analysis. With dimension reduction as a pre-processing step, the number of reduced dimensions is often small, so parametric and nonparametric modelling methods can then be readily applied to the reduced data.

Sufficient dimension reduction is one approach to dimension reduction. It focuses on finding a linear transformation of the predictors such that, given the transformed predictors, the response and the predictors are independent (Cook, 1994, 1996; Li, 1991). For the past 25 years, sufficient dimension reduction has been an active research topic, and many methods have been developed to estimate the central subspace (Cook, 1996). These methods can be classified into three groups: inverse, forward and joint regression methods. Inverse regression methods use the regression of $X$ on $Y$ and require certain conditions on $X$, such as the linearity condition and/or the constant covariance condition. Specific methods include sliced inverse regression (SIR; Li, 1991), sliced average variance estimation (SAVE; Cook & Weisberg, 1991) and directional regression (DR; Li & Wang, 2007); see also Cook and Ni (2005), Cook and Zhang (2014), Dong and Li (2010), Fung et al. (2002) and Zhu and Fang (1996). Forward regression methods include the minimum average variance estimation (MAVE; Xia et al., 2002) and its variants (Xia, 2007; Wang & Xia, 2008), the average derivative estimate (Härdle & Stoker, 1989; Powell et al., 1989), and the structure adaptive method (Hristache et al., 2001; Ma & Zhu, 2013). The forward methods require nonparametric approaches such as kernel smoothing. Joint regression methods use the joint distribution of $(Y,X)$; they include the principal Hessian directions (PHD; Cook, 1998; Li, 1992) and the Fourier method (Zeng & Zhu, 2010; Zhu & Zeng, 2006), and they require either smoothing techniques or stronger conditions.

In this article, we develop a new sufficient dimension reduction method based on the measure proposed in Yin and Yuan (2020) to estimate the central subspace. It involves slicing the range of $Y$ into several intervals, similar to classical inverse approaches such as SIR and SAVE, but it does not require any linearity or constant covariance condition and can exhaustively recover the central subspace without smoothing. On the other hand, compared with other sufficient dimension reduction methods based on distance measures, such as Sheng and Yin (2016), our method is more natural when the response $Y$ is categorical with no numerical meaning, because the measure used in this article is properly defined for categorical variables.

This article is organized as follows: Section 2 introduces the new sufficient dimension reduction method, the algorithm, its theoretical properties and the method for estimating the structural dimension $d$. Section 3 presents simulation studies, Section 4 presents a real data analysis, and a brief discussion follows in Section 5.

2. Methodology

2.1. A measure of divergence

Yin and Yuan (2020) proposed a new measure of divergence for testing independence between two random vectors. Let $X\in\mathbb R^p$ and $Y\in\mathbb R^q$, where $p$ and $q$ are positive integers. Then the measure between $X$ and $Y$ with finite first moments is a nonnegative number, $C(X|Y)$, defined by
$$C^2(X|Y)=\int_{\mathbb R^p}|f_{X|Y}(t)-f_X(t)|^2 w(t)\,dt,\tag{1}$$
where $f_{X|Y}$ and $f_X$ stand for the characteristic functions of $X|Y$ and $X$, respectively, and $|f|^2=f\bar f$ for a complex-valued function $f$, with $\bar f$ the conjugate of $f$. The weight function $w(t)$ is a specially chosen positive function; more details on $w(t)$ can be found in Yin and Yuan (2020). They also give an equivalent formula,
$$C^2(X|Y)=E|X-X'_{Y'}|-E|X_Y-X'_Y|=E|X-X'|-E|X_Y-X'_Y|,\tag{2}$$
where the expectations are over all the random vectors involved; for instance, the last expectation first takes the conditional expectation given $Y$ and then averages over $Y$. Here $(X',Y')$ is an independent and identically distributed copy of $(X,Y)$, $X_Y$ denotes a random variable distributed as $X|Y$, $X'_{Y'}$ denotes a random variable distributed as $X'|Y'$, and $X'_Y$ denotes a random variable distributed as $X'|Y'$ with $Y'=Y$.

One property of $C^2(X|Y)$ is that it equals 0 if and only if the two random vectors are independent (Yin & Yuan, 2020). This property makes it possible to use $C^2(X|Y)$ as a sufficient dimension reduction tool. Moreover, the measure works well for both continuous and categorical $Y$. Because $C^2(X|Y)$ is well defined for categorical $Y$, our method is particularly meaningful when the class labels of a dataset have no numerical meaning, a setting in which other measures do not enjoy a similar advantage.

2.2. Review of sufficient dimension reduction

Let $\gamma$ be a $p\times q$ matrix with $q\le p$, and let $\perp\!\!\!\perp$ denote independence. The following conditional independence statement leads to the definition of sufficient dimension reduction:
$$Y\perp\!\!\!\perp X\mid \gamma^\top X,\tag{3}$$
where (3) indicates that the regression information of $Y$ given $X$ is completely contained in the linear combinations $\gamma^\top X$. The column space of $\gamma$ in (3), denoted by $\mathcal S(\gamma)$, is called a dimension reduction subspace.

If the intersection of all dimension reduction subspaces is itself a dimension reduction subspace, it is called the central subspace (CS) and is denoted by $\mathcal S_{Y|X}$ (Cook, 1994, 1996; Li, 1991). Under mild conditions, the CS exists (Cook, 1998; Yin et al., 2008). Throughout the article, we assume the CS exists; it is then unique. Furthermore, let $d$ denote the structural dimension of the CS, and let $\Sigma_X$ be the covariance matrix of $X$, which is assumed to be nonsingular. Our primary goal is to identify the CS by estimating $d$ and a $p\times d$ basis matrix of the CS.

Here we introduce some notation needed in the following sections. Let $\beta$ be a matrix and let $\mathcal S(\beta)$ be the subspace spanned by the column vectors of $\beta$; $\dim(\mathcal S(\beta))$ is the dimension of $\mathcal S(\beta)$. $P_\beta(\Sigma_X)$ denotes the projection operator onto $\mathcal S(\beta)$ with respect to the inner product $\langle a,b\rangle=a^\top\Sigma_X b$, that is, $P_\beta(\Sigma_X)=\beta(\beta^\top\Sigma_X\beta)^{-1}\beta^\top\Sigma_X$. Let $Q_\beta(\Sigma_X)=I-P_\beta(\Sigma_X)$, where $I$ is the identity matrix.
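For concreteness, these two operators can be written in a few lines of numpy; this is only a small sketch, and the function names P and Q below are ours, not part of any package.

```python
import numpy as np


def P(beta, Sigma):
    """P_beta(Sigma) = beta (beta' Sigma beta)^{-1} beta' Sigma, the projection
    onto S(beta) with respect to the inner product <a, b> = a' Sigma b."""
    return beta @ np.linalg.solve(beta.T @ Sigma @ beta, beta.T @ Sigma)


def Q(beta, Sigma):
    """Q_beta(Sigma) = I - P_beta(Sigma)."""
    return np.eye(beta.shape[0]) - P(beta, Sigma)
```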

2.3. The new sufficient dimension reduction method

Let $\beta$ be an arbitrary $p\times d_0$ matrix, where $1\le d_0\le p$. Under mild conditions, it can be proved that solving (4) yields a basis of the central subspace:
$$\max_{\beta:\,\beta^\top\Sigma_X\beta=I_{d_0},\,1\le d_0\le p}\ C^2(\beta^\top X|Y).\tag{4}$$
Here the squared divergence between $\beta^\top X$ and $Y$ is defined as $C^2(\beta^\top X|Y)=E_Y\big[\int_{\mathbb R^{d_0}}|f_{\beta^\top X|Y}(t)-f_{\beta^\top X}(t)|^2w(t)\,dt\big]$. The conditions $E|X|<\infty$, $E|Y|<\infty$ and $E|XY|<\infty$ in Yin and Yuan (2020) guarantee that $C^2(\beta^\top X|Y)$ is finite; thus, throughout the article, we assume they hold. The constraint $\beta^\top\Sigma_X\beta=I_{d_0}$ in (4) is needed because of the property $C^2(c\beta^\top X|Y)=|c|\,C^2(\beta^\top X|Y)$ for any constant $c$ (Yin & Yuan, 2020).

The following propositions justify our estimator. They ensure that if we maximize $C^2(\beta^\top X|Y)$ with respect to $\beta$ under the constraint and some mild conditions, the solution indeed spans the CS.

Proposition 2.1

Let $\eta$ be a $p\times d$ basis matrix of the CS, and let $\beta$ be a $p\times d_1$ matrix with $d_1\le d$, $\dim(\mathcal S(\beta))=d_1$, $\eta^\top\Sigma_X\eta=I_d$ and $\beta^\top\Sigma_X\beta=I_{d_1}$. If $\mathcal S(\beta)\subseteq\mathcal S(\eta)$, then $C^2(\beta^\top X|Y)\le C^2(\eta^\top X|Y)$. The equality holds if and only if $\mathcal S(\beta)=\mathcal S(\eta)$.

Proposition 2.2

Let $\eta$ be a $p\times d$ basis matrix of the CS, and let $\beta$ be a $p\times d_2$ matrix with $\eta^\top\Sigma_X\eta=I_d$ and $\beta^\top\Sigma_X\beta=I_{d_2}$. Here $d_2$ can be greater than, less than or equal to $d$. Suppose $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$ and $\mathcal S(\beta)\not\subseteq\mathcal S(\eta)$. Then $C^2(\beta^\top X|Y)<C^2(\eta^\top X|Y)$.

Proposition 2.1 indicates that if $\mathcal S(\beta)$ is a subspace of the CS, then $C^2(\beta^\top X|Y)\le C^2(\eta^\top X|Y)$, with equality if and only if $\beta$ is a basis matrix of the CS, i.e., $\mathcal S(\beta)=\mathcal S(\eta)$. Proposition 2.2 implies that if $\mathcal S(\beta)$ is not a subspace of the CS, then $C^2(\beta^\top X|Y)<C^2(\eta^\top X|Y)$ under a mild condition. Together, the two propositions show that we can identify the CS by maximizing $C^2(\beta^\top X|Y)$ with respect to $\beta$ under the quadratic constraint. The condition in Proposition 2.2, $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, was discussed in Sheng and Yin (2013), who showed that it is not very strict and is satisfied asymptotically when $p$ is reasonably large. Proofs of Propositions 2.1 and 2.2 are given in Appendix A.

2.4. Estimating the CS when d is specified

In this section, we develop an algorithm for estimating the CS when the structural dimension $d$ is known. Let $(\mathbf X,\mathbf Y)=\{(X_i,Y_i),\ i=1,\ldots,n\}$ be a random sample from $(X,Y)$ and let $\beta$ be a $p\times d$ matrix. For the purpose of slicing, the $n$ observations can be equivalently written as $(X_{y,k_y},Y_{y,k_y})$, $y=1,\ldots,H$, $k_y=1,\ldots,n_y$, where $n_y$ is the number of observations in slice $y$. The empirical version of $C^2(\beta^\top X|Y)$, denoted by $C_n^2(\beta^\top X|Y)$, is defined through
$$C_n^2(X|Y)=\frac{1}{n^2}\sum_{k,l=1}^{n}|X_k-X_l|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|X_{y,k_y}-X_{y,l_y}|.\tag{5}$$
Here $|\cdot|$ is the Euclidean norm in the respective dimension. Let $\hat\Sigma_X$ be the estimate of $\Sigma_X$. Then an estimated basis matrix of the CS, say $\eta_n$, is
$$\eta_n=\arg\max_{\beta:\,\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y).\tag{6}$$
An outline of the algorithm is as follows.

  1. Obtain the initial value $\eta^{(0)}$: any existing sufficient dimension reduction method, such as SIR (Li, 1991) or SAVE (Cook & Weisberg, 1991), can be used to obtain the initial value.

  2. Iterations: let $\eta^{(k)}$ be the estimate of $\eta$ in the $k$th iteration. To search for $\eta^{(k+1)}$, the interior-point approach is applied: the original optimization problem in (6) is replaced by a sequence of barrier subproblems, which are solved approximately using two powerful tools, sequential quadratic programming and trust-region techniques. In this process, one of two main types of steps is used at each iteration: a direct step or a conjugate gradient step. By default, the algorithm tries a direct step first; if a direct step fails, it attempts a conjugate gradient step. More extensive descriptions of the interior-point approach are given in Byrd et al. (1999, 2000) and Waltz et al. (2006).

  3. Check convergence: if the difference between $\eta^{(k)}$ and $\eta^{(k+1)}$ is smaller than a preset tolerance, such as $10^{-6}$, stop the iteration and set $\eta_n=\eta^{(k+1)}$; otherwise, set $k:=k+1$ and go to step 2.

In the above algorithm, we assume the structural dimension $d$ is known, which is not the case in practice. We propose an approach for estimating $d$ in Section 2.6.
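The following Python sketch illustrates the computation of the sample objective $C_n^2(\beta^\top X|Y)$ in (5) and the constrained maximization in (6) under simplifying assumptions: it uses a generic trust-region constrained solver from scipy in place of the interior-point routine described above, a random starting value rather than a SIR or SAVE estimate, and quantile-based slicing of a continuous response. The function names ecd_objective and estimate_cs are ours, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.spatial.distance import pdist, squareform


def ecd_objective(B, X, slice_ids):
    """Sample objective C_n^2(B'X | Y) of (5): the overall mean pairwise distance
    of the reduced predictors minus the averaged within-slice mean pairwise distance."""
    Z = X @ B                                  # n x d reduced predictors
    D = squareform(pdist(Z))                   # n x n Euclidean distance matrix
    n = X.shape[0]
    total = D.sum() / n ** 2
    within = 0.0
    for y in np.unique(slice_ids):
        idx = np.where(slice_ids == y)[0]
        within += D[np.ix_(idx, idx)].sum() / len(idx)
    return total - within / n


def estimate_cs(X, Y, d, n_slices=10, slice_ids=None, seed=0):
    """Estimate a p x d basis of the CS by maximizing C_n^2(B'X | Y)
    subject to B' Sigma_hat B = I_d, as in (6) (a sketch)."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    if slice_ids is None:
        # slice a continuous response into n_slices groups;
        # for a categorical response, pass its labels via slice_ids instead
        cuts = np.quantile(Y, np.linspace(0, 1, n_slices + 1)[1:-1])
        slice_ids = np.digitize(Y, cuts)
    rng = np.random.default_rng(seed)
    B0 = rng.standard_normal((p, d))           # random start (the paper uses SIR/SAVE)

    def neg_obj(b):
        return -ecd_objective(b.reshape(p, d), X, slice_ids)

    def constraint(b):
        B = b.reshape(p, d)
        return (B.T @ Sigma @ B - np.eye(d)).ravel()

    res = minimize(neg_obj, B0.ravel(), method="trust-constr",
                   constraints=NonlinearConstraint(constraint, 0.0, 0.0))
    return res.x.reshape(p, d)
```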

2.5. Theoretical properties

Proposition 2.3

Let $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$, and let $\eta$ be a basis matrix of the CS with $\eta^\top\Sigma_X\eta=I_d$. Under the condition $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, $\eta_n$ is a consistent estimator of a basis of the CS; that is, there exists a rotation matrix $Q$ with $Q^\top Q=I_d$ such that $\eta_n\xrightarrow{P}\eta Q$.

Furthermore, we can prove the $\sqrt n$-consistency and asymptotic normality of the estimator, as stated below.

Proposition 2.4

Let $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$, and let $\eta$ be a basis matrix of the CS with $\eta^\top\Sigma_X\eta=I_d$. Under the regularity conditions in the supplementary file, there exists a rotation matrix $Q$ with $Q^\top Q=I_d$ such that $\sqrt n[\mathrm{vec}(\eta_n)-\mathrm{vec}(\eta Q)]\xrightarrow{D}N(0,V_{11}(\eta Q))$, where $V_{11}(\eta Q)$ is the covariance matrix given in the supplementary file.

Proofs of Propositions 2.3 and 2.4 are in Appendices B and C, respectively.

2.6. Estimating structural dimension d

There is a rich literature on determining $d$ in sufficient dimension reduction, including nonparametric methods such as Wang and Xia (2008), Ye and Weiss (2003) and Luo and Li (2016), and eigen-decomposition-based methods such as Luo et al. (2009) and Wang et al. (2015). Here we apply the kNN method proposed in Wang et al. (2015).

Given a sample $\{(X_i,Y_i),\ 1\le i\le n\}$, $d$ can be estimated by the following kNN procedure.

  1. Find the $k$-nearest neighbours of each data point $(X_i,Y_i)$ using the Euclidean distance. Denote the $k$-nearest neighbours of $(X_i,Y_i)$ by $(X_i^{(j)},Y_i^{(j)})$, $1\le j\le k$.

  2. For each data point $(X_i,Y_i)$, apply the method proposed in this article to its $k$-nearest neighbours and estimate $\hat\beta_i$. Here the dimension of $\hat\beta_i$ is set to 1.

  3. Calculate the eigenvalues of the matrix $\sum_{i=1}^{n}\hat\beta_i\hat\beta_i^\top$ and order them as $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$.

  4. Calculate the ratios $r_i=\lambda_i/\lambda_{i+1}$, $1\le i\le p-1$. The dimension $d$ is estimated as the index $i$ at which the largest ratio $r_i$ occurs.

In the last step, this maximal eigenvalue ratio criterion was suggested by Luo et al. (2009) and was also used by Li and Yin (2009) and Sheng and Yuan (2020).
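A compact sketch of this procedure is given below. It reuses the hypothetical estimate_cs routine from Section 2.4 with the local dimension fixed at one; the neighbourhood size k and any slicing options passed through fit_kwargs are tuning choices of ours, not prescriptions from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree


def estimate_d(X, Y, k=30, **fit_kwargs):
    """Estimate the structural dimension d by the kNN / maximal eigenvalue
    ratio procedure of Section 2.6 (a sketch)."""
    n, p = X.shape
    XY = np.column_stack([X, np.asarray(Y, dtype=float).reshape(n, -1)])
    _, nbrs = cKDTree(XY).query(XY, k=k)       # k nearest neighbours (point itself included)
    M = np.zeros((p, p))
    for i in range(n):
        idx = nbrs[i]
        b_i = estimate_cs(X[idx], Y[idx], d=1, **fit_kwargs)   # one-dimensional local fit
        M += b_i @ b_i.T
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]                 # lambda_1 >= ... >= lambda_p
    ratios = lam[:-1] / lam[1:]                                # r_i = lambda_i / lambda_{i+1}
    return int(np.argmax(ratios)) + 1
```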

3. Simulation studies

Estimation accuracy is measured by the distance $\Delta_m(\hat{\mathcal S},\mathcal S)=\|P_{\hat{\mathcal S}}-P_{\mathcal S}\|$ (Li et al., 2005), where $\mathcal S$ is the true $d$-dimensional CS of $\mathbb R^p$, $\hat{\mathcal S}$ is its estimate, $P_{\mathcal S}$ and $P_{\hat{\mathcal S}}$ are the orthogonal projections onto $\mathcal S$ and $\hat{\mathcal S}$, respectively, and $\|\cdot\|$ is the maximum singular value of a matrix. The smaller $\Delta_m$ is, the better the estimate; a method also works better if it has a smaller standard error of $\Delta_m$. In the following, the first three examples show the good performance of the proposed method for both continuous and categorical responses, assuming the dimension $d$ is known. The last example illustrates the performance of estimating the dimension $d$ using the kNN procedure of Section 2.6.
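This criterion is easy to compute from two basis matrices. A minimal sketch follows: the orthogonal projections are formed through QR factorizations, and delta_m is our name for the function.

```python
import numpy as np


def delta_m(B_hat, B):
    """Largest singular value of P_hat - P, where P_hat and P are the
    orthogonal projections onto the column spaces of B_hat and B."""
    def orth_proj(A):
        Q, _ = np.linalg.qr(A)
        return Q @ Q.T
    return np.linalg.norm(orth_proj(B_hat) - orth_proj(B), ord=2)
```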

Example 3.1

Consider Model 1: $Y=(\beta_1^\top X)^2+\beta_2^\top X+0.1\epsilon$, where $X\sim N(0,I_p)$, $\epsilon\sim N(0,1)$ and $\epsilon$ is independent of $X$; $\beta_1=(1,0,\ldots,0)^\top$ and $\beta_2=(0,1,\ldots,0)^\top$. We compare DCOV (Sheng & Yin, 2016), SIR (Li, 1991), SAVE (Cook & Weisberg, 1991) and LAD (Cook & Forzani, 2009) with our method, ECD, using 10 slices.
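For reference, one replication under Model 1 can be generated and evaluated as follows. This is only a sketch that strings together the hypothetical estimate_cs and delta_m routines above; it is not the simulation code behind Table 1.

```python
import numpy as np


def generate_model1(n, p, seed=None):
    """Model 1: Y = (beta1'X)^2 + beta2'X + 0.1*eps with X ~ N(0, I_p),
    eps ~ N(0, 1) independent of X, beta1 = e_1 and beta2 = e_2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)
    Y = X[:, 0] ** 2 + X[:, 1] + 0.1 * eps
    return X, Y


X, Y = generate_model1(n=200, p=6, seed=1)
B_true = np.eye(6)[:, :2]                      # true basis of the CS
B_hat = estimate_cs(X, Y, d=2)                 # sketch from Section 2.4
print(delta_m(B_hat, B_true))                  # smaller is better
```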

Table 1 shows the average estimation accuracy ($\bar\Delta_m$) and its standard error (SE) under different $(n,p)$ combinations over 500 replications. Note that ECD performs consistently better than the other methods under all $(n,p)$ combinations.

Table 1. Estimation accuracy report for Model 1.

Example 3.2

This model was studied by Cui et al. (2011). It has binary responses 1 and 0, which have no numerical meaning. Model 2 is
$$P(Y=1|X)=\frac{\exp\{g(\beta_3^\top X)\}}{1+\exp\{g(\beta_3^\top X)\}},$$
where $g(\beta_3^\top X)=\exp(5\beta_3^\top X-2)/\{1+\exp(5\beta_3^\top X-3)\}-1.5$, $X\sim N(0,I_p)$ and $\beta_3=(2,1,0,\ldots,0)^\top/\sqrt5$. The simulation results are reported in Table 2.

Table 2. Estimation accuracy report for Model 2.

Example 3.3

Consider another binary-response model, Model 3: $Y=\mathrm{sign}\{\sin(\beta_4^\top X)\,\beta_5^\top X+0.2\epsilon\}$, where $X$ follows the multivariate uniform distribution $\mathrm{unif}(-2,2)^p$, $\epsilon\sim N(0,1)$, $\epsilon$ is independent of $X$, $\beta_4=(1,0,\ldots,0)^\top$ and $\beta_5=(0,1,0,\ldots,0)^\top$. The simulation results are reported in Table 3.

Table 3. Estimation accuracy report for Model 3.

From the simulation results, we find that the ECD method outperforms the other methods when the response is continuous. When the response is categorical, it also performs better than SIR, SAVE and LAD, and its performance is comparable to that of DCOV. More specifically, the accuracy of ECD and DCOV becomes very close as the sample size $n$ gets large when the response $Y$ is categorical. On the other hand, the computation of ECD is faster than that of DCOV because of the slicing technique used in calculating $C_n^2(\beta^\top X|Y)$. For example, when $(n,p)=(200,6)$, ECD is about 2.7 times faster than DCOV under Models 1 and 2, and about 3.6 times faster under Model 3. Overall, ECD is superior to the other methods.

Example 3.4

Estimating $d$. We test the performance of the kNN procedure in Section 2.6 on Model 1 and Model 2. Table 4 shows that the kNN procedure estimates the dimension $d$ very precisely, whether the response is continuous or categorical.

Table 4. Accuracy of estimating d with kNN procedure.

4. Real data analysis

To further investigate the performance of our method, we apply it to the Pen Digit database from the UCI machine-learning repository. The data contain 10,992 samples of hand-written digits $\{0,1,\ldots,9\}$, collected from 44 writers, each of whom was asked to write 250 random digits. Every digit is represented by a 16-dimensional feature vector. The 44 writers are divided into two groups: 30 are used for training and the remaining 14 for testing. The data set and more details are available at archive.ics.uci.edu/ml/machine-learning-databases/pendigits/.

We choose the 0's, 6's and 9's, three digits that are hard to classify, as an illustration. In this subset of the database, there are 2,219 cases in the training data and 1,035 cases in the test data. We apply the dimension reduction methods to the 16-dimensional predictor vector for the training set, which serves as a preparatory step for the three-group classification problem. Because the response has three slices, SIR estimates only two directions in the dimension reduction subspace. The other methods, SAVE, DCOV and ECD, all estimate three directions. Figure 1 presents the two-dimensional plot of (SIR1, SIR2) and Figure 2 shows the three-dimensional plots of (SAVE1, SAVE2, SAVE3), (DCOV1, DCOV2, DCOV3) and (ECD1, ECD2, ECD3). SIR provides only location separation of the three groups. SAVE suggests covariance differences among the three groups, but no clear location separation. Both DCOV and ECD capture the location separation and the covariance differences, but ECD presents a clearer separation among the three groups. The three-dimensional plot of (ECD1, ECD2, ECD3) gives a comprehensive demonstration of the different features of the three groups.
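The sketch below indicates how the subset could be prepared and projected onto three estimated directions for a plot in the spirit of Figure 2. The file name pendigits.tra and the comma-separated layout with the digit label in the last column are assumptions about the repository files, and estimate_cs is the hypothetical sketch from Section 2.4, not the implementation used in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)

# Assumed layout: 16 integer features followed by the digit label, comma separated.
train = np.loadtxt("pendigits.tra", delimiter=",")
keep = np.isin(train[:, -1], [0, 6, 9])
X, Y = train[keep, :-1], train[keep, -1].astype(int)

B_hat = estimate_cs(X, Y, d=3, slice_ids=Y)    # the three classes serve as slices
Z = X @ B_hat

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for digit in (0, 6, 9):
    m = Y == digit
    ax.scatter(Z[m, 0], Z[m, 1], Z[m, 2], s=5, label=str(digit))
ax.set_xlabel("ECD1"); ax.set_ylabel("ECD2"); ax.set_zlabel("ECD3")
ax.legend()
plt.show()
```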

Figure 1. 2D-plot for the two predictors estimated by SIR.


Figure 2. 3D-plots for the three predictors estimated by SAVE, DCOV and ECD.


5. Discussion

In this article, we proposed a new sufficient dimension reduction method. We studied its asymptotic properties and introduced a kNN procedure to estimate the structural dimension $d$. The numerical studies show that our method can estimate the CS accurately and efficiently. In the future, we plan to develop a variable selection method by combining our method with penalization methods such as the LASSO (Tibshirani, 1996). Furthermore, the method can be extended to large-$p$-small-$n$ problems using the framework of Yin and Hilafu (2015).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Bellman, R. (1961). Adaptive control processes. Princeton University Press.
  • Byrd, R. H., Gilbert, J. C., & Nocedal, J. (2000). A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming, 89(1), 149–185. https://doi.org/10.1007/PL00011391
  • Byrd, R. H., Mary, E. H., & Nocedal, J. (1999). An interior point algorithm for large-scale nonlinear programming. SIAM Journal on Optimization, 9(4), 877–900. https://doi.org/10.1137/S1052623497325107
  • Cook, R. D. (1994). Using dimension-reduction subspaces to identify important inputs in models of physical systems. Proc. Phys. Eng. Sci. Sect. (pp. 18–25).
  • Cook, R. D. (1996). Graphics for regressions with a binary response. Journal of the American Statistical Association, 91(435), 983–992. https://doi.org/10.1080/01621459.1996.10476968
  • Cook, R. D. (1998). Regression graphics: ideas for studying regressions through graphics. Wiley.
  • Cook, R. D., & Forzani, L. (2009). Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104(485), 197–208. https://doi.org/10.1198/jasa.2009.0106
  • Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. Journal of the American Statistical Association, 100(470), 410–428. https://doi.org/10.1198/016214504000001501
  • Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: comment. Journal of the American Statistical Association, 86(414), 328–332.
  • Cook, R. D., & Zhang, X. (2014). Fused estimators of the central subspace in sufficient dimension reduction. Journal of the American Statistical Association, 109(506), 815–827. https://doi.org/10.1080/01621459.2013.866563
  • Cui, X., Härdle, W., & Zhu, L. (2011). The EFM approach for single-index models. The Annals of Statistics, 12(3), 793–815.
  • Dong, Y., & Li, B. (2010). Dimension reduction for non-elliptically distributed predictors: second-order methods. Biometrika, 97(2), 279–294. https://doi.org/10.1093/biomet/asq016
  • Fung, W., He, X., Liu, L., & Shi, P. (2002). Dimension reduction based on canonical correlation. Statistica Sinica, 12(4), 1093–1113.
  • Härdle, W., & Stoker, T. (1989). Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association, 84(408), 986–995.
  • Hristache, M., Juditsky, A., Polzehl, J., & Spokoiny, V. (2001). Structure adaptive approach for dimension reduction. The Annals of Statistics, 29(6), 1537–1811. https://doi.org/10.1214/aos/1015345954
  • Lehmann, E. L. (1999). Elements of large-sample theory. Springer-Verlag.
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
  • Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420), 1025–1039. https://doi.org/10.1080/01621459.1992.10476258
  • Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
  • Li, L., & Yin, X. (2009). Longitudinal data analysis using sufficient dimension reduction method. Computational Statistics and Data Analysis, 53(12), 4106–4115. https://doi.org/10.1016/j.csda.2009.04.018
  • Li, B., Zha, H., & Chiaromonte, F. (2005). Contour regression: a general approach to dimension reduction. The Annals of Statistics, 33(4), 1580–1616. https://doi.org/10.1214/009053605000000192
  • Luo, W., & Li, B. (2016). Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 103(4), 875–887. https://doi.org/10.1093/biomet/asw051
  • Luo, R., Wang, H., & Tsai, C. L. (2009). Contour projected dimension reduction. The Annals of Statistics, 37(6B), 3743–3778. https://doi.org/10.1214/08-AOS679
  • Ma, Y., & Zhu, L. (2013). Efficient estimation in sufficient dimension reduction. The Annals of Statistics, 41(1), 250–268. https://doi.org/10.1214/12-AOS1072
  • Powell, J., Stock, J., & Stoker, T. (1989). Semiparametric estimation of index coefficients. Econometrica: Journal of the Econometric Society, 57(6), 1403–1430. https://doi.org/10.2307/1913713
  • Serfling, R. J. (1980). Approximation theorems of mathematical statistics. Wiley.
  • Sheng, W., & Yin, X. (2013). Direction estimation in single-index models via distance covariance. Journal of Multivariate Analysis, 122, 148–161. https://doi.org/10.1016/j.jmva.2013.07.003
  • Sheng, W., & Yin, X. (2016). Sufficient dimension reduction via distance covariance. Journal of Computational and Graphical Statistics, 25(1), 91–104. https://doi.org/10.1080/10618600.2015.1026601
  • Sheng, W., & Yuan, Q. (2020). Sufficient dimension folding in regression via distance covariance for matrix-valued predictors. Statistical Analysis and Data Mining, 13(1), 71–82. https://doi.org/10.1002/sam.v13.1
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/rssb.1996.58.issue-1
  • Waltz, R. A., Morales, J. L., & Orban, D. (2006). An interior algorithm for nonlinear optimization that combines line search and trust region steps. Mathematical Programming, 107(3), 391–408. https://doi.org/10.1007/s10107-004-0560-5
  • Wang, H., & Xia, Y. (2008). Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482), 811–821. https://doi.org/10.1198/016214508000000418
  • Wang, Q., Yin, X., & Critchley, F. (2015). Dimension reduction based on the Hellinger integral. Biometrika, 102(1), 95–106. https://doi.org/10.1093/biomet/asu062
  • Xia, Y. (2007). A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6), 2654–2690. https://doi.org/10.1214/009053607000000352
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/rssb.2002.64.issue-3
  • Ye, Z., & Weiss, R. E. (2003). Using the bootstrap to select one of a new class of dimension reduction methods. Journal of the American Statistical Association, 98(464), 968–979. https://doi.org/10.1198/016214503000000927
  • Yin, X., & Hilafu, H. (2015). Sequential sufficient dimension reduction for large p, small n problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4), 879–892. https://doi.org/10.1111/rssb.2015.77.issue-4
  • Yin, X., Li, B., & Cook, R. D. (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99(8), 1733–1757. https://doi.org/10.1016/j.jmva.2008.01.006
  • Yin, X., & Yuan, Q. (2020). A new class of measures for testing independence. Statistica Sinica, 30(4), 2131–2154.
  • Zeng, P., & Zhu, Y. (2010). An integral transform method for estimating the central mean and central subspace. Journal of Multivariate Analysis, 101(1), 271–290. https://doi.org/10.1016/j.jmva.2009.08.004
  • Zhu, L., & Fang, K. (1996). Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3), 1053–1068. https://doi.org/10.1214/aos/1032526955
  • Zhu, Y., & Zeng, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. Journal of the American Statistical Association, 101(476), 1638–1651. https://doi.org/10.1198/016214506000000140

Appendix A

Proofs of Propositions 2.1 and 2.2

In order to prove Propositions 2.1 and 2.2 in Section 2.3 of the article, we first provide and prove the following Lemma A.1.

Lemma A.1

Suppose $\eta$ is a basis of the central subspace. Let $(\eta_1,\eta_2)$ be any partition of $\eta$, where $\eta^\top\Sigma_X\eta=I_d$. We have $C^2(\eta_i^\top X|Y)<C^2(\eta^\top X|Y)$, $i=1,2$.

Proof

Let $\tilde X_1=\eta_1^\top X$, $\tilde X_2=\eta_2^\top X$ and $F(a,b)=C^2\big((a\tilde X_1^\top,b\tilde X_2^\top)^\top\mid Y\big)$ for $a\in\mathbb R$ and $b\in\mathbb R$, and let $G_1(a,b)=\partial F(a,b)/\partial a$ and $G_2(a,b)=\partial F(a,b)/\partial b$. A simple calculation shows that $aG_1(a,b)+bG_2(a,b)=F(a,b)$.

If neither $\eta_1^\top X$ nor $\eta_2^\top X$ is independent of $Y$, then $F(0,1)>0$ and $F(1,0)>0$; otherwise, the conclusion automatically holds.

Claim: if $0\le\lambda<1$, then $F(1,\lambda)<F(1,1)$ and $F(\lambda,1)<F(1,1)$.

If not, then there exists a $\lambda_0$ with $0\le\lambda_0<1$ such that $F(1,\lambda_0)\ge F(1,1)$ or $F(\lambda_0,1)\ge F(1,1)$. Without loss of generality, we assume there exists a $\lambda_0$ with $0\le\lambda_0<1$ such that $F(1,\lambda_0)\ge F(1,1)$.

But $F(1,\lambda)=\lambda F(\tfrac1\lambda,1)$, and as $\lambda\to\infty$, $F(\tfrac1\lambda,1)\to F(0,1)>0$. Thus $F(1,\lambda)\to\infty$ as $\lambda\to\infty$. That means there exists a $\lambda_1\in(\lambda_0,\infty)$ such that $F(1,\lambda_1)$ achieves a minimum over $(\lambda_0,\infty)$. Hence $G_2(1,\lambda_1)=0$. Note that $F(a,b)$ is a 'ray' function, i.e., $F(ca,cb)=cF(a,b)$. Thus, using the fact that $F(1,\lambda)=\lambda F(\tfrac1\lambda,1)$, we obtain $G_1(\tfrac{1}{\lambda_1},1)=0$, and it is easy to calculate that $G_1(1,\lambda_1)=G_1(\tfrac{1}{\lambda_1},1)=0$.

But $0=1\cdot G_1(1,\lambda_1)+\lambda_1 G_2(1,\lambda_1)=F(1,\lambda_1)$, and $F(1,\lambda_1)=0$ means that $(\tilde X_1^\top,\lambda_1\tilde X_2^\top)^\top\perp\!\!\!\perp Y$, which conflicts with our assumption.

Proof of Proposition 2.1

Since $\mathcal S(\beta)\subseteq\mathcal S(\eta)=\mathcal S_{Y|X}$ and $d_1\le d$, there exists a $d\times d_1$ matrix $A$ satisfying $\beta=\eta A$. Therefore, $C^2(\beta^\top X|Y)=C^2(A^\top\eta^\top X|Y)$.

Assume the singular value decomposition of $A$ is $U\Sigma V^\top$, where $U$ is a $d\times d$ orthogonal matrix, $V$ is a $d_1\times d_1$ orthogonal matrix, and $\Sigma$ is a $d\times d_1$ diagonal matrix with nonnegative entries on the diagonal; it is easy to prove that all the diagonal entries of $\Sigma$ equal 1. Based on Theorem 3, part (2), of Yin and Yuan (2020), $C^2(\beta^\top X|Y)=C^2(V\Sigma^\top U^\top\eta^\top X|Y)=C^2(\Sigma^\top U^\top\eta^\top X|Y)$.

Let $U^\top\eta^\top X=(\tilde X_1,\ldots,\tilde X_d)^\top$. Since all the diagonal entries of $\Sigma$ equal 1 and $\Sigma^\top U^\top\eta^\top X=(\tilde X_1,\ldots,\tilde X_{d_1})^\top$, Lemma A.1 gives $C^2(\Sigma^\top U^\top\eta^\top X|Y)\le C^2(U^\top\eta^\top X|Y)$, with equality if and only if $d=d_1$. Again by Theorem 3, part (2), of Yin and Yuan (2020), $C^2(U^\top\eta^\top X|Y)=C^2(\eta^\top X|Y)$. Thus $C^2(\beta^\top X|Y)\le C^2(\eta^\top X|Y)$, and equality holds if and only if $\mathcal S(\beta)=\mathcal S(\eta)$.

Proof of Proposition 2.2

For the $\beta$ and $\eta$ described in Proposition 2.2, there exists a rotation matrix $Q$ such that $\beta Q=(\eta_a,\eta_b)$ with $\mathcal S(\eta_a)\subseteq\mathcal S(\eta)$ and $\mathcal S(\eta_b)\subseteq\mathcal S^\perp(\eta)$, where $\mathcal S^\perp(\eta)$ is the orthogonal complement of $\mathcal S(\eta)$.

Since $Y\perp\!\!\!\perp\eta_b^\top X\mid\eta^\top X$ and $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, we have $(Y,\eta^\top X)\perp\!\!\!\perp\eta_b^\top X$, and according to Proposition 4.3 of Cook (1998), $(Y,\eta_a^\top X)\perp\!\!\!\perp\eta_b^\top X$. Let $W_1=\big((\eta_a^\top X)^\top,0\big)^\top$, $V_1=Y$, $W_2=\big(0,(\eta_b^\top X)^\top\big)^\top$ and $V_2=0$. Then $(W_1,V_1)\perp\!\!\!\perp(W_2,V_2)$. According to Theorem 1, part (2), of Yin and Yuan (2020), $C(W_1+W_2|V_1+V_2)<C(W_1|V_1)+C(W_2|V_2)$; that is, $C^2(Q^\top\beta^\top X|Y)=C^2(\beta^\top X|Y)<C^2(\eta_a^\top X|Y)\le C^2(\eta^\top X|Y)$.

Appendix B

Proof of Proposition 2.3

In order to prove Proposition 2.3 in Section 2.5 of this article, we provide and prove the following Lemma B.1 first.

Lemma B.1

If the support of $X$, say $S$, is compact and, furthermore, $\eta_n\xrightarrow{P}\eta$, then $C_n^2(\eta_n^\top X|Y)-C_n^2(\eta^\top X|Y)\xrightarrow{P}0$.

Proof

Based on Corollary 1 in Yin and Yuan (2020), we have
$$C_n^2(\eta_n^\top X|Y)=\frac{1}{n^2}\sum_{k,l=1}^{n}|\eta_n^\top X_k-\eta_n^\top X_l|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|\eta_n^\top X_{y,k_y}-\eta_n^\top X_{y,l_y}|,$$
$$C_n^2(\eta^\top X|Y)=\frac{1}{n^2}\sum_{k,l=1}^{n}|\eta^\top X_k-\eta^\top X_l|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|\eta^\top X_{y,k_y}-\eta^\top X_{y,l_y}|.$$
Because $\eta_n\to\eta$ in probability, write $\eta_n=\eta+\epsilon_n$. Then for any $\epsilon>0$, $\|\epsilon_n\|<\epsilon$ as $n\to\infty$, where $\|\cdot\|$ is the Frobenius norm. Hence, by the condition on $X$, there is a positive constant $c_x$ such that, for large $n$, $|C_n^2(\eta_n^\top X|Y)-C_n^2(\eta^\top X|Y)|\le\epsilon c_x$. Hence the conclusion follows.

Proof of Proposition 2.3

To simplify the proof, we restrict the support of $X$ to a compact set $S$; it can be shown that $\mathcal S_{Y|X}=\mathcal S_{Y|X_S}$ (Yin et al., 2008, Proposition 10), where $X_S$ is $X$ restricted to $S$. Without loss of generality, we assume $Q=I_d$. Suppose $\eta_n$ is not a consistent estimator of a basis of $\mathcal S_{Y|X}$. Then there exist a subsequence, still indexed by $n$, and an $\eta^*$ satisfying $\eta^{*\top}\Sigma_X\eta^*=I_d$ such that $\eta_n\xrightarrow{P}\eta^*$ but $\mathrm{Span}(\eta^*)\ne\mathrm{Span}(\eta)$.

By Lemma B.1, we have $C_n^2(\eta_n^\top X|Y)-C_n^2(\eta^{*\top}X|Y)\xrightarrow{P}0$, and by Lemma 3 in Yin and Yuan (2020), $C_n^2(\eta^{*\top}X|Y)\xrightarrow{a.s.}C^2(\eta^{*\top}X|Y)$. Therefore, $C_n^2(\eta_n^\top X|Y)\xrightarrow{P}C^2(\eta^{*\top}X|Y)$.

On the other hand, because $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$, we have $C_n^2(\eta_n^\top X|Y)\ge C_n^2(\eta^\top X|Y)$. Taking limits on both sides of this inequality gives $C^2(\eta^{*\top}X|Y)\ge C^2(\eta^\top X|Y)$. However, we have proved that, under the assumption $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, $\eta=\arg\max_{\beta^\top\Sigma_X\beta=I_d}C^2(\beta^\top X|Y)$, and we also assume that the central subspace is unique. Therefore $C^2(\eta^{*\top}X|Y)\ge C^2(\eta^\top X|Y)$ contradicts this, so $\eta_n$ is a consistent estimator of a basis of the central subspace.

Appendix C

Proof of Proposition 2.4

The Lagrange multiplier technique is used to prove the $\sqrt n$-consistency of $\mathrm{vec}(\eta_n)$ in Proposition 2.4 in Section 2.5 of the article. First, we introduce the following notation and conditions, and we also give a new definition.

For a random sample $(\mathbf X,\mathbf Y)=\{(X_k,Y_k): k=1,\ldots,n\}$ from the joint distribution of random vectors $X$ in $\mathbb R^p$ and $Y$ in $\mathbb R$, let
$$L(\zeta)=C^2(\beta^\top X|Y)+\lambda^\top\{\mathrm{vec}(\beta^\top\Sigma_X\beta)-\mathrm{vec}(I_d)\}\quad\text{and}\quad L_n(\zeta)=C_n^2(\beta^\top X|Y)+\lambda^\top\{\mathrm{vec}(\beta^\top\hat\Sigma_X\beta)-\mathrm{vec}(I_d)\}.$$
Here $\zeta=(\mathrm{vec}(\beta)^\top,\lambda^\top)^\top\in\mathbb R^{pd+d^2}$, $\beta\in\mathbb R^{p\times d}$, $\lambda\in\mathbb R^{d^2}$, $\Sigma_X$ is the covariance matrix of $X$, and $\hat\Sigma_X$ is its sample estimate. Let $\eta_n=\arg\max_{\beta^\top\hat\Sigma_X\beta=I_d}C_n^2(\beta^\top X|Y)$. Then there exists a $\lambda_n$ such that $(\mathrm{vec}(\eta_n)^\top,\lambda_n^\top)^\top$ is a stationary point of $L_n(\zeta)$. Let $\theta_n=(\mathrm{vec}(\eta_n)^\top,\lambda_n^\top)^\top$; then $L_n'(\theta_n)=0$. Let $\eta$ be a basis of the CS. Then, under the assumption $P_\eta(\Sigma_X)X\perp\!\!\!\perp Q_\eta(\Sigma_X)X$, there exists a rotation matrix $Q$ with $Q^\top Q=I_d$ such that $\eta Q=\arg\max_{\beta^\top\Sigma_X\beta=I_d}C^2(\beta^\top X|Y)$. Without loss of generality, we assume $Q=I_d$ here. Therefore there exists a $\lambda_0$ such that $(\mathrm{vec}(\eta)^\top,\lambda_0^\top)^\top$ is a stationary point of $L(\zeta)$. Let $\theta=(\mathrm{vec}(\eta)^\top,\lambda_0^\top)^\top$.

In the proof, we need to take derivatives of $C^2(\eta^\top X|Y)$ and $C_n^2(\eta^\top X|Y)$ with respect to $\mathrm{vec}(\eta)$. For simplicity of notation, when we consider these derivatives we write $C(\eta)$ and $C_n(\eta)$ for $C^2(\eta^\top X|Y)$ and $C_n^2(\eta^\top X|Y)$, respectively.

Here is some additional notation used in the proof. $I_{(d,d)}$ is the vec-permutation matrix. $I_m$ is the identity matrix of order $m$, and $I_m(:,i)$ denotes the $i$th column of $I_m$. $A\otimes B$ denotes the Kronecker product of matrices $A$ and $B$, and $\mathrm{vec}(\cdot)$ is the vec operator.

Furthermore, we give the following definition and assumptions.

Definition C.1

Let $\Delta(\eta)=\{\alpha:\|\alpha-\eta\|\le c\}$, where $\alpha$ is a $p\times d$ matrix with $\alpha^\top\Sigma_X\alpha=I_d$, $c$ is a fixed small constant, and $\|\cdot\|$ is the Frobenius norm. We define the indicator function
$$\rho(X,X')=\begin{cases}0, & \text{if } |\alpha^\top(X-X')|\le\epsilon_0 \text{ for some }\alpha\in\Delta(\eta),\\ 1, & \text{if } |\alpha^\top(X-X')|>\epsilon_0 \text{ for all }\alpha\in\Delta(\eta),\end{cases}$$
where $X'$ is an i.i.d. copy of $X$ and $\epsilon_0$ is a small number. We define the second and third derivatives of $C(\eta)$ with respect to $\mathrm{vec}(\eta)$ as $C''(\eta)\rho(X,X')$ and $C'''(\eta)\rho(X,X')$, and for simplicity of notation we still write $C''(\eta)$ and $C'''(\eta)$ for $C''(\eta)\rho(X,X')$ and $C'''(\eta)\rho(X,X')$, respectively.

The reason we use this definition is that, under Definition C.1, the second and third derivatives of $C(\eta)$ and $C_n(\eta)$ are bounded in a neighbourhood of the central subspace.

Assumption C.1

$\mathrm{Var}[\phi^{(1)}(X,X')]$, $\mathrm{Var}[\phi^{(2y)}(X_y,X_y')]$ for $y=1,\ldots,H$, $\mathrm{Var}[\phi^{(3)}(X)]$, $\mathrm{Var}[\phi^{(4)}(X,X')]$, $\mathrm{Var}[\phi^{(5)}(X)]$, $\mathrm{Var}[\phi^{(6)}(X,X')]$ and $\mathrm{Var}[\phi^{(7)}(X)]$ are all finite. Here
$$\phi^{(1)}(X,X')=\frac{(I_d\otimes(X-X'))(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)|},\qquad
\phi^{(2y)}(X_y,X_y')=\frac{(I_d\otimes(X_y-X_y'))(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)|},\ y=1,\ldots,H,$$
$$\phi^{(3)}(X)=(I_d\otimes XX^\top\eta)(I_{d^2}+I_{(d,d)})\lambda_0,\qquad
\phi^{(4)}(X,X')=\tfrac12\big(I_d\otimes(XX'^\top+X'X^\top)\eta\big)(I_{d^2}+I_{(d,d)})\lambda_0,$$
$$\phi^{(5)}(X)=\mathrm{vec}(\eta^\top XX^\top\eta),\qquad
\phi^{(6)}(X,X')=\tfrac12\mathrm{vec}\big(\eta^\top(XX'^\top+X'X^\top)\eta\big),\qquad
\phi^{(7)}(X)=\mathrm{vec}\big((X-EX)(X-EX)^\top\big).$$
Here $(X,Y)$ and $(X',Y')$ are i.i.d. copies, and $(X_y,Y_y)$ and $(X_y',Y_y')$ are i.i.d. copies in the $y$th slice.

Assumption C.2

The matrix
$$\begin{pmatrix} C''(\eta)+L & (I_d\otimes\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\Sigma_X) & 0\end{pmatrix}$$
is nonsingular.

Assumption C.1 is needed for Proposition 2.4 in the main article and for Lemma C.1 below; it is similar to the conditions assumed in Theorem 6.1.6 of Lehmann (1999, Ch. 6). This assumption is required by the asymptotic properties of U-statistics.

Assumption C.2 is in the spirit of the von Mises proposition (Serfling, 1980, Section 6.1), which states that if the first nonvanishing term of the Taylor expansion is the linear term, then $\sqrt n$-consistency of the differentiable statistical function can be achieved. In our case, we assume the corresponding matrix is nonsingular, which guarantees the $\sqrt n$-consistency. If the matrix is singular, then $n$- or higher-order consistency of some parts of our estimates can be proved.

In order to prove Proposition 2.4 in Section 2.5 of the paper, we provide and prove the following Lemma C.1 first.

Lemma C.1

Under Assumptions C.1 and C.2 and the assumptions in Proposition 2.4, $\sqrt n(\theta_n-\theta)\xrightarrow{D}N(0,V)$. The explicit expression for $V$ is given in the proof.

Proof

The Taylor expansion of $L_n'(\theta_n)$ at $\theta$ is
$$0=L_n'(\theta_n)=L_n'(\theta)+L_n''(\theta)(\theta_n-\theta)+R_1(\tilde\theta_n),$$
where $\|\tilde\theta_n-\theta\|\le\|\theta_n-\theta\|$, $\|\cdot\|$ is the Frobenius norm and $\tilde\theta_n=(\mathrm{vec}(\tilde\eta_n)^\top,\tilde\lambda_n^\top)^\top$. Next, we give explicit expressions for $L_n'(\theta)$, $L_n''(\theta)$ and $R_1(\tilde\theta_n)$. A simple calculation gives
$$L_n'(\theta)=\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix},\qquad
L_n''(\theta)=\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix},$$
where $\hat L=\big(\mathrm{vec}(\hat L_{11}),\mathrm{vec}(\hat L_{21}),\ldots,\mathrm{vec}(\hat L_{p1}),\ldots,\mathrm{vec}(\hat L_{1d}),\mathrm{vec}(\hat L_{2d}),\ldots,\mathrm{vec}(\hat L_{pd})\big)$ and $\hat L_{ij}=\hat\Sigma_X I_p(:,i)\lambda_0^\top(I_{d^2}+I_{(d,d)})(I_d(:,j)\otimes I_d)$. It is obvious that $\hat L\xrightarrow{a.s.}L$, where $L=\big(\mathrm{vec}(L_{11}),\mathrm{vec}(L_{21}),\ldots,\mathrm{vec}(L_{p1}),\ldots,\mathrm{vec}(L_{1d}),\mathrm{vec}(L_{2d}),\ldots,\mathrm{vec}(L_{pd})\big)$ and $L_{ij}=\Sigma_X I_p(:,i)\lambda_0^\top(I_{d^2}+I_{(d,d)})(I_d(:,j)\otimes I_d)$, for $i=1,\ldots,p$ and $j=1,\ldots,d$.

The remainder term $R_1(\tilde\theta_n)$ involves the third derivative of $L_n(\zeta)$ at $\tilde\theta_n$. Let $T_n=L_n'''(\tilde\theta_n)$, where $T_n$ is a $(pd+d^2)\times(pd+d^2)\times(pd+d^2)$ array and each $T_n(j,:,:)$, $j=1,\ldots,pd+d^2$, is a $(pd+d^2)\times(pd+d^2)$ matrix. Therefore $R_1(\tilde\theta_n)$ can be written as
$$R_1(\tilde\theta_n)=\frac12\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)(\theta_n-\theta)\\ (\theta_n-\theta)^\top T_n(2,:,:)(\theta_n-\theta)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)(\theta_n-\theta)\end{pmatrix}.$$
Based on the explicit expressions of $L_n'(\theta)$, $L_n''(\theta)$ and $R_1(\tilde\theta_n)$ above, the Taylor expansion of $L_n'(\theta_n)$ at $\theta$ can be written as
$$0=\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}+\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}(\theta_n-\theta)+\frac12\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)(\theta_n-\theta)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)(\theta_n-\theta)\end{pmatrix}.$$
Rearranging the above Taylor expansion, we get
$$-\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\sqrt n\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}$$
$$=\left[I_{pd+d^2}+\frac12\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)\\ (\theta_n-\theta)^\top T_n(2,:,:)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)\end{pmatrix}\right]\sqrt n(\theta_n-\theta).$$
Next, we prove two parts.

Part 1:
$$-\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\sqrt n\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}\xrightarrow{D}N(0,V).$$

Part 2:
$$\sqrt n(\theta_n-\theta)\stackrel{D}{=}\left[I_{pd+d^2}+\frac12\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)\\ (\theta_n-\theta)^\top T_n(2,:,:)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)\end{pmatrix}\right]\sqrt n(\theta_n-\theta).$$

Proof of part 1

We show that both $C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0$ and $\mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)$ are linear combinations of U-statistics, so that the asymptotic distribution follows from the asymptotic properties of U-statistics.

Based on Corollary 1 in Yin and Yuan (2020),
$$C_n(\eta)=\frac{1}{n^2}\sum_{k,l=1}^{n}|\eta^\top(X_k-X_l)|-\frac{1}{n}\sum_{y=1}^{H}\frac{1}{n_y}\sum_{k_y,l_y=1}^{n_y}|\eta^\top(X_{y,k_y}-X_{y,l_y})|.$$
With some calculation, we can get
$$C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0=\frac{n-1}{n}U_{1n}-\frac{1}{n}\sum_{y=1}^{H}(n_y-1)U_{2y}+\frac{n-1}{n}U_{3n}-\frac{n-1}{n}U_{4n},$$
where
$$U_{1n}=\binom{n}{2}^{-1}\sum_{1\le k<l\le n}\frac{(I_d\otimes(X_k-X_l))(I_d\otimes(X_k-X_l))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_k-X_l))^\top\mathrm{vec}(\eta)|},$$
$$U_{2y}=\binom{n_y}{2}^{-1}\sum_{1\le k_y<l_y\le n_y}\frac{(I_d\otimes(X_{y,k_y}-X_{y,l_y}))(I_d\otimes(X_{y,k_y}-X_{y,l_y}))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_{y,k_y}-X_{y,l_y}))^\top\mathrm{vec}(\eta)|},\quad y=1,\ldots,H,$$
$$U_{3n}=\frac1n\sum_{i=1}^{n}(I_d\otimes X_iX_i^\top\eta)(I_{d^2}+I_{(d,d)})\lambda_0,\qquad
U_{4n}=\binom{n}{2}^{-1}\sum_{i<j}\frac12\big(I_d\otimes(X_iX_j^\top+X_jX_i^\top)\eta\big)(I_{d^2}+I_{(d,d)})\lambda_0.$$
Here $U_{1n}$, $U_{2y}$ $(y=1,\ldots,H)$, $U_{3n}$ and $U_{4n}$ are U-statistics. In the notation $U_{2y}$, $y=1,\ldots,H$ and $k_y,l_y=1,\ldots,n_y$, where $H$ denotes the number of slices and $n_y$ is the number of observations in the $y$th slice.

The term $\mathrm{vec}(\eta^\top\hat\Sigma_X\eta)$ is also a linear combination of U-statistics: let
$$U_{5n}=\frac1n\sum_{i=1}^{n}\mathrm{vec}(\eta^\top X_iX_i^\top\eta),\qquad U_{6n}=\binom{n}{2}^{-1}\sum_{i<j}\frac12\mathrm{vec}\big(\eta^\top(X_iX_j^\top+X_jX_i^\top)\eta\big),$$
and then $\mathrm{vec}(\eta^\top\hat\Sigma_X\eta)=\frac{n-1}{n}U_{5n}-\frac{n-1}{n}U_{6n}$.

Let
$$\mu_1=E\left[\frac{(I_d\otimes(X-X'))(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X-X'))^\top\mathrm{vec}(\eta)|}\right],\qquad
\mu_{2y}=E\left[\frac{(I_d\otimes(X_y-X_y'))(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)}{|(I_d\otimes(X_y-X_y'))^\top\mathrm{vec}(\eta)|}\right],\ y=1,\ldots,H,$$
$$\mu_3=E\big[(I_d\otimes XX^\top\eta)\big](I_{d^2}+I_{(d,d)})\lambda_0,\qquad
\mu_4=\big(I_d\otimes(EX)(EX)^\top\eta\big)(I_{d^2}+I_{(d,d)})\lambda_0,$$
$$\mu_5=\mathrm{vec}\big(\eta^\top(EXX^\top)\eta\big),\qquad
\mu_6=\mathrm{vec}\big(\eta^\top(EX)(EX)^\top\eta\big).$$
Here $(X,Y)$ and $(X',Y')$ are i.i.d. copies, and $(X_y,Y_y)$ and $(X_y',Y_y')$ are i.i.d. copies in the $y$th slice.

According to Theorem 6.1.6 of Lehmann (1999, Ch. 6),
$$\sqrt n\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}\xrightarrow{D}N(0,\Sigma),\qquad\text{where}\quad \Sigma=\begin{pmatrix}\Sigma_{11}&\cdots&\Sigma_{1(H+5)}\\ \vdots&\ddots&\vdots\\ \Sigma_{(H+5)1}&\cdots&\Sigma_{(H+5)(H+5)}\end{pmatrix}.$$

Let
$$B=\begin{pmatrix}I_{pd} & -\frac1H I_{pd} & \cdots & -\frac1H I_{pd} & I_{pd} & -I_{pd} & 0 & 0\\ 0 & 0 & \cdots & 0 & 0 & 0 & I_{d^2} & -I_{d^2}\end{pmatrix},$$
where the $0$ blocks in the first row are $pd\times d^2$ zero matrices and those in the second row are $d^2\times pd$ zero matrices. Then
$$\sqrt n\,B\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}=\sqrt n\begin{pmatrix}U_{1n}-\frac1H\sum_{y=1}^{H}U_{2y}+U_{3n}-U_{4n}\\ U_{5n}-U_{6n}-\mathrm{vec}(I_d)\end{pmatrix}.$$

Note that
$$\sqrt n\begin{pmatrix}C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}=\sqrt n\begin{pmatrix}\frac{n-1}{n}U_{1n}-\frac1n\sum_{y=1}^{H}(n_y-1)U_{2y}+\frac{n-1}{n}U_{3n}-\frac{n-1}{n}U_{4n}\\ \frac{n-1}{n}U_{5n}-\frac{n-1}{n}U_{6n}-\mathrm{vec}(I_d)\end{pmatrix},$$
and, under Assumption C.1,
$$\sqrt n\begin{pmatrix}\frac{n-1}{n}U_{1n}-\frac1n\sum_{y=1}^{H}(n_y-1)U_{2y}+\frac{n-1}{n}U_{3n}-\frac{n-1}{n}U_{4n}\\ \frac{n-1}{n}U_{5n}-\frac{n-1}{n}U_{6n}-\mathrm{vec}(I_d)\end{pmatrix}-\sqrt n\begin{pmatrix}U_{1n}-\frac1H\sum_{y=1}^{H}U_{2y}+U_{3n}-U_{4n}\\ U_{5n}-U_{6n}-\mathrm{vec}(I_d)\end{pmatrix}\xrightarrow{P}0.$$
Therefore, according to Slutsky's theorem,
$$\sqrt n\begin{pmatrix}C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}\stackrel{D}{=}\sqrt n\,B\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}.$$

Let
$$A_n=\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1},\qquad
A=\begin{pmatrix} C''(\eta)+L & (I_d\otimes\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\Sigma_X) & 0\end{pmatrix}^{-1}.$$

Under Assumption C.2 and our definition of the second derivative of $C_n(\eta)$, by the SLLN for U-statistics, $A_n\xrightarrow{a.s.}A$. Therefore,
$$-A_n\sqrt n\begin{pmatrix} C_n'(\eta)+(I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\lambda_0\\ \mathrm{vec}(\eta^\top\hat\Sigma_X\eta)-\mathrm{vec}(I_d)\end{pmatrix}\stackrel{D}{=}-\sqrt n\,AB\begin{pmatrix}U_{1n}-\mu_1\\ U_{21}-\mu_{21}\\ \vdots\\ U_{2H}-\mu_{2H}\\ U_{3n}-\mu_3\\ U_{4n}-\mu_4\\ U_{5n}-\mu_5\\ U_{6n}-\mu_6\end{pmatrix}\xrightarrow{D}N(0,V),$$
where $V=AB\Sigma B^\top A^\top$.

Proof of part 2

Under Assumption C.2 and Definition C.1,
$$I_{pd+d^2}+\frac12\begin{pmatrix} C_n''(\eta)+\hat L & (I_d\otimes\hat\Sigma_X\eta)(I_{d^2}+I_{(d,d)})\\ (I_{d^2}+I_{(d,d)})(I_d\otimes\eta^\top\hat\Sigma_X) & 0\end{pmatrix}^{-1}\begin{pmatrix}(\theta_n-\theta)^\top T_n(1,:,:)\\ (\theta_n-\theta)^\top T_n(2,:,:)\\ \vdots\\ (\theta_n-\theta)^\top T_n(pd+d^2,:,:)\end{pmatrix}\xrightarrow{P}I_{pd+d^2}.$$
Therefore, by Slutsky's theorem, $\sqrt n(\theta_n-\theta)$ has the same limiting distribution as the right-hand side of the identity in Part 2. Combining Parts 1 and 2 gives $\sqrt n(\theta_n-\theta)\xrightarrow{D}N(0,V)$; in other words, $\theta_n$ is a $\sqrt n$-consistent estimator of $\theta$.

In the above proof, we assumed without loss of generality that $Q=I_d$. Note that for an orthogonal matrix $Q$, $C_n^2(Q^\top\beta^\top X|Y)=C_n^2(\beta^\top X|Y)$ and $C^2(Q^\top\beta^\top X|Y)=C^2(\beta^\top X|Y)$ (Yin & Yuan, 2020). If we define $\eta_Q=\eta Q$, then without assuming $Q=I_d$, Lemma C.1 holds with $C(\eta_Q)$, obtained by replacing every $\eta$ in $C(\eta)$ with $\eta_Q$. (Of course, $C(\eta_{I_d})=C(\eta)$ in the proof above.)

Proof of Proposition 2.4

Let $G=(I_{pd},0)$ be a $pd\times(pd+d^2)$ matrix, where $I_{pd}$ is the $pd\times pd$ identity matrix. Then $\mathrm{vec}(\eta_n)=G\theta_n$ and $\mathrm{vec}(\eta_Q)=G\theta$. By Lemma C.1, we have $\sqrt n\big(\mathrm{vec}(\eta_n)-\mathrm{vec}(\eta_Q)\big)=\sqrt n\,G(\theta_n-\theta)\xrightarrow{D}N(0,V_{11}(\eta_Q))$; in other words, $\sqrt n[\mathrm{vec}(\eta_n)-\mathrm{vec}(\eta_Q)]\xrightarrow{D}N(0,V_{11}(\eta_Q))$, where $V_{11}(\eta_Q)=GV(\eta_Q)G^\top$.