
Abstract

With the development of modern science and technology, more and more high-dimensional data appear in application fields. Since high dimensionality can increase the complexity of the covariance structure, comparing the covariance matrices among populations is strongly motivated in high-dimensional data analysis. In this article, we consider the proportionality test of two high-dimensional covariance matrices, where the data dimension is potentially much larger than the sample sizes, or even larger than the squares of the sample sizes. We devise a novel high-dimensional spatial rank test that has much higher power than many existing popular tests, especially for data generated from heavy-tailed distributions. The asymptotic normality of the proposed test statistic is established under the family of elliptically symmetric distributions, which is more general than the normal distribution family and includes numerous commonly used heavy-tailed distributions. Extensive numerical experiments demonstrate the superiority of the proposed test in terms of both empirical size and power. A real data analysis then demonstrates the practicability of the proposed test for high-dimensional gene expression data.


1. Introduction

High-dimensional data are increasingly common in bioinformatics, material science, astronomy and other application fields, as data collection technology rapidly evolves (Bühlmann & van de Geer, 2011). However, because the resources available to replicate observations are limited, the sample sizes are usually much smaller than the dimension, which makes most traditional statistical approaches inappropriate. Against this background, scientists in many application fields urgently need powerful approaches to extract as much scientific insight as possible from data. Testing equality of the distributions of two populations is a crucial problem in high-dimensional statistics, and it is far more challenging than its fixed-dimensional counterpart. Because of this complexity, it is usually replaced by a simpler problem, i.e. testing equality of some numerical characteristics of the two populations, such as means and covariances, which is still very useful but much easier to implement.

There is already a large body of literature on detecting the difference between the means of two high-dimensional populations, such as Bai and Saranadasa (1996), Chen and Qinm (2010), and Feng et al. (2016), to name just a few. In contrast, there are far fewer studies on testing the covariance matrices of two high-dimensional populations. Hence, in this article, we focus on comparing the covariance matrices of two populations, which is strongly motivated for high-dimensional data, as high data dimensions can potentially increase the complexity of the covariance structure (Li & Chen, 2012). In particular, we consider testing the proportionality of two high-dimensional covariance matrices, which investigates the simplest form of heteroscedasticity of the population covariance matrices (Xu et al., 2014). It is often a preparatory step before the case–control analysis of genomic data. Let $X$ and $Y$ be two p-dimensional populations with mean vectors $μ_{1}$, $μ_{2}$ and covariance matrices $Σ_{1}$, $Σ_{2}$, respectively. The proportionality test of the two population covariance matrices is formulated as follows: $H_{0} : Σ_{1} = c Σ_{2} \quad \text{versus} \quad H_{1} : Σ_{1} \neq c Σ_{2},$ (1) where c is an unknown scalar.

The proportionality testing problem in (1) has been widely studied in various areas, such as discriminant analysis and principal component analysis (Flury & Riedwyl, 1988; Schott, 1991), and there is a substantial early literature on its methodology, such as Eriksen (1987), Federer (1951), Flury (1986), Kim (1971), Rao (1983), and Schott (1999). For example, the most traditional test statistic satisfies $(n_{1} + n_{2}) \sum_{j = 1}^{p} \log ({\hat{λ}}_{j}) - n_{1} \log (| {\hat{Σ}}_{1} |) + n_{2} \{p \log (\hat{c}) - \log (| {\hat{Σ}}_{2} |)\} \overset{d}{\to} χ_{(p^{2} + p - 2) / 2}^{2},$ where $({\hat{λ}}_{j}, \hat{c})$ are obtained by an iterative algorithm proposed in Flury (1986) and ${\hat{Σ}}_{1}, {\hat{Σ}}_{2}$ are the corresponding sample covariance matrices. These methods are built on classical limit theorems, which assume that the sample sizes tend to infinity while the dimension is fixed; hence they have difficulty analysing high-dimensional data, where the dimension is much larger than the sample sizes. To alleviate such difficulties, Xu et al. (2014) proposed a pseudo-likelihood ratio test that extends the traditional likelihood ratio test, with the statistic $p \log \{p^{- 1} \operatorname{tr} ({\hat{Σ}}_{1} {\hat{Σ}}_{2}^{- 1})\} - \log | {\hat{Σ}}_{1} {\hat{Σ}}_{2}^{- 1} |,$ which allows the dimension to increase proportionally with each sample size; furthermore, Liu et al. (2014) proposed an improved method, which allows the dimension to be larger than one of the sample sizes.
In addition, for the special case of $c \equiv 1$ in (1), Li and Chen (2012) proposed the test statistic $T_{L C} = A_{n_{1}} + A_{n_{2}} - 2 C_{n_{1}, n_{2}},$ where $\begin{aligned} A_{n_{i}} & = \frac{1}{n_{i} (n_{i} - 1)} \sum_{k \neq l} (X_{i k}^{⊤} X_{i l})^{2} - \frac{2}{n_{i} (n_{i} - 1) (n_{i} - 2)} \sum_{j, k, l \text{ not equal}} X_{i k}^{⊤} X_{i j} X_{i j}^{⊤} X_{i l} \\ & \quad + \frac{1}{n_{i} (n_{i} - 1) (n_{i} - 2) (n_{i} - 3)} \sum_{j, k, l, t \text{ not equal}} X_{i k}^{⊤} X_{i j} X_{i t}^{⊤} X_{i l}, \\ C_{n_{1}, n_{2}} & = \frac{1}{n_{1} n_{2}} \sum_{k = 1}^{n_{1}} \sum_{l = 1}^{n_{2}} (X_{1 k}^{⊤} X_{2 l})^{2} - \frac{1}{n_{1} n_{2} (n_{1} - 1)} \sum_{k \neq l}^{n_{1}} \sum_{j = 1}^{n_{2}} X_{1 k}^{⊤} X_{2 j} X_{2 j}^{⊤} X_{1 l} \\ & \quad - \frac{1}{n_{1} n_{2} (n_{2} - 1)} \sum_{k \neq l}^{n_{2}} \sum_{j = 1}^{n_{1}} X_{2 k}^{⊤} X_{1 j} X_{1 j}^{⊤} X_{2 l} \\ & \quad + \frac{1}{n_{1} n_{2} (n_{1} - 1) (n_{2} - 1)} \sum_{k \neq t}^{n_{1}} \sum_{j \neq l}^{n_{2}} X_{1 k}^{⊤} X_{2 j} X_{1 t}^{⊤} X_{2 l} . \end{aligned}$ Here $X_{i k}$ denotes the kth observation from the ith population. As mentioned in Li and Chen (2012), $T_{L C}$ is an unbiased estimator of $\operatorname{tr} \{(Σ_{1} - Σ_{2})^{2}\}$. Despite this progress, drawbacks remain: first, these methods may perform extremely poorly for heavy-tailed distributions; second, the sample covariance matrices, which need to be inverted in the construction of the likelihood-ratio-type statistics above, are singular when the dimension is larger than both of the sample sizes.

To overcome these two drawbacks, more attention has been paid to nonparametric testing methods based on the multivariate sign or rank. Recently, for testing the proportionality of two high-dimensional covariance matrices, Cheng et al. (2018) proposed a test procedure based on the multivariate sign and demonstrated its good performance in high-dimensional data analysis, especially for heavy-tailed distributions. Recall that for fixed-dimensional data, the multivariate sign and rank are widely used to construct robust tests (Oja, 2010). However, most of these tests are not effective for high-dimensional data. Therefore, many studies have extended the traditional multivariate sign- or rank-based testing methods to high-dimensional data, such as Feng and Sun (2016) and Wang et al. (2015) for one-sample problems; Feng et al. (2016) for two-sample problems; and Feng and Liu (2017) and Zou et al. (2014) for sphericity testing problems. These studies clearly demonstrate the advantages of the high-dimensional multivariate sign- or rank-based methods in high-dimensional and heavy-tailed cases.

Unfortunately, due to the bias caused by estimating the location parameters, the test procedure based on the multivariate sign only allows the dimension to be at most of the order of the squares of the sample sizes (Cheng et al., 2018). This is too restrictive for many practical applications and can undermine the validity of the test procedure. For example, genomic data typically carry thousands of measurements on the genome, so the dimension can be much larger than the squares of the sample sizes. Therefore, there is a pressing need for a new method to deal with the proportionality testing problem in (1) for high-dimensional data where the dimension is much higher than the squares of the sample sizes. This is the motivation of this article.

The rest of the article is organized as follows. In Section 2, we introduce the proposed high-dimensional spatial rank test and establish its asymptotic normality under elliptically symmetric populations. Then, we demonstrate the numerical performance of the proposed test in Section 3, followed by a real data analysis in Section 4. Finally, we conclude the article in Section 5 and relegate the technical proofs to the Appendix.

2. Method

2.1. The proposed test

A p-dimensional random vector $Z$ is said to follow an elliptically symmetric distribution, denoted by $E_{p} (μ, Λ, F_{ξ})$, if it has the following stochastic representation: $Z = μ + ξ A U,$ where $μ$ is the p-dimensional mean vector, ξ is a non-negative random variable, $F_{ξ}$ is the cumulative distribution function of ξ, $U$ is independent of ξ and is uniformly distributed on the unit sphere in $\mathbb{R}^{p}$, and $A$ is a deterministic $p \times p$ matrix satisfying $A A^{T} = Λ$ with $\operatorname{tr} (Λ) = 1$. It is known that the covariance matrix $Σ$ and the shape matrix $Λ$ of the elliptically symmetric population $Z$ satisfy the equation $Σ = p^{- 1} E (ξ^{2}) Λ$.
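The stochastic representation translates directly into a sampler. The following is a minimal NumPy sketch, not code from the article: the function name `sample_elliptical` and the `xi_sampler` callback are hypothetical, and the radial law $F_{ξ}$ is left to the caller. The usage lines assume $A = I_{p} / \sqrt{p}$, so that $\operatorname{tr}(A A^{T}) = 1$, and $ξ^{2} / p \sim χ_{p}^{2}$, which yields $N_{p}(0, I_{p})$.

```python
import numpy as np

def sample_elliptical(n, mu, A, xi_sampler, rng):
    """Draw n rows Z_i = mu + xi_i * A @ U_i (hypothetical helper)."""
    mu = np.asarray(mu, dtype=float)
    A = np.asarray(A, dtype=float)
    p = mu.shape[0]
    # Normalised Gaussian vectors are uniform on the unit sphere in R^p.
    G = rng.standard_normal((n, p))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)
    # Non-negative radii, drawn independently of U.
    xi = np.asarray(xi_sampler(n, rng), dtype=float)
    return mu + xi[:, None] * (U @ A.T)

rng = np.random.default_rng(0)
p = 5
# Assumed example: A = I / sqrt(p) and xi = sqrt(p * chi2_p) give N_p(0, I).
Z = sample_elliptical(
    200,
    np.zeros(p),
    np.eye(p) / np.sqrt(p),
    lambda n, r: np.sqrt(p * r.chisquare(p, size=n)),
    rng,
)
```

Heavier tails can be obtained by changing only `xi_sampler`, which is the sense in which the elliptical family extends the normal one.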

Let $X_{1}, \dots, X_{n_{1}}$ and $Y_{1}, \dots, Y_{n_{2}}$ denote samples of the two p-dimensional random vectors $X$ and $Y$, which are generated from the two independent elliptically symmetric populations $E_{p} (μ_{1}, Λ_{1}, F_{ξ_{1}})$ and $E_{p} (μ_{2}, Λ_{2}, F_{ξ_{2}})$, respectively. From Section 3.1 in Magyar and Tyler (2014), it is known that $Σ_{i}$ and $§_{i}$ have the same eigenvectors for each $i \in \{1, 2\}$ under the assumption of an elliptically symmetric distribution. Also, from Equation 3.9 in Magyar and Tyler (2014), it is known that when the eigenvalues of the covariance matrices $Σ_{1}$ and $Σ_{2}$ are proportional, the spatial sign covariance matrices $§_{1}$ and $§_{2}$ have the same eigenvalues. Theorem 1 in Cheng et al. (2018) showed that when $§_{1}$ and $§_{2}$ have the same eigenvalues, the eigenvalues of $Σ_{1}$ and $Σ_{2}$ are proportional. Hence, the hypotheses in (1) are equivalent to the following hypotheses: $H_{0} : §_{1} = §_{2} \quad \text{versus} \quad H_{1} : §_{1} \neq §_{2},$ (2) where $§_{1} = E \{U (X - μ_{1}) U (X - μ_{1})^{T}\}$ and $§_{2} = E \{U (Y - μ_{2}) U (Y - μ_{2})^{T}\}$ are the spatial sign covariance matrices of $X$ and $Y$, respectively, and $U (z) = \frac{z}{‖ z ‖} I (z \neq 0)$ for each $z \in \mathbb{R}^{p}$ is the spatial sign function, with $‖ \cdot ‖$ denoting the $L_{2}$-norm and $I (\cdot)$ denoting the indicator function. On this ground, Cheng et al. (2018) suggested using a test statistic based on the squared Frobenius norm of $§_{1} - §_{2}$, i.e. $\operatorname{tr} \{(§_{1} - §_{2})^{2}\}$.
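A small numerical illustration of why a proportionality constant disappears at the level of spatial signs: $U(z)$ discards the length of $z$, so rescaling a centred sample leaves the sample analogue of the spatial sign covariance matrix unchanged. This is a sketch with assumed helper names (`spatial_sign`, `sample_sign_cov`) and a known centre; it is not the estimator used in the article.

```python
import numpy as np

def spatial_sign(Z):
    """Row-wise spatial sign U(z) = z / ||z||, with U(0) = 0."""
    Z = np.asarray(Z, dtype=float)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return np.divide(Z, norms, out=np.zeros_like(Z), where=norms > 0)

def sample_sign_cov(Z, center):
    """Average of U(Z_i - center) U(Z_i - center)^T over the sample."""
    U = spatial_sign(np.asarray(Z, dtype=float) - center)
    return U.T @ U / U.shape[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
S_a = sample_sign_cov(X, np.zeros(4))
S_b = sample_sign_cov(2.5 * X, np.zeros(4))  # same sample, rescaled
```

Here `S_a` and `S_b` agree up to floating-point error, and each has unit trace because every non-zero spatial sign has unit length.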

The proposed spatial rank test in this article is also based on the squared Frobenius norm of $§_{1} - §_{2}$, and it is a high-dimensional extension of Kendall's tau test for the hypotheses in (2) (Oja, 2010). Specifically, the test statistic is $\begin{aligned} T_{H T} & = \frac{p}{n_{1} (n_{1} - 1) (n_{1} - 2) (n_{1} - 3)} \sum^{*} \{U (X_{i} - X_{j})^{T} U (X_{k} - X_{l})\}^{2} \\ & \quad + \frac{p}{n_{2} (n_{2} - 1) (n_{2} - 2) (n_{2} - 3)} \sum^{*} \{U (Y_{i} - Y_{j})^{T} U (Y_{k} - Y_{l})\}^{2} \\ & \quad - \frac{2 p}{n_{1} (n_{1} - 1) n_{2} (n_{2} - 1)} \sum_{i = 1}^{n_{1}} \sum_{j \neq i}^{n_{1}} \sum_{k = 1}^{n_{2}} \sum_{l \neq k}^{n_{2}} \{U (X_{i} - X_{j})^{T} U (Y_{k} - Y_{l})\}^{2}, \end{aligned}$ (3) where $\sum^{*}$ denotes summation over distinct indexes $i, j, k, l$ in $\{1, \dots, n_{1}\}$ for the first term and in $\{1, \dots, n_{2}\}$ for the second term. Note that many recently developed versions of Kendall's tau test are frequently used on related problems (Barber & Kolar, 2018; Cai & Zhang, 2016; Han et al., 2017; Leung & Drton, 2018).
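For small samples the statistic can be evaluated by brute force. The sketch below is an illustration written for this text rather than the authors' implementation; the function names (`pair_signs`, `within_term`, `cross_term`, `t_ht`) are hypothetical. It enumerates all ordered pairs, forms the squared inner products of their spatial signs, and masks out pairs that share an observation. Memory grows like $n^{4}$, so it is only meant for sample sizes of a few dozen.

```python
import numpy as np
from itertools import permutations

def pair_signs(Z):
    """Spatial signs U(Z_i - Z_j) for all ordered pairs i != j."""
    Z = np.asarray(Z, dtype=float)
    idx = np.array(list(permutations(range(Z.shape[0]), 2)))
    D = Z[idx[:, 0]] - Z[idx[:, 1]]
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    U = np.divide(D, norms, out=np.zeros_like(D), where=norms > 0)
    return U, idx

def within_term(Z):
    """Average of {U(Z_i-Z_j)^T U(Z_k-Z_l)}^2 over distinct i, j, k, l."""
    n = np.asarray(Z).shape[0]
    U, idx = pair_signs(Z)
    G2 = (U @ U.T) ** 2
    # share[a, b] is True when ordered pairs a and b have an index in common.
    share = (idx[:, None, :, None] == idx[None, :, None, :]).any(axis=(2, 3))
    return G2[~share].sum() / (n * (n - 1) * (n - 2) * (n - 3))

def cross_term(X, Y):
    """Average of {U(X_i-X_j)^T U(Y_k-Y_l)}^2 over i != j and k != l."""
    n1, n2 = np.asarray(X).shape[0], np.asarray(Y).shape[0]
    UX, _ = pair_signs(X)
    UY, _ = pair_signs(Y)
    return ((UX @ UY.T) ** 2).sum() / (n1 * (n1 - 1) * n2 * (n2 - 1))

def t_ht(X, Y):
    """Brute-force evaluation of the statistic in (3)."""
    p = np.asarray(X).shape[1]
    return p * (within_term(X) + within_term(Y) - 2.0 * cross_term(X, Y))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 10))
Y = rng.standard_normal((7, 10))
t_value = t_ht(X, Y)
```

Because only differences within a sample enter the statistic, `t_ht` is unchanged when either sample is shifted or rescaled, and no location estimate is required.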

In deriving the asymptotic properties of $T_{H T}$, we impose the following two conditions used in Cheng et al. (2018):

(C1) $n_{1} / (n_{1} + n_{2}) \to κ \in (0, 1)$ as $\min \{n_{1}, n_{2}\} \to \infty$;
(C2) $\operatorname{tr} (Λ_{i} Λ_{j} Λ_{k} Λ_{l}) = o \{\operatorname{tr} (Λ_{i} Λ_{j}) \operatorname{tr} (Λ_{k} Λ_{l})\}$ for $i, j, k, l \in \{1, 2\}$.

Note that: (1) Condition (C1) is a commonly used condition in high-dimensional two-sample testing problems; (2) Condition (C2) is similar to Condition (A2) in Li and Chen (2012); (3) if all the eigenvalues of $Σ_{1}$ and $Σ_{2}$ are bounded, Condition (C2) holds.

Remark 2.1

Note that Conditions (C1) and (C2) do not contain any restriction on p relative to $n_{1}$ and $n_{2}$, since no such restriction is needed to control the following terms: $\begin{aligned} & \sum_{i, j, k, l} \{U (X_{i} - X_{j})^{T} U (X_{k} - X_{l})\}^{2} - \sum^{*} \{U (X_{i} - X_{j})^{T} U (X_{k} - X_{l})\}^{2}, \\ & \sum_{i, j, k, l} \{U (Y_{i} - Y_{j})^{T} U (Y_{k} - Y_{l})\}^{2} - \sum^{*} \{U (Y_{i} - Y_{j})^{T} U (Y_{k} - Y_{l})\}^{2}, \\ & \sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{1}} \sum_{k = 1}^{n_{2}} \sum_{l = 1}^{n_{2}} \{U (X_{i} - X_{j})^{T} U (Y_{k} - Y_{l})\}^{2} - \sum_{i = 1}^{n_{1}} \sum_{j \neq i}^{n_{1}} \sum_{k = 1}^{n_{2}} \sum_{l \neq k}^{n_{2}} \{U (X_{i} - X_{j})^{T} U (Y_{k} - Y_{l})\}^{2}, \end{aligned}$ which have been removed from $T_{H T}$. That is to say, we remove all the terms in which the two spatial signs share at least one observation, such as $\{U (X_{i} - X_{j})^{T} U (X_{i} - X_{j})\}^{2}$, $\{U (X_{i} - X_{j})^{T} U (X_{i} - X_{l})\}^{2}$ and so on. This type of strategy was previously used in Chen and Qinm (2010): in their test statistic the terms $\sum_{i} X_{i}^{T} X_{i}$ and $\sum_{k} Y_{k}^{T} Y_{k}$ are removed, so that no restriction on p, $n_{1}$ and $n_{2}$ is needed.

Under the above two conditions, the limiting null distribution of $T_{H T}$ is given in the following theorem.

Theorem 2.1

Under Conditions (C1), (C2) and $H_{0}$, as $n_{1}, n_{2}, p \to \infty$, $σ_{0, n}^{- 1} T_{H T} \overset{d}{\to} N (0, 1),$ where $σ_{0, n}^{2} = 4 (n_{1}^{- 1} + n_{2}^{- 1})^{2} (p + 2)^{- 2} \operatorname{tr}^{2} (Λ^{2})$ with $Λ = Λ_{1} = Λ_{2}$.

Moreover, we obtain the limiting distribution of $T_{H T}$ under $H_{1}$ .

Theorem 2.2

Under Conditions (C1), (C2) and $H_{1}$, as $n_{1}, n_{2}, p \to \infty$, $σ_{n}^{- 1} [T_{H T} - p \operatorname{tr} \{(§_{1} - §_{2})^{2}\}] \overset{d}{\to} N (0, 1),$ where $\begin{aligned} σ_{n}^{2} & = \frac{4}{n_{1} (n_{1} - 1)} \frac{\operatorname{tr}^{2} (Λ_{1}^{2})}{(p + 2)^{2}} + \frac{8}{n_{1}} \frac{p \operatorname{tr} (Λ_{1}^{4}) - \operatorname{tr}^{2} (Λ_{1}^{2})}{p^{2} (p + 2)} \\ & \quad + \frac{4}{n_{2} (n_{2} - 1)} \frac{\operatorname{tr}^{2} (Λ_{2}^{2})}{(p + 2)^{2}} + \frac{8}{n_{2}} \frac{p \operatorname{tr} (Λ_{2}^{4}) - \operatorname{tr}^{2} (Λ_{2}^{2})}{p^{2} (p + 2)} \\ & \quad + \frac{8}{n_{1} n_{2}} \frac{\operatorname{tr}^{2} (Λ_{1} Λ_{2})}{(p + 2)^{2}} + \left(\frac{8}{n_{1}} + \frac{8}{n_{2}}\right) \frac{p \operatorname{tr} \{(Λ_{1} Λ_{2})^{2}\} - \operatorname{tr}^{2} (Λ_{1} Λ_{2})}{p^{2} (p + 2)} \\ & \quad - \frac{16}{n_{1}} \frac{p \operatorname{tr} (Λ_{1}^{3} Λ_{2}) - \operatorname{tr} (Λ_{1} Λ_{2}) \operatorname{tr} (Λ_{1}^{2})}{p^{2} (p + 2)} - \frac{16}{n_{2}} \frac{p \operatorname{tr} (Λ_{2}^{3} Λ_{1}) - \operatorname{tr} (Λ_{1} Λ_{2}) \operatorname{tr} (Λ_{2}^{2})}{p^{2} (p + 2)} . \end{aligned}$
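As a consistency check (a sketch added here, not part of the original derivation), the variance in Theorem 2.2 can be compared with that in Theorem 2.1 by setting $Λ_{1} = Λ_{2} = Λ$ and reading the $n_{2}$ terms as entering additively, by symmetry with the $n_{1}$ terms. All terms with denominator $p^{2} (p + 2)$ then share the common factor $p \operatorname{tr} (Λ^{4}) - \operatorname{tr}^{2} (Λ^{2})$, and their coefficients cancel: $\frac{8}{n_{1}} + \frac{8}{n_{2}} + \left(\frac{8}{n_{1}} + \frac{8}{n_{2}}\right) - \frac{16}{n_{1}} - \frac{16}{n_{2}} = 0.$ The remaining terms give $\left\{\frac{4}{n_{1} (n_{1} - 1)} + \frac{4}{n_{2} (n_{2} - 1)} + \frac{8}{n_{1} n_{2}}\right\} \frac{\operatorname{tr}^{2} (Λ^{2})}{(p + 2)^{2}} = 4 (n_{1}^{- 1} + n_{2}^{- 1})^{2} (p + 2)^{- 2} \operatorname{tr}^{2} (Λ^{2}) \{1 + o (1)\},$ since $n_{i} (n_{i} - 1) = n_{i}^{2} \{1 + o (1)\}$. Hence $σ_{n}^{2} = σ_{0, n}^{2} \{1 + o (1)\}$ when the two shape matrices coincide.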

Due to the fact that $p \operatorname{tr} (§_{l}^{2}) = p^{- 1} \operatorname{tr} (Λ_{l}^{2}) \{1 + o (1)\}$ for l = 1, 2, obtained by Cheng et al. (2018), we propose to use the following estimator of $σ_{0, n}^{2}$: ${\hat{σ}}_{0, n}^{2} = 4 (n_{1}^{- 1} + n_{2}^{- 1})^{2} (n_{1} + n_{2})^{- 1} (p + 2)^{- 2} p^{2} (n_{1} A_{1} + n_{2} A_{2}),$ where $\begin{aligned} A_{1} & = \frac{1}{n_{1} (n_{1} - 1) (n_{1} - 2) (n_{1} - 3)} \sum^{*} \{U (X_{i} - X_{j})^{T} U (X_{k} - X_{l})\}^{2}, \\ A_{2} & = \frac{1}{n_{2} (n_{2} - 1) (n_{2} - 2) (n_{2} - 3)} \sum^{*} \{U (Y_{i} - Y_{j})^{T} U (Y_{k} - Y_{l})\}^{2} . \end{aligned}$ As shown by the following proposition, ${\hat{σ}}_{0, n}^{2}$ is a consistent estimator of $σ_{0, n}^{2}$ under $H_{0}$.

Proposition 2.1

Under Conditions (C1), (C2) and $H_{0}$ , ${\hat{σ}}_{0, n}^{2} / σ_{0, n}^{2} \to 1$ .

Therefore, the proposed test with a nominal α level of significance rejects $H_{0}$ if $T_{H T} \geq z_{α} {\hat{σ}}_{0, n}$, where $z_{α}$ is the upper α-quantile of $N (0, 1)$. The asymptotic power function of $T_{H T}$ is $β_{n_{1}, n_{2}} (§_{1}, §_{2}, α) = Φ \left[- σ_{n}^{- 1} σ_{0, n} z_{α} + p σ_{n}^{- 1} \operatorname{tr} \{(§_{1} - §_{2})^{2}\}\right],$ where $Φ (\cdot)$ denotes the cumulative distribution function of $N (0, 1)$.
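The rejection rule and the power formula are simple enough to write down with the standard library alone. In this sketch the function names and arguments are hypothetical: `sigma0_hat` stands for ${\hat{σ}}_{0, n}$, and `signal` stands for $p \operatorname{tr} \{(§_{1} - §_{2})^{2}\}$; both are assumed to have been computed elsewhere.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # N(0, 1)

def reject_h0(t_ht, sigma0_hat, alpha=0.05):
    """One-sided rule: reject when T_HT >= z_alpha * sigma0_hat."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)  # upper alpha-quantile
    return t_ht >= z_alpha * sigma0_hat

def asymptotic_power(signal, sigma_n, sigma0_n, alpha=0.05):
    """Phi(-sigma0_n * z_alpha / sigma_n + signal / sigma_n)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return _STD_NORMAL.cdf(-sigma0_n * z_alpha / sigma_n + signal / sigma_n)
```

With `signal = 0` and `sigma_n = sigma0_n`, the power reduces to the nominal level α, and it increases towards one as `signal / sigma_n` grows.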

2.2. Relationship with the test proposed in Cheng et al. (2018)

The proposed spatial rank test is more complex than existing ones, such as the spatial sign test proposed by Cheng et al. (2018). This is the price we pay for making the proposed method powerful for high-dimensional data where the dimension is potentially much larger than the squares of the sample sizes, especially for data generated from heavy-tailed distributions. Below we explain the motivation of the proposed method in detail.

First, we recall Lemma B.1 in Han and Liu (2018).

Lemma 2.3

Let $X, \tilde{X} \sim E_{p} (μ, Λ, F_{ξ})$, where $X$ and $\tilde{X}$ are independent. Then $E \{U (X - \tilde{X}) U (X - \tilde{X})^{T}\} = E \{U (X - μ) U (X - μ)^{T}\} .$

By Lemma 2.3, we have that $E \{U (X_{i} - X_{j}) U (X_{i} - X_{j})^{T}\} = E \{U (X_{i} - μ_{1}) U (X_{i} - μ_{1})^{T}\} = §_{1},$ for each $i, j \in \{1, \dots, n_{1}\}$ with $i \neq j$, where $E \{U (X_{i} - X_{j}) U (X_{i} - X_{j})^{T}\}$ is the so-called population multivariate Kendall's tau matrix of $X$ (Oja, 2010). Similarly, $E \{U (Y_{i} - Y_{j}) U (Y_{i} - Y_{j})^{T}\} = E \{U (Y_{i} - μ_{2}) U (Y_{i} - μ_{2})^{T}\} = §_{2},$ for each $i, j \in \{1, \dots, n_{2}\}$ with $i \neq j$, where $E \{U (Y_{i} - Y_{j}) U (Y_{i} - Y_{j})^{T}\}$ is the population multivariate Kendall's tau matrix of $Y$. Lemma 2.3 suggests that for each of the two populations, the population multivariate Kendall's tau matrix is the same as the spatial sign covariance matrix. As a result, testing equality of the two spatial sign covariance matrices is identical to testing equality of the two population multivariate Kendall's tau matrices.

Moreover, the three components of the squared Frobenius norm of the difference between $§_{1}$ and $§_{2}$, $\operatorname{tr} \{(§_{1} - §_{2})^{2}\} = \operatorname{tr} (§_{1}^{2}) + \operatorname{tr} (§_{2}^{2}) - 2 \operatorname{tr} (§_{1} §_{2})$, have the following equivalent representations: $\operatorname{tr} (§_{1}^{2}) = E [\{U (X_{i} - X_{j})^{T} U (X_{k} - X_{l})\}^{2}],$ for distinct $i, j, k, l \in \{1, \dots, n_{1}\}$; $\operatorname{tr} (§_{2}^{2}) = E [\{U (Y_{i} - Y_{j})^{T} U (Y_{k} - Y_{l})\}^{2}],$ for distinct $i, j, k, l \in \{1, \dots, n_{2}\}$; and $\operatorname{tr} (§_{1} §_{2}) = E [\{U (X_{i} - X_{j})^{T} U (Y_{k} - Y_{l})\}^{2}],$ for each $i, j \in \{1, \dots, n_{1}\}$ with $i \neq j$ and each $k, l \in \{1, \dots, n_{2}\}$ with $k \neq l$. These representations motivate the construction of $T_{H T}$ in the previous subsection, which is a consistent estimator of $p \operatorname{tr} \{(§_{1} - §_{2})^{2}\}$.
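These representations follow from a short trace argument, sketched here for completeness. Write $a = U (X_{i} - X_{j})$ and $b = U (Y_{k} - Y_{l})$, which are independent with $E (a a^{T}) = §_{1}$ and $E (b b^{T}) = §_{2}$ by Lemma 2.3. Then $E \{(a^{T} b)^{2}\} = E \{\operatorname{tr} (a a^{T} b b^{T})\} = \operatorname{tr} \{E (a a^{T}) E (b b^{T})\} = \operatorname{tr} (§_{1} §_{2}) .$ The same argument with $b = U (X_{k} - X_{l})$ gives $\operatorname{tr} (§_{1}^{2})$, provided that $a$ and $b$ are independent, which is exactly why the indexes $i, j, k, l$ must be distinct in the within-sample terms.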

Unlike the spatial sign covariance matrix, the multivariate Kendall's tau matrix can be estimated without estimating the spatial medians, whose estimators may introduce a bias and hence require a stronger condition on the dimension p. That is the reason why we propose a new test procedure based on the multivariate Kendall's tau matrix rather than the spatial sign covariance matrix. As a result, the condition imposed on the dimension p can be relaxed to some extent, which makes the proposed test procedure powerful for high-dimensional data, even when the dimension is much larger than the sample sizes.

In fact, in the spatial sign test proposed by Cheng et al. (2018), to test the equality of the two spatial sign covariance matrices $§_{1}$ and $§_{2}$, the test statistic is $T_{S S} = \frac{p}{n_{1} (n_{1} - 1)} \sum_{i \neq j}^{n_{1}} ({\hat{u}}_{i}^{T} {\hat{u}}_{j})^{2} + \frac{p}{n_{2} (n_{2} - 1)} \sum_{i \neq j}^{n_{2}} ({\hat{v}}_{i}^{T} {\hat{v}}_{j})^{2} - \frac{2 p}{n_{1} n_{2}} \sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} ({\hat{u}}_{i}^{T} {\hat{v}}_{j})^{2},$ where ${\hat{u}}_{i} = U (X_{i} - {\hat{μ}}_{1})$ and ${\hat{v}}_{j} = U (Y_{j} - {\hat{μ}}_{2})$ for $i = 1, \dots, n_{1}$, $j = 1, \dots, n_{2}$. Here, ${\hat{μ}}_{1}$ and ${\hat{μ}}_{2}$ are the spatial median estimators of $X$ and $Y$, respectively, obtained by using the estimation method proposed in Mottonen and Oja (1995). $T_{S S}$ is an estimator of $p \operatorname{tr} \{(§_{1} - §_{2})^{2}\}$, but unfortunately $E (T_{S S} / p) - \operatorname{tr} \{(§_{1} - §_{2})^{2}\} = δ_{n_{1}, n_{2}} \neq 0$, due to the spatial median estimators ${\hat{μ}}_{1}$ and ${\hat{μ}}_{2}$ (see Lemma 2 in Cheng et al., 2018). To obtain a consistent estimator of the bias $δ_{n_{1}, n_{2}}$, the condition $p = O \{(n_{1} + n_{2})^{2}\}$ was imposed in Cheng et al. (2018), which limits the application of $T_{S S}$ to high-dimensional data where the dimension is much larger than the squares of the sample sizes.
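To make the contrast with the rank-based statistic concrete, the sign-based statistic can be sketched as follows. This is an illustration only: the function names are hypothetical, the location estimates `mu1_hat` and `mu2_hat` are taken as inputs rather than computed by the Mottonen and Oja (1995) method, and the bias correction of Cheng et al. (2018) is not included.

```python
import numpy as np

def spatial_sign(Z):
    """Row-wise spatial sign U(z) = z / ||z||, with U(0) = 0."""
    Z = np.asarray(Z, dtype=float)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return np.divide(Z, norms, out=np.zeros_like(Z), where=norms > 0)

def t_ss(X, Y, mu1_hat, mu2_hat):
    """Uncorrected sign-based statistic with user-supplied centres."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n1, p = X.shape
    n2 = Y.shape[0]
    U = spatial_sign(X - mu1_hat)  # u_i = U(X_i - mu1_hat)
    V = spatial_sign(Y - mu2_hat)  # v_j = U(Y_j - mu2_hat)
    GU = (U @ U.T) ** 2
    GV = (V @ V.T) ** 2
    # Off-diagonal sums implement the i != j restriction.
    term1 = (GU.sum() - np.trace(GU)) / (n1 * (n1 - 1))
    term2 = (GV.sum() - np.trace(GV)) / (n2 * (n2 - 1))
    cross = ((U @ V.T) ** 2).sum() / (n1 * n2)
    return p * (term1 + term2 - 2.0 * cross)

# Tiny deterministic example: both samples are the two unit vectors in R^2.
E2 = np.eye(2)
value = t_ss(E2, E2, np.zeros(2), np.zeros(2))
```

In contrast to a statistic built from pairwise differences, every term here depends on the supplied centres, which is where the estimation bias enters.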

3. Simulation study

In this section, we present numerical results to demonstrate the performance of the proposed test (abbreviated as HT) in high-dimensional cases, in comparison with two existing popular tests, the test proposed by Li and Chen (2012) (abbreviated as LZ) and the spatial sign test proposed by Cheng et al. (2018) (abbreviated as SS). The following three scenarios are considered.

Multivariate normal distribution: $X \sim N_{p} (0, Σ_{1})$ and $Y \sim N_{p} (0, Σ_{2})$.
Multivariate t-distribution: $X \sim t_{p} (0, Σ_{1}, 3)$ and $Y \sim t_{p} (0, Σ_{2}, 3)$.
Multivariate mixture normal distribution: $X \sim {M N}_{p, γ, 9} (0, Σ_{1}) \triangleq γ N_{p} (0, Σ_{1}) + (1 - γ) N_{p} (0, 9 Σ_{1})$ and $Y \sim {M N}_{p, γ, 9} (0, Σ_{2})$, with $γ = 0.8$.

For all the above scenarios, let $Σ_{1} = ({0.3}^{| i - j |})$ and $Σ_{2} = (ρ^{| i - j |})$ with $ρ = 0.3$ , 0.6, 0.7. Then, $ρ = 0.3$ corresponds to the situation where the null hypothesis is true, while $ρ = 0.6$ or 0.7 corresponds to the situation where the alternative hypothesis is true. Note that all the following simulation results are obtained based on 1000 replications.
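The three scenarios can be simulated along the following lines. This is a sketch under stated assumptions rather than the authors' simulation code: the helper names `ar1_cov` and `sample_scenario` are hypothetical, $Σ$ is used as the scale matrix of the t-distribution, and the mixture is sampled by scaling a normal draw by 1 or 3 with probabilities $γ$ and $1 - γ$.

```python
import numpy as np

def ar1_cov(p, rho):
    """Matrix with (i, j) entry rho^{|i - j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sample_scenario(kind, n, Sigma, rng, gamma=0.8):
    """Draw n rows from one of the three simulation scenarios."""
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, p)) @ L.T  # rows ~ N_p(0, Sigma)
    if kind == "normal":
        return Z
    if kind == "t3":
        # Multivariate t with 3 degrees of freedom: normal / sqrt(chi2_3 / 3).
        w = rng.chisquare(3, size=n) / 3.0
        return Z / np.sqrt(w)[:, None]
    if kind == "mixture":
        # With probability gamma keep N(0, Sigma), otherwise N(0, 9 * Sigma).
        scale = np.where(rng.random(n) < gamma, 1.0, 3.0)
        return Z * scale[:, None]
    raise ValueError(f"unknown scenario: {kind}")

rng = np.random.default_rng(0)
X = sample_scenario("t3", 15, ar1_cov(100, 0.3), rng)
Y = sample_scenario("t3", 15, ar1_cov(100, 0.6), rng)  # an alternative
```

Setting the second correlation parameter to 0.3 instead of 0.6 reproduces the null configuration.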

First, to observe the influence of the dimension p on the potential bias of the methods involved, we summarize the mean-standard deviation-ratio $E (T) / \sqrt{\operatorname{var} (T)}$ and the variance estimator ratio $\widehat{\operatorname{var}} (T) / \operatorname{var} (T)$ under the null hypothesis in Table 1 for each $T \in \{T_{H T}, T_{S S}, T_{L Z}\}$ with $n_{1} = n_{2} = 15$ and p = 100, 200, 400, 800, 1200, where $T_{L Z}$ is the test statistic proposed in Li and Chen (2012), denoted by $T_{L C}$ in Section 1. Since the exact values of $E (T)$ and $\operatorname{var} (T)$ are difficult to calculate, we replace them with their Monte Carlo estimators, using 1000 repeated samplings.
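Given the replicated values of a statistic and of its variance estimator, the two summaries can be computed as below. This is a sketch: the function name is hypothetical, and averaging the variance estimates across replications is an assumption about how $\widehat{\operatorname{var}} (T)$ is summarised.

```python
import numpy as np

def null_summaries(stats, var_hats):
    """Monte Carlo mean-sd ratio and variance estimator ratio."""
    stats = np.asarray(stats, dtype=float)
    var_hats = np.asarray(var_hats, dtype=float)
    mc_var = stats.var(ddof=1)                   # Monte Carlo var(T)
    mean_sd_ratio = stats.mean() / np.sqrt(mc_var)
    var_ratio = var_hats.mean() / mc_var         # assumed: mean of estimates
    return mean_sd_ratio, var_ratio
```

A mean-sd ratio near zero indicates little bias under the null, and a variance estimator ratio near one indicates that the variance estimator is well calibrated.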

Table 1. Comparison of the mean-standard deviation-ratio and the variance estimator ratio at the 5% level with $n_{1} = n_{2} = 15$ and p = 100, 200, 400, 800, 1200.


Table 1 indicates that SS has worse mean-standard deviation-ratio results than the other two methods in high-dimensional situations, particularly when $p > (n_{1} + n_{2})^{2}$. This is most likely due to the fact that in $T_{S S}$ the bias correction process is limited by the condition that $p = O \{(n_{1} + n_{2})^{2}\}$. On the other hand, as suggested by the variance estimator ratio results in Table 1, the estimated variances of LZ become larger than the true ones, particularly in non-normal situations. In contrast, HT performs better in both respects.

Next, we compare the performance of the three methods in terms of empirical size and empirical power. Let $n_{1} = n_{2} = 15$, 20, 30 and p = 100, 200, 400, 800, 1200. Tables 2–4 summarize the empirical size and power results of the three methods. First, the empirical size results in Tables 2–4, corresponding to the setting of $ρ = 0.3$, suggest that LZ fails to control the empirical size in the non-normal cases. Moreover, when comparing HT with SS, we find that their performance is very similar, except in the cases where the dimension is comparable to or larger than the squares of the sample sizes, e.g. $p = 1200 > (15 + 15)^{2} = 900$. In such cases, SS may lose control of the empirical size, which is consistent with the conclusion drawn from Table 1. In a few cases, the empirical size is slightly larger than 5%, but still within a reasonable range. To compare the empirical size and power of the three tests comprehensively, in Figure 1 we present the receiver operating characteristic (ROC) curves for the three tests with $(n_{1}, n_{2}, p) = (15, 15, 800)$. As suggested by Figure 1, these tests have similar performance under the multivariate normal distribution, while under the remaining heavy-tailed distributions, the area under the ROC curve (AUC) of the proposed HT test is larger than the AUCs of its competitors. This further demonstrates the advantages of the proposed test.

Figure 1. ROC curves of the involved tests under the three scenarios with $(n_{1}, n_{2}, p) = (15, 15, 800)$ .

Table 2. Empirical size and power comparison at the 5% level with $n_{1} = n_{2} = 15$ and p = 100, 200, 400, 800, 1200.


Table 3. Empirical size and power comparison at the 5% level with $n_{1} = n_{2} = 20$ and p = 100, 200, 400, 800, 1200.


Table 4. Empirical size and power comparison at the 5% level with $n_{1} = n_{2} = 30$ and p = 100, 200, 400, 800, 1200.


Next, we consider an alternative structure of the covariance matrices, i.e. $Σ_{i} = (a_{i k l})$ for each $i \in \{1, 2\}$, where $a_{i k k} = 1, \quad a_{i k, k + 1} = \frac{ρ_{i} + ρ_{i}^{2}}{1 + 2 ρ_{i}^{2}}, \quad a_{i k, k + 2} = \frac{ρ_{i}}{1 + 2 ρ_{i}^{2}}$ for each $k \in \{1, \dots, p\}$, $Σ_{i}$ is symmetric, and the remaining entries of $Σ_{i}$ are all zeros. Note that $Σ_{i}$ is the covariance matrix of $x_{i}$ following the MA(2) model $x_{i t} = z_{i t} + ρ_{i} z_{i, t - 1} + ρ_{i} z_{i, t - 2},$ where the $z_{i t}$ are i.i.d. random variables with mean zero and variance $\frac{1}{1 + 2 ρ_{i}^{2}}$. Under the null hypothesis, we set $ρ_{1} = ρ_{2} = 0.7$, while under the alternative hypothesis, we set, for instance, $ρ_{1} = 0.7$ and $ρ_{2} = 0.1$. The other settings are the same as above. Tables 5 and 6 report the empirical sizes and power of the three methods, respectively. Although Table 6 suggests that the empirical power of the three methods is similar, Table 5 suggests that the ability of LZ and SS to control the empirical size weakens much more quickly than that of HT as p increases for fixed $n_{1}$ and $n_{2}$, especially when the dimension is comparable to or larger than the squares of the sample sizes.
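The banded entries can be checked against the autocovariances of the MA(2) model. The sketch below uses hypothetical helper names (`ma2_cov`, `ma_autocov`) and relies only on the standard formula $γ (h) = σ_{z}^{2} \sum_{j} θ_{j} θ_{j + h}$ for a moving-average process with coefficients $θ = (1, ρ, ρ)$.

```python
import numpy as np

def ma2_cov(p, rho):
    """Banded p x p covariance matrix of the MA(2) model (assumed helper)."""
    denom = 1.0 + 2.0 * rho ** 2
    idx = np.arange(p)
    lag = np.abs(idx[:, None] - idx[None, :])
    Sigma = np.zeros((p, p))
    Sigma[lag == 0] = 1.0
    Sigma[lag == 1] = (rho + rho ** 2) / denom
    Sigma[lag == 2] = rho / denom
    return Sigma

def ma_autocov(theta, innovation_var, h):
    """Lag-h autocovariance of x_t = sum_j theta[j] * z_{t-j}."""
    theta = np.asarray(theta, dtype=float)
    return innovation_var * float(np.dot(theta[: len(theta) - h], theta[h:]))

rho = 0.7
Sigma = ma2_cov(6, rho)
theta = [1.0, rho, rho]
sigma2 = 1.0 / (1.0 + 2.0 * rho ** 2)  # innovation variance from the text
```

Comparing `Sigma[0, h]` with `ma_autocov(theta, sigma2, h)` for h = 0, 1, 2 confirms the unit diagonal and the two off-diagonal bands.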

Table 5. Empirical size comparison at the 5% level with the MA(2) covariance matrices with $n_{1} = n_{2} = 15$ , 20, 30 and p = 100, 200, 400, 800, 1200.


Table 6. Empirical power comparison at the 5% level with the MA(2) covariance matrices with $n_{1} = n_{2} = 15$ , 20, 30 and p = 100, 200, 400, 800, 1200.


Overall, the comprehensive numerical results suggest that the proposed HT test has obvious advantages in terms of controlling empirical size over the existing two methods. Such gain is especially clear when the original distribution deviates from normality, and when the dimension is larger than the squares of sample sizes.

4. Application

In this section, we apply the proposed testing method to a gene expression dataset, which contains the expression levels of the 2000 genes with the highest minimal intensity across 62 tissues. Each entry in the dataset is a gene intensity derived using the filtering process proposed in Alon et al. (1999). The dataset was previously studied by Alon et al. (1999), and can be freely downloaded from the following website: http://genomics-pubs.princeton.edu/oncology/affydata/index.html.

Among the 62 tissues, there are 22 normal tissues and 40 tumour colon tissues. We aim to test the hypothesis that the tumour group and the normal group have proportional covariance matrices in terms of the expression levels of the 2000 genes, where the dimension 2000 is larger than the squares of the sample sizes, 484 and 1600.

First, normality was tested for the expression data of each gene, using the Shapiro–Wilk test. The top two panels of Figure 2 present the histograms of the p-values of the normality tests for the tumour group and the normal group, respectively, which indicate that for a large number of genes the expression data are non-normal. In fact, at the significance level of 0.05, the overall rejection rates of the normality tests are 93.55% and 37.75% for the tumour group and the normal group, respectively. This motivates us to use a nonparametric approach for testing the above hypothesis, which can deal with high-dimensional data from non-normal distributions.

The bottom two panels of Figure 2 indicate that some genes have very high sample means in terms of expression. The sample means vary widely within each of the two groups, and the dimension is larger than the squares of the sample sizes, which raises the concern that a spatial sign-based approach may lead to an uncontrollable bias. Hence, in theory, a spatial rank-based approach is more appropriate for this dataset.

Figure 2. Histograms of the p-values of the normality tests and the gene expression means, for the tumour group and the normal group, respectively.

For these reasons, we apply the proposed HT test to this dataset. The test statistic and p-value of the HT test are 4.823 and 0.000, respectively; hence the null hypothesis is rejected, which suggests that the covariance matrix of the gene expression levels of the tumour group is not proportional to that of the normal group. This result can also be intuitively verified by comparing the sample correlation matrices of the two groups. For convenience and for demonstration purposes, in Figure 3 we only plot the heatmaps of the sample correlation matrices of the two groups, as well as the difference of the two matrices, using the first 100 genes in the original data. The heatmaps show visible differences between the two sample correlation matrices, which tends to support our rejection of the null hypothesis.
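The reported p-value is consistent with a one-sided normal tail probability, assuming (this is an assumption, not stated explicitly above) that 4.823 is the standardized statistic $T_{H T} / {\hat{σ}}_{0, n}$. A minimal check using only the standard library:

```python
import math

def upper_tail_p_value(z):
    """P(N(0, 1) >= z), computed via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Assumed: 4.823 is the standardized HT statistic.
p_value = upper_tail_p_value(4.823)
print(f"{p_value:.3f}")  # rounds to 0.000
```

Under that assumption the tail probability is far below 0.001, which matches the three-decimal value reported.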

Figure 3. Heatmaps of the sample correlation matrices of the two groups as well as the difference of the two matrices, which are constructed via the first 100 genes in the original data. (a) Normal group, (b) tumour group and (c) difference of two groups.
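The matrices behind such heatmaps can be computed as in the sketch below. It uses synthetic data and a hypothetical helper `correlation_difference`; the group sizes and the number of genes are placeholders, and the plotting step itself is only indicated in a comment.

```python
import numpy as np


def correlation_difference(x, y, n_genes=100):
    """Sample correlation matrices of two groups on the first `n_genes` columns.

    x, y : arrays of shape (n_samples, n_total_genes), one per group.
    Returns (corr_x, corr_y, corr_x - corr_y), each of shape (n_genes, n_genes).
    """
    xs = np.asarray(x, dtype=float)[:, :n_genes]
    ys = np.asarray(y, dtype=float)[:, :n_genes]
    corr_x = np.corrcoef(xs, rowvar=False)  # columns are variables (genes)
    corr_y = np.corrcoef(ys, rowvar=False)
    return corr_x, corr_y, corr_x - corr_y


# Synthetic placeholders for the two groups (sizes are arbitrary).
rng = np.random.default_rng(0)
group_a = rng.normal(size=(30, 150))
group_b = rng.normal(size=(20, 150))

corr_a, corr_b, diff = correlation_difference(group_a, group_b)
# Each matrix could then be drawn with, e.g., matplotlib's imshow.
print(diff.shape, float(np.abs(diff).max()))
```

Because both correlation matrices have unit diagonals, the diagonal of the difference is zero; only the off-diagonal pattern carries information. Note that differing correlation matrices are an informal visual check, not a substitute for the test.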

5. Conclusion

We have proposed the HT test, a new high-dimensional spatial rank test for the proportionality of two high-dimensional covariance matrices, which is a high-dimensional extension of Kendall's tau test. It inherits the robustness of traditional spatial rank-based methods and is well suited to high-dimensional data in which the dimension can be much larger than the squares of the sample sizes. We rigorously establish the asymptotic distribution of the proposed test statistic. Compared with some existing test procedures, the advantage of HT in empirical size and power is especially clear for high-dimensional and heavy-tailed data, as the numerical evidence shows. The real data analysis demonstrates the applicability of the proposed method to high-dimensional gene expression data.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [Grant Numbers 11501092, 11571068] and the Special Fund for Key Laboratories of Jilin Province, China [Grant Number 20190201285JC].

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., & Levine, D. M. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12), 6745–6750. https://doi.org/10.1073/pnas.96.12.6745
Bai, Z., & Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statistica Sinica, 6(2), 311–329.
Barber, R. F., & Kolar, M. (2018). Rocket: Robust confidence intervals via Kendall's tau for transelliptical graphical models. The Annals of Statistics, 46(6B), 3422–3450. https://doi.org/10.1214/17-AOS1663
Bühlmann, P., & van de Geer, S. (2011). Statistics for High-dimensional Data: Methods, Theory and Applications (1st ed.). Springer Publishing Company, Incorporated.
Cai, T. T., & Zhang, A. (2016). Inference for high-dimensional differential correlation matrices. Journal of Multivariate Analysis, 143(6009), 107–126. https://doi.org/10.1016/j.jmva.2015.08.019
Chen, S. X., & Qin, Y. L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Annals of Statistics, 38(2), 808–835. https://doi.org/10.1214/09-AOS716
Cheng, G., Liu, B., Peng, L., Zhang, B., & Zheng, S. (2018). Testing the equality of two high-dimensional spatial sign covariance matrices. Scandinavian Journal of Statistics, 46(1), 257–271. https://doi.org/10.1111/sjos.v46.1
Eriksen, P. S. (1987). Proportionality of covariance matrices. Annals of Statistics, 15(2), 732–748. https://doi.org/10.1214/aos/1176350372
Fang, K. T., Kotz, S., & Ng, K. W. (1990). Symmetric Multivariate and Related Distributions. Chapman and Hall.
Federer, W. T. (1951). Testing proportionality of covariance matrices. Annals of Mathematical Statistics, 22(1), 102–106. https://doi.org/10.1214/aoms/1177729697
Feng, L., & Liu, B. (2017). High-dimensional rank tests for sphericity. Journal of Multivariate Analysis, 155, 217–233. https://doi.org/10.1016/j.jmva.2017.01.003
Feng, L., & Sun, F. (2016). Spatial-sign based high-dimensional location test. Electronic Journal of Statistics, 10(2), 2420–2434. https://doi.org/10.1214/16-EJS1176
Feng, L., Zou, C., & Wang, Z. (2016). Multivariate-sign-based high-dimensional tests for the two-sample location problem. Journal of the American Statistical Association, 111(514), 721–735. https://doi.org/10.1080/01621459.2015.1035380
Flury, B. K. (1986). Proportionality of k covariance matrices. Statistics and Probability Letters, 4(1), 29–33. https://doi.org/10.1016/0167-7152(86)90035-0
Flury, B. K., & Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach. Chapman and Hall.
Hall, P. G., & Hyde, C. C. (1980). Martingale Central Limit Theory and Its Applications. Academic Press.
Han, F., Chen, S., & Liu, H. (2017). Distribution-free tests of independence in high dimensions. Biometrika, 104(4), 813–828. https://doi.org/10.1093/biomet/asx050
Han, F., & Liu, H. (2018). ECA: High-dimensional elliptical component analysis in non-Gaussian distributions. Journal of the American Statistical Association, 113(521), 252–268. https://doi.org/10.1080/01621459.2016.1246366
Kim, D. Y. (1971). Statistical inference for constants of proportionality between covariance matrices. Technical Report 59, Stanford University.
Leung, D., & Drton, M. (2018). Testing independence in high dimensions with sums of rank correlations. The Annals of Statistics, 46(1), 280–307. https://doi.org/10.1214/17-AOS1550
Li, J., & Chen, S. X. (2012). Two sample tests for high dimensional covariance matrices. Annals of Statistics, 40(2), 908–940. https://doi.org/10.1214/12-AOS993
Liu, B., Xu, L., Zheng, S., & Tian, G. (2014). A new test for the proportionality of two large-dimensional covariance matrices. Journal of Multivariate Analysis, 131(1), 293–308. https://doi.org/10.1016/j.jmva.2014.06.008
Magyar, A., & Tyler, D. (2014). The asymptotic inadmissibility of the spatial sign covariance matrix for elliptically symmetric distributions. Biometrika, 101(3), 673–688. https://doi.org/10.1093/biomet/asu020
Mottonen, J., & Oja, H. (1995). Multivariate spatial sign and rank methods. Journal of Nonparametric Statistics, 5(2), 201–213. https://doi.org/10.1080/10485259508832643
Oja, H. (2010). Multivariate nonparametric methods with R. Springer.
Rao, C. R. (1983). Likelihood ratio tests for relationships between two covariance matrices. In S. Karlin, T. Amemiya, & L. A. Goodman (Eds.), Studies in Econometrics, Time Series and Multivariate Statistics (pp. 529–543). Academic Press.
Schott, J. R. (1991). Some tests for common principal component subspaces in several groups. Biometrika, 78(4), 771–777. https://doi.org/10.1093/biomet/78.4.771
Schott, J. R. (1999). A test for proportional covariance matrices. Computational Statistics and Data Analysis, 32(2), 135–146. https://doi.org/10.1016/S0167-9473(99)00032-8
Wang, L., Peng, B., & Li, R. (2015). A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association, 110(512), 1658–1669. https://doi.org/10.1080/01621459.2014.988215
Xu, L., Liu, B., Zheng, S., & Bao, S. (2014). Testing proportionality of two large-dimensional covariance matrices. Computational Statistics and Data Analysis, 78, 43–55. https://doi.org/10.1016/j.csda.2014.03.014
Zou, C. L., Peng, L. H., Feng, L., & Wang, Z. J. (2014). Multivariate sign-based high-dimensional tests for sphericity. Biometrika, 101(1), 229–236. https://doi.org/10.1093/biomet/ast040

Appendix

Define

$U_{X, i} = U (X_{i} - \theta_{X}), \quad U_{Y, i} = U (Y_{i} - \theta_{Y}) .$
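For readers who want to compute these quantities, the sketch below assumes the usual spatial sign function $U(x) = x / \|x\|$ for $x \neq 0$ and $U(0) = 0$; that definition is an assumption here, since this appendix does not restate it. The centre `theta` is likewise a placeholder for $\theta_X$, which would have to be known or estimated separately.

```python
import numpy as np


def spatial_sign(x):
    """Row-wise spatial sign: x / ||x|| for nonzero rows, 0 for zero rows."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    out = np.zeros_like(x)
    # Divide only where the norm is positive; zero rows stay zero.
    np.divide(x, norms, out=out, where=norms > 0)
    return out


# Hypothetical sample and centre (theta stands in for theta_X).
sample = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, -2.0]])
theta = np.zeros(2)
signs = spatial_sign(sample - theta)  # rows play the role of U_{X,i}
print(signs)
```

Every nonzero observation is mapped onto the unit sphere, so only its direction from the centre is kept; this is what makes sign-based statistics insensitive to heavy tails in the radial part.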

Before proving the main theorem, below we recall some necessary lemmas.

Lemma A.1

Under Conditions (C1) and (C2), for any $p \times p$ symmetric matrix $W$, $\begin{aligned} E \{(U_{X, i}^{T} U_{X, j})^{4}\} & = O (1) E^{2} \{(U_{X, i}^{T} U_{X, j})^{2}\}, \\ E \{(U_{X, i}^{T} W U_{X, i})^{2}\} & = O (1) E^{2} (U_{X, i}^{T} W U_{X, i}), \\ E \{(U_{X, i}^{T} W U_{X, j})^{2}\} & = O (1) E^{2} (U_{X, i}^{T} W U_{X, j}) . \end{aligned}$

Note that Lemma A.1 is the same as Lemma 1 of Wang et al. (2015).

Lemma A.2

Let $U^{*} = (U_{1}^{*}, \dots, U_{p}^{*})^{T}$ be a random vector uniformly distributed on the unit sphere of $R^{p}$. Then we have that

$E (U^{*}) = 0$, $\operatorname{Cov} (U^{*}) = p^{- 1} I_{p}$, $E (U_{k}^{* 4}) = 3 p^{- 1} (p + 2)^{- 1}$ and $E (U_{k}^{* 2} U_{l}^{* 2}) = p^{- 1} (p + 2)^{- 1}$ for each $k, l \in \{1, \dots, p\}$ with $k \neq l$;
for any $p \times p$ symmetric matrix $W$, $E \{(U^{* T} W U^{*})^{2}\} = p^{- 1} (p + 2)^{- 1} \{\operatorname{tr}^{2} (W) + 2 \operatorname{tr} (W^{2})\}$ and $E \{(U^{* T} W U^{*})^{4}\} = p^{- 2} (p + 2)^{- 2} \{3 \operatorname{tr}^{2} (W^{2}) + 6 \operatorname{tr} (W^{4})\}$.

In Lemma A.2, the first statement has been proved in Section 3.1 of Fang et al. (1990) and the second statement has been proved in Zou et al. (2014).
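The low-order moment formulas of Lemma A.2 are easy to check by simulation. The sketch below is only a numerical sanity check under arbitrary assumed choices (dimension `p = 5`, a random symmetric `w`, and the Monte Carlo size); it verifies the first statement and the second-moment identity for the quadratic form, and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_draws = 5, 400_000

# Normalising standard normal vectors gives uniform draws on the unit sphere.
g = rng.normal(size=(n_draws, p))
u = g / np.linalg.norm(g, axis=1, keepdims=True)

# First statement: E(U_k^4) = 3 / {p (p + 2)}, E(U_k^2 U_l^2) = 1 / {p (p + 2)}.
m4_hat = np.mean(u[:, 0] ** 4)
m22_hat = np.mean(u[:, 0] ** 2 * u[:, 1] ** 2)
m4_theory = 3.0 / (p * (p + 2))
m22_theory = 1.0 / (p * (p + 2))

# Second statement (second moment only), for an arbitrary symmetric matrix w.
a = rng.normal(size=(p, p))
w = (a + a.T) / 2.0
quad = np.einsum("ij,jk,ik->i", u, w, u)  # u_i^T w u_i for every draw
q2_hat = np.mean(quad ** 2)
q2_theory = (np.trace(w) ** 2 + 2.0 * np.trace(w @ w)) / (p * (p + 2))

print(m4_hat, m4_theory)
print(m22_hat, m22_theory)
print(q2_hat, q2_theory)
```

The simulated and theoretical values should agree up to Monte Carlo error, which shrinks at the rate `n_draws ** -0.5`.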

Now, we are ready to present the proof of Theorem 2.2. Then, the proof of Theorem 2.1 can be directly obtained.

Proof of Theorem 2.2:

Define $\begin{aligned} V_{X, i} & \doteq E \{U (X_{i} - X_{j}) \mid X_{i}\}, \quad V_{Y, i} \doteq E \{U (Y_{i} - Y_{j}) \mid Y_{i}\}, \\ W_{Y, i j} & \doteq U (Y_{i} - Y_{j}) - V_{Y, i} + V_{Y, j}, \\ W_{X, i j} & \doteq U (X_{i} - X_{j}) - V_{X, i} + V_{X, j}, \\ B_{1} & \doteq E (V_{X, i} V_{X, i}^{T}), \quad B_{2} \doteq E (V_{Y, i} V_{Y, i}^{T}) . \end{aligned}$ Hence we have that $E (V_{X, i}^{T} V_{X, j}) = 0$ and $E (V_{X, i}^{T} W_{X, i j}) = 0$. According to Lemma 1 in Feng and Liu (2017), we have that $E (W_{X, i j}^{T} W_{X, i j}) \to 0$ as $p$ goes to infinity and $B_{1} = 0.5 §_{1} \{1 + o (1)\}$. The same goes for $W_{Y, i j}$ and $B_{2}$. On this ground, by Lemma A.1, we have that $\begin{aligned} E \{(V_{X, i}^{T} V_{X, j})^{4}\} & = O (1) E^{2} \{(V_{X, i}^{T} V_{X, j})^{2}\}, \\ E \{(V_{X, i}^{T} A V_{X, i})^{2}\} & = O (1) E^{2} (V_{X, i}^{T} A V_{X, i}), \\ E \{(V_{X, i}^{T} A V_{X, j})^{2}\} & = O (1) E^{2} (V_{X, i}^{T} A V_{X, j}) . \end{aligned}$ As a result, the first part of $T_{H T}$ has the following decomposition: $\begin{aligned} & \frac{p}{n_{1} (n_{1} - 1) (n_{1} - 2) (n_{1} - 3)} \sum^{*} \{U (X_{i} - X_{j})^{T} U (X_{k} - X_{l})\}^{2} \\ & \quad = \frac{4 p}{n_{1} (n_{1} - 1)} \sum^{*} (V_{X, i}^{T} V_{X, j})^{2} + \frac{2 p}{n_{1} (n_{1} - 1) (n_{1} - 2)} \sum^{*} (V_{X, i}^{T} W_{X, k l})^{2} \\ & \qquad + \frac{p}{n_{1} (n_{1} - 1) (n_{1} - 2) (n_{1} - 3)} \sum^{*} (W_{X, i j}^{T} W_{X, k l})^{2} \\ & \quad \doteq J_{1} + J_{2} + J_{3} . \end{aligned}$ According to Lemma A.2 and the fact that $E (W_{X, i j}^{T} W_{X, i j}) \to 0$ as $p$ goes to infinity, we similarly have that $E (J_{2}^{2}) = o \{p^{2} n^{- 3} \operatorname{tr} (§_{1}^{2})\} = o (\sigma_{n}^{2})$ and $E (J_{3}^{2}) = o (p^{2} n^{- 4}) = o (\sigma_{n}^{2})$.

Using similar techniques, we can decompose the remaining two parts of $T_{H T}$ and conclude that $\begin{aligned} T_{H T} & = \frac{4 p}{n_{1} (n_{1} - 1)} \sum^{*} (V_{X, i}^{T} V_{X, j})^{2} + \frac{4 p}{n_{2} (n_{2} - 1)} \sum^{*} (V_{Y, i}^{T} V_{Y, j})^{2} \\ & \quad - \frac{8 p}{n_{1} n_{2}} \sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} (V_{X, i}^{T} V_{Y, j})^{2} + o_{p} (\sigma_{n}) \\ & \doteq p A_{n_{1}} + p B_{n_{2}} - 2 p C_{n_{1}, n_{2}} + o_{p} (\sigma_{n}) . \end{aligned}$ Therefore, we have that $\begin{aligned} \operatorname{var} (T_{H T}) / \sigma_{n}^{2} & = p^{2} \sigma_{n}^{- 2} \{\operatorname{var} (A_{n_{1}}) + \operatorname{var} (B_{n_{2}}) + 4 \operatorname{var} (C_{n_{1}, n_{2}}) \\ & \quad - 4 \operatorname{cov} (A_{n_{1}}, C_{n_{1}, n_{2}}) - 4 \operatorname{cov} (B_{n_{2}}, C_{n_{1}, n_{2}})\} + o (1) . \end{aligned}$ Below we consider each term in $\operatorname{var} (T_{H T}) / \sigma_{n}^{2}$ one by one. Before we can obtain an explicit expression for $\operatorname{var} (A_{n_{1}})$, we need to study $E (A_{n_{1}}^{2})$ first. We have that $\begin{aligned} E (A_{n_{1}}^{2}) & = \frac{16}{n_{1}^{2} (n_{1} - 1)^{2}} E \Big[\Big\{\sum^{*} (V_{X, i}^{T} V_{X, j})^{2}\Big\}^{2}\Big] \\ & = \frac{16}{n_{1}^{2} (n_{1} - 1)^{2}} \Big[2 n_{1} (n_{1} - 1) E \{(V_{X, i}^{T} V_{X, j})^{4}\} \\ & \quad + 4 n_{1} (n_{1} - 1) (n_{1} - 2) E \{(V_{X, i}^{T} V_{X, j})^{2} (V_{X, i}^{T} V_{X, k})^{2}\} \\ & \quad + n_{1} (n_{1} - 1) (n_{1} - 2) (n_{1} - 3) E \{(V_{X, i}^{T} V_{X, j})^{2} (V_{X, k}^{T} V_{X, l})^{2}\}\Big] . \end{aligned}$ Using the same proof techniques as in Cheng et al. (2018), we can get the following equations: $\begin{aligned} E \{(V_{X, i}^{T} V_{X, j})^{4}\} & = \frac{1}{4} p^{2} (p + 2)^{- 2} \{3 \operatorname{tr}^{2} (\Lambda_{1}^{2}) + 6 \operatorname{tr} (\Lambda_{1}^{4})\} \{1 + o (1)\}, \\ E \{(V_{X, i}^{T} V_{X, j})^{2}\} & = \frac{1}{2} \operatorname{tr} (§_{1}^{2}) \{1 + o (1)\} = p^{- 2} \operatorname{tr} (\Lambda_{1}^{2}) \{1 + o (1)\}, \\ E \{(V_{X, i}^{T} V_{X, j})^{2} (V_{X, i}^{T} V_{X, k})^{2}\} & = \frac{1}{4} p^{- 3} (p + 2)^{- 1} \{\operatorname{tr}^{2} (\Lambda_{1}^{2}) + 2 \operatorname{tr} (\Lambda_{1}^{4})\} \{1 + o (1)\} . \end{aligned}$

On this ground, we have that $\begin{aligned} \operatorname{var} (A_{n_{1}}) = \Big\{\frac{4}{n_{1} (n_{1} - 1)} \frac{\operatorname{tr}^{2} (\Lambda_{1}^{2})}{p^{2} (p + 2)^{2}} + \frac{8}{n_{1}} \frac{p \operatorname{tr} (\Lambda_{1}^{4}) - \operatorname{tr}^{2} (\Lambda_{1}^{2})}{p^{4} (p + 2)}\Big\} \{1 + o (1)\} . \end{aligned}$ Similarly, we have that $\begin{aligned} \operatorname{var} (B_{n_{2}}) & = \Big\{\frac{4}{n_{2} (n_{2} - 1)} \frac{\operatorname{tr}^{2} (\Lambda_{2}^{2})}{p^{2} (p + 2)^{2}} + \frac{8}{n_{2}} \frac{p \operatorname{tr} (\Lambda_{2}^{4}) - \operatorname{tr}^{2} (\Lambda_{2}^{2})}{p^{4} (p + 2)}\Big\} \{1 + o (1)\}, \\ \operatorname{var} (C_{n_{1}, n_{2}}) & = \Big[\frac{2}{n_{1} n_{2}} \frac{\operatorname{tr}^{2} (\Lambda_{1} \Lambda_{2})}{p^{2} (p + 2)^{2}} + \Big(\frac{2}{n_{1}} + \frac{2}{n_{2}}\Big) \frac{p \operatorname{tr} \{(\Lambda_{1} \Lambda_{2})^{2}\} - \operatorname{tr}^{2} (\Lambda_{1} \Lambda_{2})}{p^{4} (p + 2)}\Big] \{1 + o (1)\}, \\ \operatorname{cov} (A_{n_{1}}, C_{n_{1}, n_{2}}) & = \Big\{\frac{4}{n_{1}} \frac{p \operatorname{tr} (\Lambda_{1}^{3} \Lambda_{2}) - \operatorname{tr} (\Lambda_{1} \Lambda_{2}) \operatorname{tr} (\Lambda_{1}^{2})}{p^{4} (p + 2)}\Big\} \{1 + o (1)\}, \\ \operatorname{cov} (B_{n_{2}}, C_{n_{1}, n_{2}}) & = \Big\{\frac{4}{n_{2}} \frac{p \operatorname{tr} (\Lambda_{2}^{3} \Lambda_{1}) - \operatorname{tr} (\Lambda_{1} \Lambda_{2}) \operatorname{tr} (\Lambda_{2}^{2})}{p^{4} (p + 2)}\Big\} \{1 + o (1)\} . \end{aligned}$ To sum up, we conclude that $\operatorname{var} (T_{H T}) = \sigma_{n}^{2} \{1 + o (1)\}$.
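The projections $V_{X,i} = E\{U(X_i - X_j) \mid X_i\}$ used in this decomposition are population quantities. A natural sample analogue, sketched below, averages the spatial signs of the pairwise differences over $j \neq i$ (the empirical spatial rank). This is an illustration under assumptions: the function name `empirical_spatial_ranks` is hypothetical, $U$ is taken to be the usual spatial sign $x / \|x\|$, and the code is not claimed to be how $T_{HT}$ is computed in the paper.

```python
import numpy as np


def empirical_spatial_ranks(x):
    """Average of U(x_i - x_j) over j != i, for every observation i.

    x : array of shape (n, p). Builds an (n, n, p) array, so it is meant
        for small n, as in the settings considered here.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    diffs = x[:, None, :] - x[None, :, :]                  # x_i - x_j
    norms = np.linalg.norm(diffs, axis=2, keepdims=True)
    # Spatial signs of the differences; the i == j entries stay zero.
    signs = np.divide(diffs, norms, out=np.zeros_like(diffs), where=norms > 0)
    return signs.sum(axis=1) / (n - 1)


rng = np.random.default_rng(2)
sample = rng.standard_t(df=3, size=(12, 50))  # synthetic heavy-tailed data
ranks = empirical_spatial_ranks(sample)
print(ranks.shape, float(np.linalg.norm(ranks, axis=1).max()))
```

Because $U(x_i - x_j) = -U(x_j - x_i)$, the empirical ranks sum to zero over the sample, and each rank has norm at most one. No location estimate is needed, which is the practical appeal of rank-based over sign-based constructions.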

Define a sequence of random variables $\{z_{1}, \dots, z_{n_{1} + n_{2}}\}$ as follows: $z_{i} = V_{X, i}$ for each $i \in \{1, \dots, n_{1}\}$ and $z_{n_{1} + j} = V_{Y, j}$ for each $j \in \{1, \dots, n_{2}\}$. Let $E_{k} (\cdot)$ denote the conditional expectation given $\{z_{1}, \dots, z_{k}\}$. Define $D_{n, k} = p^{- 1} \{E_{k} (T_{H T}) - E_{k - 1} (T_{H T})\}$, then $p^{- 1} \{T_{H T} - E (T_{H T})\} = \sum_{k = 1}^{n_{1} + n_{2}} D_{n, k}$. As a result, the sequence $\{D_{n, 1}, \dots, D_{n, n_{1} + n_{2}}\}$ constitutes a martingale difference sequence with respect to the σ-fields $\sigma (z_{1}, z_{2}, \dots, z_{k})$. To use the martingale central limit theorem, we first need to establish the following results: $\begin{aligned} p^{2} \sum_{k = 1}^{n_{1} + n_{2}} \sigma_{n, k}^{2} / \operatorname{var} (T_{H T}) \xrightarrow{p} 1 \quad \text{and} \quad \sum_{k = 1}^{n_{1} + n_{2}} E (D_{n, k}^{4}) = p^{- 4} o \{\operatorname{var}^{2} (T_{H T})\}, \end{aligned}$ (A1) where $\sigma_{n, k}^{2} \doteq E_{k - 1} (D_{n, k}^{2})$.

Proof of the first part of (A1)
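The two properties claimed for $D_{n,k}$ follow from a telescoping sum and the tower property of conditional expectation. The short derivation below spells this out; it assumes that $E_0(\cdot)$ denotes the unconditional expectation and that $T_{HT}$ is treated through its leading term, which is a function of $z_1, \dots, z_{n_1 + n_2}$, so that $E_{n_1 + n_2}(T_{HT}) = T_{HT}$.

```latex
\begin{aligned}
\sum_{k=1}^{n_1+n_2} D_{n,k}
  &= p^{-1} \sum_{k=1}^{n_1+n_2} \{E_k(T_{HT}) - E_{k-1}(T_{HT})\} \\
  &= p^{-1} \{E_{n_1+n_2}(T_{HT}) - E_0(T_{HT})\} \\
  &= p^{-1} \{T_{HT} - E(T_{HT})\}, \\
E_{k-1}(D_{n,k})
  &= p^{-1} \{E_{k-1} E_k(T_{HT}) - E_{k-1}(T_{HT})\} = 0 .
\end{aligned}
```

The second display is the martingale difference property: each increment has zero conditional mean given the past, which is what allows the martingale central limit theorem to be applied once the two conditions in (A1) are verified.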

As $E (\sum_{k = 1}^{n_{1} + n_{2}} \sigma_{n, k}^{2}) = p^{- 2} \operatorname{var} (T_{H T})$, we only need to show that as $\min \{n_{1}, n_{2}\} \to \infty$, $\operatorname{var} (\sum_{k = 1}^{n_{1} + n_{2}} \sigma_{n, k}^{2}) = o \{p^{- 4} \operatorname{var}^{2} (T_{H T})\}$. Define $\Gamma_{1, k - 1} = \sum_{i = 1}^{k - 1} (z_{i} z_{i}^{T} - B_{1})$ for each $k \in \{1, \dots, n_{1} - 1\}$, and define $\Gamma_{2, n_{1} + l - 1} = \sum_{i = 1}^{l - 1} (z_{n_{1} + i} z_{n_{1} + i}^{T} - B_{2})$ for each $l \in \{1, \dots, n_{2} - 1\}$. For each $k \in \{1, \dots, n_{1}\}$, we have that $\begin{aligned} (E_{k} - E_{k - 1}) (A_{n_{1}}) & = \frac{2}{n_{1} (n_{1} - 1)} \{V_{X, k}^{T} \Gamma_{1, k - 1} V_{X, k} - \operatorname{tr} (\Gamma_{1, k - 1} B_{1})\} \\ & \quad + \frac{2}{n_{1}} \{V_{X, k}^{T} B_{1} V_{X, k} - \operatorname{tr} (B_{1}^{2})\}, \end{aligned}$ $(E_{k} - E_{k - 1}) (B_{n_{2}}) = 0,$ and $(E_{k} - E_{k - 1}) (C_{n_{1}, n_{2}}) = \frac{1}{n_{1}} \{V_{X, k}^{T} B_{2} V_{X, k} - \operatorname{tr} (B_{1} B_{2})\} .$ For each $k \in \{n_{1} + 1, \dots, n_{1} + n_{2}\}$, we have that $(E_{k} - E_{k - 1}) (A_{n_{1}}) = 0,$ $\begin{aligned} (E_{k} - E_{k - 1}) (B_{n_{2}}) & = \frac{2}{n_{2} (n_{2} - 1)} \{V_{Y, k - n_{1}}^{T} \Gamma_{2, k - 1} V_{Y, k - n_{1}} - \operatorname{tr} (\Gamma_{2, k - 1} B_{2})\} \\ & \quad + \frac{2}{n_{2}} \{V_{Y, k - n_{1}}^{T} B_{2} V_{Y, k - n_{1}} - \operatorname{tr} (B_{2}^{2})\}, \end{aligned}$ and $\begin{aligned} (E_{k} - E_{k - 1}) (C_{n_{1}, n_{2}}) = \frac{1}{n_{1} n_{2}} \Big\{V_{Y, k - n_{1}}^{T} \Big(\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T}\Big) V_{Y, k - n_{1}} - \operatorname{tr} \Big(\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} B_{2}\Big)\Big\} . \end{aligned}$

Thus, for each $k \in \{1, \dots, n_{1}\}$, $\begin{aligned} \sigma_{n, k}^{2} & = E_{k - 1} \Big(\Big[\frac{2}{n_{1} (n_{1} - 1)} \{V_{X, k}^{T} \Gamma_{1, k - 1} V_{X, k} - \operatorname{tr} (\Gamma_{1, k - 1} B_{1})\} + \frac{2}{n_{1}} \{V_{X, k}^{T} B_{1} V_{X, k} - \operatorname{tr} (B_{1}^{2})\} \\ & \qquad - \frac{2}{n_{1}} \{V_{X, k}^{T} B_{2} V_{X, k} - \operatorname{tr} (B_{1} B_{2})\}\Big]^{2}\Big) \\ & = \Big(\frac{8}{n_{1}^{2} (n_{1} - 1)^{2}} \frac{p \operatorname{tr} \{(\Gamma_{1, k - 1} \Lambda_{1})^{2}\} - \operatorname{tr}^{2} (\Gamma_{1, k - 1} \Lambda_{1})}{p^{2} (p + 2)} \\ & \qquad + \frac{16}{n_{1}^{2} (n_{1} - 1)} \frac{p \operatorname{tr} (\Gamma_{1, k - 1} \Lambda_{1}^{3}) - \operatorname{tr} (\Gamma_{1, k - 1} \Lambda_{1}) \operatorname{tr} (\Lambda_{1}^{2})}{p^{3} (p + 2)} \\ & \qquad - \frac{16}{n_{1}^{2} (n_{1} - 1)} \frac{p \operatorname{tr} (\Gamma_{1, k - 1} \Lambda_{1} \Lambda_{2} \Lambda_{1}) - \operatorname{tr} (\Gamma_{1, k - 1} \Lambda_{1}) \operatorname{tr} (\Lambda_{1} \Lambda_{2})}{p^{3} (p + 2)} \\ & \qquad + \frac{8}{n_{1}^{2}} \frac{p \operatorname{tr} [\{\Lambda_{1} (\Lambda_{1} - \Lambda_{2})\}^{2}] - \operatorname{tr}^{2} \{\Lambda_{1} (\Lambda_{1} - \Lambda_{2})\}}{p^{4} (p + 2)}\Big) \{1 + o (1)\}, \end{aligned}$ and for each $k \in \{n_{1} + 1, \dots, n_{1} + n_{2}\}$, $\begin{aligned} \sigma_{n, k}^{2} & = E_{k - 1} \Big(\Big[\frac{2}{n_{2} (n_{2} - 1)} \{V_{Y, k - n_{1}}^{T} \Gamma_{2, k - 1} V_{Y, k - n_{1}} - \operatorname{tr} (\Gamma_{2, k - 1} B_{2})\} + \frac{2}{n_{2}} \{V_{Y, k - n_{1}}^{T} B_{2} V_{Y, k - n_{1}} - \operatorname{tr} (B_{2}^{2})\} \\ & \qquad - \frac{2}{n_{1} n_{2}} \Big\{V_{Y, k - n_{1}}^{T} \Big(\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T}\Big) V_{Y, k - n_{1}} - \operatorname{tr} \Big(\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} B_{2}\Big)\Big\}\Big]^{2}\Big) \\ & = \Big[\frac{8}{n_{2}^{2} (n_{2} - 1)^{2}} \frac{p \operatorname{tr} \{(\Gamma_{2, k - 1} \Lambda_{2})^{2}\} - \operatorname{tr}^{2} (\Gamma_{2, k - 1} \Lambda_{2})}{p^{2} (p + 2)} \\ & \qquad + \frac{16}{n_{2}^{2} (n_{2} - 1)} \frac{p \operatorname{tr} (\Gamma_{2, k - 1} \Lambda_{2}^{3}) - \operatorname{tr} (\Gamma_{2, k - 1} \Lambda_{2}) \operatorname{tr} (\Lambda_{2}^{2})}{p^{3} (p + 2)} \\ & \qquad - \frac{16}{n_{1} n_{2}^{2} (n_{2} - 1)} \frac{p \operatorname{tr} \{\Gamma_{2, k - 1} \Lambda_{2} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T}) \Lambda_{2}\} - \operatorname{tr} (\Gamma_{2, k - 1} \Lambda_{2}) \operatorname{tr} \{\Lambda_{2} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T})\}}{p^{2} (p + 2)} \\ & \qquad + \frac{8}{n_{2}^{2}} \frac{p \operatorname{tr} (\Lambda_{2}^{4}) - \operatorname{tr}^{2} (\Lambda_{2}^{2})}{p^{4} (p + 2)} \\ & \qquad - \frac{16}{n_{1} n_{2}^{2}} \frac{p \operatorname{tr} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} \Lambda_{2}^{3}) - \operatorname{tr} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} \Lambda_{2}) \operatorname{tr} (\Lambda_{2}^{2})}{p^{3} (p + 2)} \\ & \qquad + \frac{8}{n_{1}^{2} n_{2}^{2}} \frac{p \operatorname{tr} \{(\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} \Lambda_{2})^{2}\} - \operatorname{tr}^{2} (\Lambda_{2} \sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T})}{p^{2} (p + 2)}\Big] \{1 + o (1)\} . \end{aligned}$

Hence $\sum_{k = 1}^{n_{1} + n_{2}} \sigma_{n, k}^{2} = (R_{1} + R_{2} + R_{3} + R_{4} + R_{5} + R_{6} + R_{7} + C_{0}) \{1 + o (1)\},$ where $C_{0}$ is a constant, and $\begin{aligned} R_{1} & = \sum_{k = 1}^{n_{1}} \frac{8}{n_{1}^{2} (n_{1} - 1)^{2}} \frac{p \operatorname{tr} \{(\Gamma_{1, k - 1} \Lambda_{1})^{2}\} - \operatorname{tr}^{2} (\Gamma_{1, k - 1} \Lambda_{1})}{p^{2} (p + 2)}, \\ R_{2} & = \sum_{l = 1}^{n_{2}} \frac{8}{n_{2}^{2} (n_{2} - 1)^{2}} \frac{p \operatorname{tr} \{(\Gamma_{2, n_{1} + l - 1} \Lambda_{2})^{2}\} - \operatorname{tr}^{2} (\Gamma_{2, n_{1} + l - 1} \Lambda_{2})}{p^{2} (p + 2)}, \\ R_{3} & = \sum_{k = 1}^{n_{1}} \frac{16}{n_{1}^{2} (n_{1} - 1)} \Big[\frac{p \operatorname{tr} \{\Gamma_{1, k - 1} (\Lambda_{1}^{3} - \Lambda_{1} \Lambda_{2} \Lambda_{1})\}}{p^{3} (p + 2)} - \frac{\operatorname{tr} (\Gamma_{1, k - 1} \Lambda_{1}) \operatorname{tr} \{\Lambda_{1} (\Lambda_{1} - \Lambda_{2})\}}{p^{3} (p + 2)}\Big], \\ R_{4} & = \sum_{l = 1}^{n_{2}} \frac{16}{n_{2}^{2} (n_{2} - 1)} \frac{p \operatorname{tr} (\Gamma_{2, n_{1} + l - 1} \Lambda_{2}^{3}) - \operatorname{tr} (\Gamma_{2, n_{1} + l - 1} \Lambda_{2}) \operatorname{tr} (\Lambda_{2}^{2})}{p^{3} (p + 2)}, \\ R_{5} & = - \sum_{l = 1}^{n_{2}} \frac{16}{n_{1} n_{2}^{2} (n_{2} - 1)} \Big[\frac{p \operatorname{tr} \{\Gamma_{2, n_{1} + l - 1} \Lambda_{2} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T}) \Lambda_{2}\}}{p^{2} (p + 2)} - \frac{\operatorname{tr} (\Gamma_{2, n_{1} + l - 1} \Lambda_{2}) \operatorname{tr} \{\Lambda_{2} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T})\}}{p^{2} (p + 2)}\Big], \\ R_{6} & = \sum_{l = 1}^{n_{2}} \frac{8}{n_{1}^{2} n_{2}^{2}} \Big[\frac{p \operatorname{tr} \{(\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} \Lambda_{2})^{2}\}}{p^{2} (p + 2)} - \frac{\operatorname{tr}^{2} (\Lambda_{2} \sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T})}{p^{2} (p + 2)}\Big], \\ R_{7} & = - \sum_{l = 1}^{n_{2}} \frac{16}{n_{1} n_{2}^{2}} \Big[\frac{p \operatorname{tr} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} \Lambda_{2}^{3})}{p^{3} (p + 2)} - \frac{\operatorname{tr} (\sum_{i = 1}^{n_{1}} V_{X, i} V_{X, i}^{T} \Lambda_{2}) \operatorname{tr} (\Lambda_{2}^{2})}{p^{3} (p + 2)}\Big] . \end{aligned}$

Moreover, to calculate the order of $\operatorname{var} (R_{1})$, we need to evaluate $\operatorname{var} [\sum_{k = 1}^{n_{1}} p^{- 2} \operatorname{tr} \{(\Gamma_{1, k - 1} \Lambda_{1})^{2}\}]$. Since $\begin{aligned} & E \Big(\Big[p^{- 2} \sum_{k = 1}^{n_{1}} \sum_{i = 1}^{k - 1} \sum_{j = 1}^{k - 1} \{(V_{X, i}^{T} \Lambda_{1} V_{X, j})^{2} - \operatorname{tr} \{(\Lambda_{1} B_{1})^{2}\}\}\Big]^{2}\Big) \\ & \quad = p^{- 4} \sum_{k = 1}^{n_{1}} \sum_{m = 1}^{n_{1}} E \Big[\sum_{i = 1}^{k - 1} \sum_{j = 1}^{k - 1} \{(V_{X, i}^{T} \Lambda_{1} V_{X, j})^{2} - \operatorname{tr} \{(\Lambda_{1} B_{1})^{2}\}\} \\ & \qquad \times \sum_{l = 1}^{m - 1} \sum_{h = 1}^{m - 1} \{(V_{X, l}^{T} \Lambda_{1} V_{X, h})^{2} - \operatorname{tr} \{(\Lambda_{1} B_{1})^{2}\}\}\Big] \\ & \quad \leq n_{1} \sum_{k = 1}^{n_{1}} (k - 1)^{2} p^{- 4} E \{(V_{X, i}^{T} \Lambda_{1} V_{X, j})^{4} - \operatorname{tr}^{2} \{(\Lambda_{1} B_{1})^{2}\}\} \\ & \qquad + n_{1}^{2} \sum_{k = 1}^{n_{1}} (k - 1)^{2} p^{- 4} E \{(V_{X, i}^{T} \Lambda_{1} V_{X, j})^{2} (V_{X, i}^{T} \Lambda_{1} V_{X, l})^{2} - \operatorname{tr}^{2} \{(\Lambda_{1} B_{1})^{2}\}\} \\ & \quad = C_{1} n_{1}^{4} p^{- 4} p^{- 4} \{\operatorname{tr}^{2} (\Lambda_{1}^{4}) + \operatorname{tr} (\Lambda_{1}^{8})\} + C_{2} n_{1}^{5} p^{- 4} p^{- 4} \{\operatorname{tr} (\Lambda_{1}^{8}) - p^{- 1} \operatorname{tr}^{2} (\Lambda_{1}^{4})\}, \end{aligned}$ where $C_{1}$ and $C_{2}$ are constants, we have that $\begin{aligned} \operatorname{var} (R_{1}) \leq C_{1} n_{1}^{- 4} p^{- 8} \operatorname{tr}^{2} (\Lambda_{1}^{2}) \operatorname{tr} (\Lambda_{1}^{4}) + C_{2} n_{1}^{- 3} p^{- 8} \operatorname{tr} (\Lambda_{1}^{4}) \{\operatorname{tr} (\Lambda_{1}^{4}) - p^{- 1} \operatorname{tr}^{2} (\Lambda_{1}^{2})\} . \end{aligned}$ Based on the fact that $\operatorname{tr} (\Lambda_{1}^{4}) / \operatorname{tr}^{2} (\Lambda_{1}^{2}) \to 0$ and the inequality $\operatorname{var}^{2} (T_{H T}) \geq K \Big\{\frac{\operatorname{tr}^{4} (\Lambda_{1}^{2})}{p^{4} n_{1}^{4}} + \frac{\operatorname{tr}^{4} (\Lambda_{2}^{2})}{p^{4} n_{2}^{4}}\Big\}$ for some constant $K$, we conclude that $p^{4} \operatorname{var} (R_{1}) / \operatorname{var}^{2} (T_{H T}) \to 0,$ which indicates that $\operatorname{var} (R_{1}) = o \{p^{- 4} \operatorname{var}^{2} (T_{H T})\}$. By using similar techniques, we conclude that $\operatorname{var} (R_{l}) = o \{p^{- 4} \operatorname{var}^{2} (T_{H T})\}$ for each $l \in \{1, \dots, 7\}$, based on which we finally conclude that $\operatorname{var} (\sum_{k = 1}^{n_{1} + n_{2}} \sigma_{n, k}^{2}) = o \{p^{- 4} \operatorname{var}^{2} (T_{H T})\}$.

Proof of the second part of (A1)

For $1 \leq k \leq n_{1}$, $\begin{aligned} \sum_{k = 1}^{n_{1}} E (D_{n, k}^{4}) & = \sum_{k = 1}^{n_{1}} E \Big(\Big[\frac{2}{n_{1} (n_{1} - 1)} \{V_{X, k}^{T} \Gamma_{1, k - 1} V_{X, k} - \operatorname{tr} (\Gamma_{1, k - 1} B_{1})\} + \frac{2}{n_{1}} \{V_{X, k}^{T} B_{1} V_{X, k} - \operatorname{tr} (B_{1}^{2})\} \\ & \qquad - \frac{2}{n_{1}} \{V_{X, k}^{T} B_{2} V_{X, k} - \operatorname{tr} (B_{1} B_{2})\}\Big]^{4}\Big) \\ & \leq c_{1} \Big\{n_{1}^{- 3} p^{- 8} \operatorname{tr} [\{\Lambda_{1} (\Lambda_{1} - \Lambda_{2})\}^{2}] \Big(\operatorname{tr} [\{\Lambda_{1} (\Lambda_{1} - \Lambda_{2})\}^{2}] - p^{- 1} \operatorname{tr}^{2} \{\Lambda_{1} (\Lambda_{1} - \Lambda_{2})\}\Big) \\ & \qquad + n_{1}^{- 5} p^{- 8} \operatorname{tr}^{4} (\Lambda_{1}^{2})\Big\}, \end{aligned}$ where $c_{1}$ is some constant. Then $p^{4} \sum_{k = 1}^{n_{1}} E (D_{n, k}^{4}) / \operatorname{var}^{2} (T_{H T}) \to 0.$ Similarly, for $n_{1} + 1 \leq k \leq n_{1} + n_{2}$, $\begin{aligned} \sum_{k = n_{1} + 1}^{n_{1} + n_{2}} E (D_{n, k}^{4}) & \leq c_{2} \Big(n_{1}^{- 1} n_{2}^{- 4} p^{- 8} \operatorname{tr}^{2} (\Lambda_{1} \Lambda_{2}) \operatorname{tr} [\{\Lambda_{2} (\Lambda_{1} - \Lambda_{2})\}^{2}] \\ & \qquad \times \Big[\operatorname{tr} [\{\Lambda_{2} (\Lambda_{1} - \Lambda_{2})\}^{2}] - p^{- 1} \operatorname{tr}^{2} \{\Lambda_{2} (\Lambda_{1} - \Lambda_{2})\}\Big] \\ & \qquad + n_{2}^{- 5} p^{- 8} \operatorname{tr}^{4} (\Lambda_{2}^{2}) + n_{1}^{- 2} n_{2}^{- 4} p^{- 8} \operatorname{tr}^{4} (\Lambda_{1} \Lambda_{2})\Big), \end{aligned}$ where $c_{2}$ is some constant. Then we have that as $n_{1}, n_{2} \to \infty$, $\frac{p^{4} \sum_{k = 1}^{n_{1} + n_{2}} E (D_{n, k}^{4})}{\operatorname{var}^{2} (T_{H T})} \to 0.$ By the martingale central limit theorem (Hall & Hyde, 1980), we finally conclude that $\frac{T_{H T} - E (T_{H T})}{\sqrt{\operatorname{var} (T_{H T})}} \xrightarrow{d} N (0, 1) .$

Proof of Proposition 2.1:

Using the same techniques as in the proof of Theorem 2.1, we have that $E (A_{1}) = \operatorname{tr}^{2} (§_{1})$ and $\begin{aligned} \operatorname{Var} \Big\{\frac{p^{2} A_{1}}{\operatorname{tr} (\Lambda_{1}^{2})}\Big\} = O \Big(\frac{p^{4}}{\operatorname{tr}^{2} (\Lambda_{1}^{2})} \Big[\frac{2}{n_{1} (n_{1} - 1)} \Big\{\frac{3 \operatorname{tr}^{2} (\Lambda_{1}^{2}) + 6 \operatorname{tr} (\Lambda_{1}^{4})}{p^{2} (p + 2)^{2}} - \frac{\operatorname{tr}^{2} (\Lambda_{1}^{2})}{p^{4}}\Big\}\Big]\Big) . \end{aligned}$ Therefore, $\frac{p^{2} A_{1}}{\operatorname{tr} (\Lambda_{1}^{2})} \to 1$, and similarly, $\frac{p^{2} A_{2}}{\operatorname{tr} (\Lambda_{2}^{2})} \to 1$, hence $\frac{{\hat{\sigma}}_{0, n}^{2}}{\sigma_{0, n}^{2}} \to 1$.

