820
Views
0
CrossRef citations to date
0
Altmetric
Articles

High-dimensional proportionality test of two covariance matrices and its application to gene expression data

, &
Pages 161-174 | Received 04 Nov 2020, Accepted 09 Sep 2021, Published online: 06 Oct 2021

Abstract

With the development of modern science and technology, more and more high-dimensional data appear in the application fields. Since the high dimension can potentially increase the complexity of the covariance structure, comparing the covariance matrices among populations is strongly motivated in high-dimensional data analysis. In this article, we consider the proportionality test of two high-dimensional covariance matrices, where the data dimension is potentially much larger than the sample sizes, or even larger than the squares of the sample sizes. We devise a novel high-dimensional spatial rank test that has much-improved power than many existing popular tests, especially for the data generated from some heavy-tailed distributions. The asymptotic normality of the proposed test statistics is established under the family of elliptically symmetric distributions, which is a more general distribution family than the normal distribution family, including numerous commonly used heavy-tailed distributions. Extensive numerical experiments demonstrate the superiority of the proposed test in terms of both empirical size and power. Then, a real data analysis demonstrates the practicability of the proposed test for high-dimensional gene expression data.

1. Introduction

High-dimensional data are nowadays more and more common in bioinformatics, material science, astronomy and other application fields, as data collection technology rapidly evolves (Bühlmann & van de Geer, Citation2011). However, due to limited resources available to replicate observations, the sample sizes are usually much smaller than the dimension, which makes most traditional statistical approaches no longer appropriate. Under such an embarrassing background, scientists in many application fields urgently need powerful approaches to gather the greatest scientific insight from data. Testing equality of the distributions of two populations is a crucial problem in high-dimensional statistics, which is extremely complex and far more challenging than that for fixed-dimensional data. Due to this extreme complexity, it is usually replaced by a simpler problem, i.e. testing equality of some numerical characteristics, such as means and covariances, of the two populations, which is very useful but much easier to implement.

There is already a large number of literature on detecting the difference between the means of two high-dimensional populations, such as Bai and Saranadasa (Citation1996), Chen and Qinm (Citation2010), and Feng et al. (Citation2016), to name just a few. In contrast, there are much fewer studies on high-dimensional covariance matrix test of two high-dimensional populations. Hence, in this article, we focus on comparing the covariance matrices among two populations, which is strongly motivated for high-dimensional data, as high data dimensions can potentially increase the complexity of the covariance structure (Li & Chen, Citation2012). In particular, we consider the testing problem of the proportionality of two high-dimensional covariance matrices, which investigates the simplest heteroscedasticity of the population covariance matrices (Xu et al., Citation2014). It is often a preparation procedure before the case–control analysis of genomic data. Let X and Y be two p-dimensional populations with the mean vectors μ1, μ2 and the covariance matrices Σ1, Σ2, respectively. The proportionality test of two population covariance matrices is formulated as follows: (1) H0:Σ1=cΣ2versusH1:Σ1cΣ2,(1) where c is an unknown scalar.

The proportionality testing problem in (Equation1) has been widely studied in various areas, such as in discriminant analysis and principal component analysis (Flury & Riedwyl, Citation1988; Schott, Citation1991), and there is a lot of early literature on its methodological researches, such as Eriksen (Citation1987), Federer (Citation1951), Flury (Citation1986), Kim (Citation1971), Rao (Citation1983), and Schott (Citation1999). For example, the most traditional test statistic is (n1+n2)j=1plog(λˆj)n1log(|Σˆ1|)+n2(plog(cˆlog(|Σˆ2|)))dχ(p2+p2)/22,where (λˆj,cˆ) are obtained by an iterative algorithm proposed in Flury (Citation1986) and Σˆ1,Σˆ2 are the corresponding sample covariance matrices, respectively. These researches are constructed based on the classical limit theorems, assuming that the sample sizes tend to infinity and the dimension is fixed, hence have difficulties to analyse the high-dimensional data, where the dimension is much larger than the sample sizes. To alleviate such difficulties, Xu et al. (Citation2014) proposed to use a pseudo-likelihood ratio test by extending the traditional likelihood ratio test with the statistic plog(p1tr(Σˆ1Σˆ21))log|Σˆ1Σˆ21|, which allows the dimension to increase proportionally with each sample size; furthermore, Liu et al. (Citation2014) proposed an improved method, which allows the dimension to be larger than one of the sample sizes. In addition, for the special case of c1 in (Equation1), Li and Chen (Citation2012) proposed a test statistic TLC=An1+An22Cn1,n2,where Ani=1ni(ni1)kl(XikXil)22ni(ni1)(ni2)j,k,l not equalXikXijXijXil+1ni(ni1)(ni2)(ni3)j,k,l,t not equalXikXijXitXil,Cn1,n2=1n1n2k=1n1l=1n2(X1kX2l)21n1n2(n11)kln1j=1n2X1kX2jX2jX1l1n1n2(n21)kln2j=1n1X2kX1jX1jX2l+1n1n2(n11)(n21)×ktn1jln2X1kX2jX1tX2l.As mentioned in Li and Chen (Citation2012), TLC is an unbiased estimation of tr{(Σ1Σ2)2}. Despite some progress, there are also drawbacks: first, these methods may have extremely poor performance for heavy-tailed distributions; second, the sample covariance matrices, which need to be inverted in the construction of the test statistic, are singular when the dimension is larger than both of the sample sizes.

To overcome these two drawbacks, more attention has been paid to nonparametric testing methods based on the multivariate sign or rank. Just recently, for testing the proportionality of two high-dimensional covariance matrices, Cheng et al. (Citation2018) proposed to use a test procedure based on the multivariate sign and demonstrated its good performance in high-dimensional data analysis, especially for the heavy-tailed distributions. Recall that for fixed-dimensional data, the multivariate sign and rank are widely used to construct robust tests (Oja, Citation2010). However, most of these tests cannot be effective for high-dimensional data. Therefore, many researches extend the traditional multivariate sign- or rank-based testing methods to the high-dimension data, such as Feng and Sun (Citation2016) and Wang et al. (Citation2015) for one-sample problems; Feng et al. (Citation2016) for two-sample problems; Feng and Liu (Citation2017) and Zou et al. (Citation2014) for sphericity testing problems. These researches clearly demonstrate the advantages of the high-dimensional multivariate sign- or rank-based methods in high-dimensional and heavy-tailed cases.

Unfortunately, due to the bias caused by estimating the location parameters, the test procedure based on the multivariate sign can only allow the dimension to be the squares of the sample sizes at most (Cheng et al., Citation2018), which makes the test procedure too restrictive for various practical applications, hence greatly affects the validity of the test procedure. For example, in genomic data analysis, genomic data typically carry thousands of dimensions for measurements on the genome, where the dimension can be much larger than the squares of the sample sizes. Therefore, it is very urgent to develop a new method to deal with the proportionality testing problem in (Equation1) for the high-dimensional data, where the dimension is much higher than the squares of the sample sizes. This is the motivation and intention of this article.

The rest of the article is organized as follows. In Section 2, we introduce the proposed high-dimensional spatial rank test and establish its asymptotic normality under the elliptically symmetric populations. Then, we demonstrate the numerical performance of the proposed test in Sections 3, followed by a real data analysis in Section 4. Finally, we conclude this article in Section 5 and relegate the technical proofs to Appendix.

2. Method

2.1. The proposed test

A p-dimensional random vector Z is said to follow an elliptically symmetric distribution, denoted by Ep(μ,Λ,Fξ), if it has the following stochastic representation: Z=μ+ξAU,where μ is the p-dimensional mean vector, ξ is a non-negative random variable, Fξ is the cumulative distribution function of ξ, U is independent of ξ and is uniformly distributed on the unit sphere Rp and A is a deterministic p×p-dimensional matrix satisfying AAT=Λ with tr(Λ)=1. It is known that the covariance matrix Σ and shape matrix Λ of the elliptical symmetric population Z will satisfy the equation Σ=p1E(ξ2)Λ.

Let X1,,Xn1 and Y1,,Yn2 denote the samples of two p-dimensional random vectors X and Y, which are generated from the two independent elliptically symmetric populations Ep(μ1,Λ1,Fξ1) and Ep(μ2,Λ2,Fξ2), respectively. From Section 3.1 in Magyar and Tyler (Citation2014), it is known that Σi and §i have the same eigenvectors for each i{1,2} under the assumption of elliptically symmetric distribution. Also, from Equation 3.9 in Magyar and Tyler (Citation2014), it is known that when the eigenvalues of the covariance matrices Σ1 and Σ2 are proportional, the spatial sign covariance matrices §1 and §2 have the same eigenvalues. Theorem 1 in Cheng et al. (Citation2018) showed that when §1 and §2 have the same eigenvalues, the eigenvalues of Σ1 and Σ2 are proportional. Hence, the hypotheses in (Equation1) are equivalent to the following hypotheses: (2) H0:§1=§2versusH1:§1§2,(2) where §1=E{U(Xμ1)U(Xμ1)T}, §2=E{U(Yμ2)U(Yμ2)T} are the spatial sign covariance matrices of X, Y, respectively, and U(z)=zzI(z0) for each zRp is the spatial sign function with denoting the L2-norm and I() denoting the indicator function. On this ground, Cheng et al. (Citation2018) suggested to use a test statistics based on the square Frobenius norm of §1§2, i.e. tr{(§1§2)2}.

The proposed spatial rank test in this article is also based on the square Frobenius norm of §1§2, which is a high-dimensional extension of Kendall's tau test for the hypotheses in (Equation2) (Oja, Citation2010). Specifically, the test statistic is (3) THT=pn1(n11)(n12)(n13)×{U(XiXj)TU(XkXl)}2+pn2(n21)(n22)(n23)×{U(YiYj)TU(YkYl)}22pn1(n11)n2(n21)×i=1n1jin1k=1n2lkn2{U(XiXj)TU(YkYl)}2,(3) where denotes summation over distinct indexes {i,j,k,l}{1,,n1} or {1,,n2}. Note that recently many developed versions of Kendall's tau test are frequently used on many related issues (Barber & Kolar, Citation2018; Cai & Zhang, Citation2016; Han et al., Citation2017; Leung & Drton, Citation2018).

In deriving the asymptotic properties of THT, we impose the following two conditions used in Cheng et al. (Citation2018):

  1. n1/(n1+n2)κ(0,1) as min{n1,n2};

  2. tr(ΛiΛjΛkΛl)=o{tr(ΛiΛj)tr(ΛkΛl)} for i,j,k,l{1,2}.

Note that: (1) Condition (C1) is a commonly used condition in high-dimensional two sample testing problems; (2) Condition (C2) is similar to Condition (A2) in Li and Chen (Citation2012); (3) If all the eigenvalues of Σ1 and Σ2 are bounded, Condition (C2) holds.

Remark 2.1

Note that the above Conditions (C1) and (C2) do not contain any restriction on p and n1, n2, since such restriction is not needed to control the following terms: i,j,k,l{U(XiXj)TU(XkXl)}2{U(XiXj)TU(XkXl)}2,i,j,k,l{U(YiYj)TU(YkYl)}2{U(YiYj)TU(YkYl)}2,i=1n1j=1n1k=1n2l=1n2{U(XiXj)TU(YkYl)}2i=1n1jin1k=1n2lkn2{U(XiXj)TU(YkYl)}2,which have been removed from THT. That is to say, we remove all the items that include at least one pair of identical vectors, such as {U(XiXj)TU(XiXj)}2, {U(XiXj)TU(XiXl)}2 and so on. Such type of strategy was previously used in Chen and Qinm (Citation2010). By removing the terms iXiTXi and kYkTYk from the test statistic proposed by Chen and Qinm (Citation2010), no restriction on p, n1 and n2 is needed.

Under the above two conditions, the limiting null distribution of THT is given in the following theorem.

Theorem 2.1

Under Conditions (C1), (C2) and H0, as n1, n2, p, σ0,n1THTdN(0,1),where σ0,n2=4(n11+n21)2(p+2)2tr2(Λ2) with Λ=Λ1=Λ2.

Moreover, we obtain the limiting distribution of THT under H1.

Theorem 2.2

Under Conditions (C1), (C2) and H1, as n1, n2, p, σn1[THTptr{(§1§2)2}]dN(0,1),where σn2=4n1(n11)tr2(Λ12)(p+2)2+8n1ptr(Λ14)tr2(Λ12)p2(p+2)×4n2(n21)tr2(Λ22)(p+2)2+8n2ptr(Λ24)tr2(Λ22)p2(p+2)+8n1n2tr2(Λ1Λ2)(p+2)2+(8n1+8n2)×ptr(Λ1Λ2)2tr2(Λ1Λ2)p2(p+2)16n1ptr(Λ13Λ2)tr(Λ1Λ2)tr(Λ12)p2(p+2)16n2ptr(Λ23Λ1)tr(Λ1Λ2)tr(Λ22)p2(p+2).

Due to the fact that ptr(§l2)=p1tr(Λl2){1+o(1)} for l = 1, 2 obtained by Cheng et al. (Citation2018), we propose to use the following estimator of σ0,n2: σˆ0,n2=4(n11+n21)2(n1+n2)1(p+2)2×p2(n1A1+n2A2),where A1=1n1(n11)(n12)(n13)×{U(XiXj)TU(XkXl)}2,A2=1n2(n21)(n22)(n23)×{U(YiYj)TU(YkYl)}2.As presented by the following proposition, σˆ0,n2 is a consistent estimator of σ0,n2 under H0.

Proposition 2.1

Under Conditions (C1), (C2) and H0, σˆ0,n2/σ0,n21.

Therefore, the proposed test with a nominal α level of significance rejects H0 if THTzασˆ0,n, where zα is the upper α-quantile of N(0,1). The asymptotic power function of THT is βn1,n2(§1,§2,α)=Φ{σn1σ0,nzα+pσn1tr(§1§2)2},where Φ() denotes the cumulative probability function of N(0,1).

2.2. Relationship with the test proposed in Cheng et al. (Citation2018)

The proposed spatial rank test seems to be more complex than the existing ones, such as the spatial sign test proposed by Cheng et al. (Citation2018). This is a price that we have to pay for making the proposed method powerful in testing the high-dimensional data, where the data dimension is potentially much larger than the squares of the sample sizes, especially for the data generated from heavy-tailed distributions. Below we will explain the motivation of the proposed method in detail.

First, we recall Lemma B.1 in Han and Liu (Citation2018).

Lemma 2.3

Let X, X~Ep(μ,Λ,Fξ), where X and X~ are independent, then E{U(XX~)U(XX~)T}=E{U(Xμ)U(Xμ)}.

By Lemma 2.3, we have that E{U(XiXj)U(XiXj)T}=E{U(Xiμ1)U(Xiμ1)T}=§1,for each i,j{1,,n1} with ij, where E{U(XiXj)U(XiXj)T} is the so-called population multivariate Kendall's tau matrix of X (Oja, Citation2010). Similarly, E{U(YiYj)U(YiYj)T}=E{U(Yiμ2)U(Yiμ2)T}=§2,for each i,j{1,,n2} with ij, where E{U(YiYj)U(YiYj)T} is the population multivariate Kendall's tau matrix of Y. Lemma 2.3 suggests that for each of the two populations, the population multivariate Kendall's tau matrix is the same as the spatial sign covariance matrix. As a result, testing equality of the two spatial sign covariance matrices is identical to testing equality of the two population multivariate Kendall's tau matrices.

Moreover, it can be seen that the three components of the Frobenius norm of the difference between §1 and §2, tr{(§1§2)2}=tr(§12)+tr(§22)2tr(§1§2), have the following equivalent representations: tr(§12)=E[{U(XiXj)TU(XkXl)}2],for each i,j,k,l{1,,n1}, where i, j, k, l are not equal to each other; tr(§22)=E[{U(YiYj)TU(YkYl)}2],for each i,j,k,l{1,,n2}, where i, j, k, l are not equal to each other; tr(§1§2)=E[{U(XiXj)TU(YkYl)}2],for each i,j{1,,n1} with ij and each k,l{1,,n2} with kl. These representations finally enlighten us to construct THT as that in the above subsection, which is actually a consistent estimator of ptr{(§1§2)2}.

Unlike the spatial sign covariance matrix, to estimate the multivariate Kendall's tau matrix, it is not necessary to estimate the spatial medians, whose estimators may bring a bias hence strengthens the condition imposed on the dimension p. That is the reason why we propose to use a new test procedure based on the multivariate Kendall's tau matrix rather than the spatial sign covariance matrix. Therefore, the condition imposed on the dimension p can be released to some extent, which makes the proposed test procedure powerful in high-dimensional data, even with the dimension much larger than the sample sizes.

In fact, in the spatial sign test proposed by Cheng et al. (Citation2018), to test the equality of the two spatial sign covariance matrices §1 and §2, the test statistic is TSS=pn1(n11)ijn1(uˆiTuˆj)2+pn2(n21)×ijn2(vˆiTvˆj)22pn1n2i=1n1j=1n2(uˆiTvˆj)2,where uˆi=U(Xiμˆ1) and vˆj=U(Yjμˆ2) for i=1,,n1, j=1,,n2. Here, μˆ1 and μˆ2 are the spatial median estimators of X and Y, respectively, obtained by using the estimation method proposed in Mottonen and Oja (Citation1995). TSS is an estimator of ptr{(§1§2)2}, but unfortunately E(TSS/p)tr{(§1§2)2}=δn1,n20, due to the spatial median estimators μˆ1 and μˆ2 (see Lemma 2 in Cheng et al., Citation2018). To obtain a consistent estimator of the bias δn1,n2, the condition p=O{(n1+n2)2} was imposed in Cheng et al. (Citation2018), which limits the application of TSS for the high-dimensional data where the dimension is much larger than the squares of sample sizes.

3. Simulation study

In this section, we will present some numerical results to demonstrate the performance of the proposed test (abbreviated as HT) in high-dimensional cases, in comparison with two existing popular tests, the test proposed by Li and Chen (Citation2012) (abbreviated as LZ) and the spatial sign test proposed by Cheng et al. (Citation2018) (abbreviated as SS). The following three scenarios are considered.

  1. Multivariate normal distribution: XNp(0,Σ1) and YNp(0,Σ2).

  2. Multivariate t-distribution: Xtp(0,Σ1,3) and Ytp(0,Σ2,3).

  3. Multivariate mixture normal distribution: XMNp,γ,9(0,Σ1)γNp(0,Σ1)+(1γ)Np(0,9Σ1), YMNq,γ,9(0,Σ2), γ=0.8.

For all the above scenarios, let Σ1=(0.3|ij|) and Σ2=(ρ|ij|) with ρ=0.3, 0.6, 0.7. Then, ρ=0.3 corresponds to the situation where the null hypothesis is true, while ρ=0.6 or 0.7 corresponds to the situation where the alternative hypothesis is true. Note that all the following simulation results are obtained based on 1000 replications.

First, to observe the influence of the dimension p to the potential bias of the methods involved, we summarize the results of the mean-standard deviation-ratio E(T)/var(T) and the variance estimator ratio var(T)ˆ/var(T) under the null hypothesis in Table  for each T{THT,TSS,TLZ} with n1=n2=15 and p = 100, 200, 400, 800, 1200, where TLZ is the test statistic proposed in Li and Chen (Citation2012). Since the exact value of E(T) and var(T) are difficult to calculate, we replace them with their Monte-Carlo estimators respectively, using 1000 repeated samplings.

Table 1. Comparison of the mean-standard deviation-ratio and the variance estimator ratio at the 5% level with n1=n2=15 and p = 100, 200, 400, 800, 1200.

Table  indicates that SS has worse mean-standard deviation-ratio results than the other two methods in high-dimensional situations, particularly when p>(n1+n2)2. This is most likely due to the fact that in TSS the bias correction process is limited by the condition that p=O{(n1+n2)2}. On the other hand, suggested by the variance estimator ratio results of Table , the estimated variances of LZ are eventually larger than the real ones, particularly in non-normal situations. In contrast, HT has better performance in these two aspects.

Then, we will compare the performance of the three methods in empirical size and empirical power. Let n1=n2=15, 20, 30 and p = 100, 200, 400, 800, 1200. Tables  summarize the empirical size and power results of the three methods. First, the empirical size results in Tables , corresponding to the setting of ρ=0.3, suggest that LZ fails to control the empirical size in the non-normal cases. Moreover, when comparing HT with SS, we find that their performance is very similar, except in the cases where the dimension is comparable to or larger than the squares of the sample sizes, i.e. 1200>(15+15)2. In such cases, SS may lose control of the empirical size, which is consistent with the conclusion made by analysing Table . In the above results about the empirical size, in a few cases, the empirical size is slightly larger than 5%, but still within a reasonable range. To comprehensively compare the empirical size and power of the three tests, in Figure , we present the receiver operating characteristic curves (ROCs) for the three tests with (n1,n2,p)=(15,15,800). Suggested by Figure , these tests have similar performance under the multivariate normal distributions, while under the remaining heavy-tailed distributions, the area under ROC (AUC) of the proposed HT test is larger than the AUCs of its competitors. This further demonstrates the advantages of the proposed test.

Figure 1. ROC curves of the involved tests under the three scenarios with (n1,n2,p)=(15,15,800).

Figure 1. ROC curves of the involved tests under the three scenarios with (n1,n2,p)=(15,15,800).

Table 2. Empirical size and power comparison at the 5% level with n1=n2=15 and p = 100, 200, 400, 800, 1200.

Table 3. Empirical size and power comparison at the 5% level with n1=n2=20 and p = 100, 200, 400, 800, 1200.

Table 4. Empirical size and power comparison at the 5% level with n1=n2=30 and p = 100, 200, 400, 800, 1200.

Next, we consider an alternative structure of the covariance matrices, i.e. Σi=(aikl) for each i{1,2}, where aikk=1,aik,k+1=ρi+ρi21+2ρi2,ak,k+2=ρi1+2ρi2for each k{1,,p} and the remaining entries of Σi are all zeros. Note that Σi is the corresponding covariance matrix of xi following the MA(2) model: xit=zit+ρizi,t1+ρizi,t2,where zit's are i.i.d. random variables with mean zero and variance 11+2ρi2. Under the null hypothesis, we set ρ1=ρ2=0.7, while under the alternative hypothesis, we set ρ1=0.7 and ρ2=0.1 for instance. The other settings are all the same as the above. Tables  and  report the empirical sizes and power of these three methods, respectively. Although Table  suggests that the performance of empirical power of the three methods is similar, Table  suggests that the abilities of LZ and SS to control the empirical size are weakening much more quickly than HT with the increase of p for fixed n1 and n2, especially when the dimension is comparable to or larger than the squares of the sample sizes.

Table 5. Empirical size comparison at the 5% level with the MA(2) covariance matrices with n1=n2=15, 20, 30 and p = 100, 200, 400, 800, 1200.

Table 6. Empirical power comparison at the 5% level with the MA(2) covariance matrices with n1=n2=15, 20, 30 and p = 100, 200, 400, 800, 1200.

Overall, the comprehensive numerical results suggest that the proposed HT test has obvious advantages in terms of controlling empirical size over the existing two methods. Such gain is especially clear when the original distribution deviates from normality, and when the dimension is larger than the squares of sample sizes.

4. Application

In this section, we apply the proposed testing method to a gene dataset, which contains the expression of the 2000 genes with the highest minimal intensity across the 62 tissues. Each entry in the dataset is a gene intensity derived using the filtering process proposed in Alon et al. (Citation1999). The dataset was previously studied by Alon et al. (Citation1999), and now can be freely downloaded at the following website: http://genomics-pubs.princeton.edu/oncology/affydata/index.html.

Among the 62 tissues, there are 22 normal tissues and 40 tumour colon tissues. We aim to test the hypothesis that the tissues in the tumour group and those in the normal group have the proportional covariance matrices in terms of the expression levels of the 2000 genes, where the dimension 2000 is larger than the squares of the sample sizes, 484 and 1600.

First, the normal distribution was tested for the expression data of each gene, using the Shapiro–Wilk test. The top two panels of Figure  present the histograms of the p-values of the normality tests for the tumour group and the normal group, respectively, which indicate that for a large number of genes the expression data are non-normal. In fact, under the significance level of 0.05, the overall rejection rates of all the normality tests are 93.55% and 37.75% for the tumour group and the normal group, respectively. This motivates us to use a nonparametric approach for testing the above hypothesis, which can deal with the high-dimensional data from non-normal distributions.

The bottom two panels of Figure  indicate that there exist some genes with very high values of sample mean in terms of expression. We see that the sample means vary largely for each of the two groups and recall that the dimension is larger than the squares of the sample sizes, which raises a concern that using a spatial sign-based approach may lead to an uncontrollable bias. Hence, in theory, a spatial rank-based approach is more appropriate for this dataset.

Figure 2. Histograms of the p-values of the normality tests and the gene expression means, for the tumour group and the normal group, respectively.

Figure 2. Histograms of the p-values of the normality tests and the gene expression means, for the tumour group and the normal group, respectively.

Based on the above reasons, we apply the proposed HT test to this dataset. The test statistic and p-value of the HT test are 4.823 and 0.000, respectively, hence the null hypothesis is rejected, which suggests that the covariance matrix of the gene expression levels of the tumour group is significantly not proportional to that of the normal group. This result can also be intuitively verified by comparing the sample correlation matrices of the two groups. As a convenience and for demonstration purposes, in , we only plot the heatmaps of the sample correlation matrices of the two groups as well as the difference of the two matrices using the first 100 genes in the original data. The heatmaps demonstrate that there are some intuitive differences between the two sample correlation matrices, which tends to support our result of rejecting the null hypothesis.

Figure 3. Heatmaps of the sample correlation matrices of the two groups as well as the difference of the two matrices, which are constructed via the first 100 genes in the original data. (a) Normal group, (b) tumour group and (c) difference of two groups.

Figure 3. Heatmaps of the sample correlation matrices of the two groups as well as the difference of the two matrices, which are constructed via the first 100 genes in the original data. (a) Normal group, (b) tumour group and (c) difference of two groups.

5. Conclusion

We have proposed the HT test, a new high-dimensional spatial rank test, for the proportionality testing problem of two high-dimensional covariance matrices, which is a high-dimensional extension of Kendall's tau test. It inherits the robustness advantage of the traditional spatial rank-based methods, and also has strong potential in dealing with the high-dimensional data, where the dimension can be potentially much larger than the squares of the sample sizes. We establish the asymptotic distributions of the proposed method rigorously. In comparison with some existing test procedures, the gain in empirical power and empirical size of HT is especially clear in high-dimensional and heavy-tailed data, shown by many numerical evidence. The real data analysis shows the applicability and pertinence of the proposed method to high-dimensional gene expression data.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [Grant Numbers 11501092, 11571068] and the Special Fund for Key Laboratories of Jilin Province, China [Grant Number 20190201285JC].

References

Appendix

Define UX,i=U(XiθX),UY,i=U(YiθY).Before proving the main theorem, below we recall some necessary lemmas.

Lemma A.1

Under Conditions (C1) and (C2), for any p×p symmetric matrix W, E{(UX,iTUX,j)4}=O(1)E2{UX,iTUX,j)2},E{(UX,iTWUX,i)2}=O(1)E2(UX,iTWUX,i),E{(UX,iTWUX,j)2}=O(1)E2(UX,iTWUX,j).

Note that Lemma A.1 is the same as Lemma 1 of Wang et al. (Citation2015).

Lemma A.2

Let U=(U1,,Up)T be a random vector uniformly distributed on the unit sphere of Rp, then we have that

  1. E(U)=0, Cov(U)=p1Ip, E(Uk4)=3p1(p+2)1 and E(Uk2Ul2)=p1(p+2)1 for each k,l{1,,p} with kl;

  2. for any p×p symmetric matrix W, E{(UTWU)2}=p1(p+2)1{tr2(W)+2tr(W2)} and E{(UTWU)4}=p2(p+2)2{3tr2(W2)+6tr(W2)}.

In Lemma A.2, the first statement has been proved in Section 3.1 of Fang et al. (Citation1990) and the second statement has been proved in Zou et al. (Citation2014).

Now, we are ready to present the proof of Theorem 2.2. Then, the proof of Theorem 2.1 can be directly obtained.

Proof of Theorem  2.2:

Define VX,iE{U(XiXj)|Xi},VY,iE{U(YiYj)|Yi},WY,ijU(YiYj)VY,i+VY,j,WX,ijU(XiXj)VX,i+VX,j,B1E(VX,iVX,iT),B2E(VY,iVY,iT).Hence we have that E(VX,iTVX,j)=0 and E(VX,iTWX,ij)=0. According to Lemma 1 in Feng and Liu (Citation2017), we have that E(WX,ijTWX,ij)0 as p goes to infinity and B1=0.5§1{1+o(1)}. The same goes for WY,ij and B2. On this ground, by Lemma A.1, we have that E{(VX,iTVX,j)4}=O(1)E2{(VX,iTVX,j)2},E{(VX,iTAVX,i)2}=O(1)E2(VX,iTAVX,i),E{(VX,iTAVX,j)2}=O(1)E2(VX,iTAVX,j).As a result, the first part of THT has the following decomposition: pn1(n11)(n12)(n13){U(XiXj)TU(XkXl)}2=4pn1(n11)(VX,iTVX,j)2+2pn1(n11)(n12)(VX,iTWX,kl)2+pn1(n11)(n12)(n13)(WX,ijTWX,kl)2J1+J2+J3.According to Lemma A.2 and the fact that E(WX,ijTWX,ij)0 as p goes to infinity, we similarly have that E(J22)=o{p2n3tr(§12)}=o(σn2) and E(J32)=o(p2n4)=o(σn2). Using the similar techniques, we can decompose the rest two parts of THT, hence conclude that THT=4pn1(n11)(VX,iTVX,j)2+4pn2(n21)(VY,iTVY,j)28pn1n2i=1n1j=1n2(VX,iTVY,j)2+op(σn)pAn1+pBn22pCn1,n2+op(σn).Therefore, we have that var(THT)/σn2=p2σn2{var(An1)+var(Bn2)+4var(Cn1,n2)4cov(An1,Cn1,n2)4cov(Bn2,Cn1,n2)}+o(1).Below we will consider each item in var(THT)/σn2 one by one. Before we can get the further expression of var(An1), we need to study E(An12) first. We have that E(An12)=16n12(n11)2E[{(VX,iTVX,j)2}2]=16n12(n11)2[2n1(n11)E{(VX,iTVX,j)4}+4n1(n11)(n12)E{(VX,iTVX,j)2(VX,iTVX,k)2}+n1(n11)(n12)(n13)×E{(VX,iTVX,j)2(VX,kTVX,l)2}].Using the same proof techniques as in Cheng et al. (Citation2018), we can get the following equations: E{(VX,iTVX,j)4}=14p2(p+2)2{3tr2(Λ12)+6tr(Λ14)}{1+o(1)},E{(VX,iTVX,j)2}=12tr(§12){1+o(1)}=p2tr(Λ12){1+o(1)},E{(VX,iTVX,j)2(VX,iTVX,k)2}=14p3(p+2)1{tr2(Λ12)+2tr(Λ14)}{1+o(1)}.On this ground, we have that var(An1)={4n1(n11)tr2(Λ12)p2(p+2)2+8n1ptr(Λ14)tr2(Λ12)p4(p+2)}×{1+o(1)}.Similarly, we have that var(Bn2)=(4n2(n21)tr2(Λ22)p2(p+2)2+8n2ptr(Λ24)tr2(Λ22)p4(p+2))×{1+o(1)},var(Cn1,n2)=[2n1n2tr2(Λ1Λ2)p2(p+2)2+(2n1+2n2)×ptr{(Λ1Λ2)2}tr2(Λ1Λ2)p4(p+2)]{1+o(1)},cov(An1,Cn1,n2)={4n1ptr(Λ13Λ2)tr(Λ1Λ2)tr(Λ12)p4(p+2)}{1+o(1)},cov(Bn2,Cn1,n2)={4n2ptr(Λ23Λ1)tr(Λ1Λ2)tr(Λ22)p4(p+2)}{1+o(1)}.To sum up, we conclude that var(THT)=σn2{1+o(1)}.

Define a sequence of random variables {z1,,zn1+n2} as follows: zi=VX,i for each i{1,,n1}andzn1+j=VY,j for each j{1,,n2}.Let Ek() denote the conditional expectation conditional on {z1,,zk}. Define Dn,k=p1{Ek(THT)Ek1(THT)}, then p1{THTE(THT)}=k=1n1+n2Dn,k. As a result, the sequence {Dn,1,,Dn,n1+n2} constitutes a martingale difference with respect to the σ-fields σ(z1,z2,,zk). To use the martingale central limit theorem, we need to get the following results first: (A1) p2k=1n1+n2σn,k2/var(THT)p1andk=1n1+n2E(Dnk4)=p4o{var2(THT)},(A1) where σn,k2Ek1(Dn,k2).

Proof

Proof of the first part of (EquationA1)

As E(k=1n1+n2σn,k2)=p2×var(THT), we only need to show that as min{n1,n2}, var(k=1n1+n2σn,k2)=o{p4var2(THT)}. Define Γ1,k1=i=1k1(ziziB1) for each k{1,,n11}, and define Γ2,n1+l1=i=1l1(zn1+izn1+iB2) for each l{1,,n21}. For each k{1,,n1}, we have that (EkEk1)(An1)=2n1(n11){VX,kTΓ1,k1VX,ktr(Γ1,k1B1)}+2n1{VX,kTB1VX,ktr(B12)}, (EkEk1)(Bn2)=0,and (EkEk1)(Cn1,n2)=1n1{VX,kTB2VX,ktr(B1B2)}.For each k{n1+1,,n1+n2}, we have that (EkEk1)(An1)=0, (EkEk1)(Bn2)=2n2(n21){VY,kn1TΓ2,k1VY,kn1tr(Γ2,k1B2)}+2n2{VY,kn1TB2VY,kn1tr(B22)},and (EkEk1)(Cn1,n2)=1n1n2{VY,kn1T(i=1n1VX,iVX,iT)VY,kn1tr(i=1n1VX,iVX,iTB2)}.Thus, for each k{1,,n1}, σn,k2=Ek1([2n1(n11){VX,kTΓ1,k1VX,ktr(Γ1,k1B1)}+2n1{VX,kTB1VX,ktr(B12)}2n1{VX,kTB2VX,ktr(B1B2)}]2)=(8n12(n11)2ptr(Γ1,k1Λ1)2tr2(Γ1,k1Λ1)p2(p+2)+16n12(n11)ptr(Γ1,k1Λ13)tr(Γ1,k1Λ1)tr(Λ12)p3(p+2)16n12(n11)×ptr(Γ1,k1Λ1Λ2Λ1)tr(Γ1,k1Λ1)tr(Λ1Λ2)p3(p+2)+8n12ptr[{Λ1(Λ1Λ2)}2]tr2{Λ1(Λ1Λ2)}p4(p+2))×{1+o(1)},and for each k{n1+1,,n1+n2}, σn,k2=Ek1([2n2(n21){VY,kn1TΓ2,k1VY,kn1tr(Γ2,k1B2)}+2n2{VY,kn1TB2VY,kn1tr(B22)}+2n1n2{VY,kn1T(i=1n1VX,iVX,iT)VY,kn1tr(i=1n1VX,iVX,iTB2)}]2)=[ptr{(i=1n1VX,iVX,iTΛ2)2}tr2(Λ2i=1n1VX,iVX,iT)p2(p+2)8n22(n21)2ptr(Γ2,k1Λ2)2tr2(Γ2,k1Λ2)p2(p+2)+16n22(n21)ptr(Γ2,k1Λ23)tr(Γ2,k1Λ2)tr(Λ22)p3(p+2) 16n1n22(n21)×tr{Γ2,k1Λ2(i=1n1VX,iVX,iT)Λ2}p2(p+2)tr(Γ2,k1Λ2)tr{Λ2(i=1n1VX,iVX,iT)}p2(p+2)+8n22ptr(Λ24)tr2(Λ22)p4(p+2)16n1n22ptr(i=1n1VX,iVX,iTΛ23)p3(p+2)tr(i=1n1VX,iVX,iTΛ2)tr(Λ22)p3(p+2)+8n12n22ptr{(i=1n1VX,iVX,iTΛ2)2}p2(p+2)tr2(Λ2i=1n1VX,iVX,iT)p2(p+2)]{1+o(1)}.Hence k=1n1+n2σn,k2=(R1+R2+R3+R4+R5+R6+R6+R7+C0){1+o(1)},where C0 is a constant, and R1=k=1n18n12(n11)2×ptr{(Γ1,k1Λ1)2}tr2(Γ1,k1Λ1)p2(p+2),R2=l=1n28n22(n21)2×ptr{(Γ2,n1+l1Λ2)2}tr2(Γ2,n1+l1Λ2)p2(p+2),R3=k=1n116n12(n11)×[ptr{Γ1,k1(Λ13Λ1Λ2Λ1)}p3(p+2),tr(Γ1,k1Λ1)tr{Λ1(Λ1Λ2)}p3(p+2)],R4=l=1n216n22(n21)×ptr(Γ2,n1+l1Λ23)tr(Γ2,n1+l1Λ2)tr(Λ22)p3(p+2),R5=l=1n216n1n22(n21) ×[ptr{Γ2,n1+l1Λ2(i=1n1VX,iVX,iT)Λ2}p2(p+2)tr(Γ2,n1+l1Λ2)tr{Λ2(i=1n1VX,iVX,iT)}p2(p+2)],R6=l=1n28n12n22ptr{(i=1n1VX,iVX,iTΛ2)2}p2(p+2)tr2(Λ2i=1n1VX,iVX,iT)p2(p+2),R7=l=1n216n1n22ptr(i=1n1VX,iVX,iTΛ23)p3(p+2)tr(i=1n1VX,iVX,iTΛ2)tr(Λ22)p3(p+2).Moreover, to calculate the order of var(R1), we need to evaluate var[k=1n1p2tr{(Γ1,k1Λ1)2}] . Since E([p2k=1n1i=1k1j=1k1{(VX,iTΛ1VX,j)2tr(Λ1B1)2}]2)=p4k=1n1m=1n1E[i=1k1j=1k1{(VX,iTΛ1VX,j)2tr(Λ1B1)2}×l=1m1h=1m1{(VX,lTΛ1VX,h)2tr(Λ1B1)2}]n1k=1n1(k1)2p4E{(VX,iTΛ1VX,j)4tr2(Λ1B1)2}+n12k=1n1(k1)2p4.E{(VX,iTΛ1VX,j)2(VX,iTΛ1VX,l)2tr2(Λ1B1)2}=C1n14p4p4{tr2(Λ14)+tr(Λ18)}+C2n15p4p4{tr(Λ18)p1tr2(Λ14)},where C1 and C2 are constants, we have that var(R1)C1n14p8tr2(Λ12)tr(Λ14)+C2n13p8tr(Λ14){tr(Λ14)p1tr2(Λ12)}.Based on the fact that tr(Λ14)/tr2(Λ12)0 and the following inequality var2(THT)K{tr4(Λ12)p4n14+tr4(Λ22)p4n24}for some constant K, we conclude that p4var(R1)/var2(THT)0,which indicates that var(R1)=o{p4var2(THT)}. By using similar techniques, we conclude that var(Rl)=o{p4×var2(THT)} for each l{1,,7}, based on which we finally conclude that var(k=1n1+n2σn,k2)=o{p4var2(THT)}.

Proof

Proof of the second part of (EquationA1)

For 1kn1, k=1n1E(Dnk4)=k=1n1E([2n1(n11){VX,kTΓ1,k1VX,ktr(Γ1,k1B1)}+2n1{VX,kTB1VX,ktr(B12)}2n1{VX,kTB2VX,ktr(B1B2)}]4)c1{n13p8tr[{Λ1(Λ1Λ2)}2](tr[{Λ1(Λ1Λ2)}2]p1tr2{Λ1(Λ1Λ2)})+n15p8tr4(Λ12)},where c1 is some constant. Then p4k=1n1+n2E(Dnk4)/var2(THT)0. Similarly, for n1kn1+n2, k=n1n1+n2E(Dnk4)c2(n11n24p8tr2(Λ1Λ2)tr{Λ2(Λ1Λ2)}2×[tr{Λ2(Λ1Λ2)}2p1tr2{Λ2(Λ1Λ2)}]+n25p8tr4(Λ22)+n12n24p8tr4(Λ1Λ2)),

where c2 is some constant. Then we have that as n1,n2, p4k=1n1+n2E(Dnk4)var2(THT)0.By using the martingale central limit theorem (Hall & Hyde, Citation1980), we finally conclude that THTE(THT)var(THT)dN(0,1).

Proof of Proposition  2.1:

Using the same techniques as in the proof of Theorem 2.1, we have that E(A1)=tr2(§1) and Var{p2A1tr(Λ12)}=O(p4tr2Λ12[2n1(n11)×{3tr2(Λ12)+6tr(Λ14)p2(p+2)2tr2(Λ12)p4}]).Therefore, p2A1tr(Λ12)1, and similarly, p2A2tr(Λ22)1, hence σˆ0,n2σ0,n21.