Theory and Methods

A New Coefficient of Correlation

Pages 2009-2022 | Received 15 Oct 2019, Accepted 15 Mar 2020, Published online: 28 May 2020

Abstract

Is it possible to define a coefficient of correlation which is (a) as simple as the classical coefficients like Pearson's correlation or Spearman's correlation, and yet (b) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and (c) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients? This article answers this question in the affirmative, by producing such a coefficient. No assumptions are needed on the distributions of the variables. There are several coefficients in the literature that converge to 0 if and only if the variables are independent, but none that satisfy any of the other properties mentioned above. Supplementary materials for this article are available online.

1 Introduction

The three most popular classical measures of statistical association are Pearson’s correlation coefficient, Spearman’s ρ, and Kendall’s τ. These coefficients are very powerful for detecting linear or monotone associations, and they have well-developed asymptotic theories for calculating p-values. However, the big problem is that they are not effective for detecting associations that are not monotonic, even in the complete absence of noise.

There have been many proposals to address this deficiency of the classical coefficients (Josse and Holmes Citation2016), such as the maximal correlation coefficient (Hirschfeld Citation1935; Gebelein Citation1941; Rényi Citation1959; Breiman and Friedman Citation1985), various coefficients based on joint cumulative distribution functions and ranks (Hoeffding Citation1948; Blum, Kiefer, and Rosenblatt Citation1961; Yanagimoto Citation1970; Puri and Sen Citation1971; Rosenblatt Citation1975; Csörgő Citation1985; Romano Citation1988; Bergsma and Dassios Citation2014; Nandy, Weihs, and Drton Citation2016; Weihs, Drton, and Leung Citation2016; Han, Chen, and Liu Citation2017; Wang, Jiang, and Liu Citation2017; Drton, Han, and Shi Citation2018; Gamboa, Klein, and Lagnoux Citation2018; Weihs, Drton, and Meinshausen Citation2018; Deb and Sen Citation2019), kernel-based methods (Gretton et al. Citation2005, Citation2008; Sen and Sen Citation2014; Pfister et al. Citation2018; Zhang et al. Citation2018), information theoretic coefficients (Linfoot Citation1957; Kraskov, Stogbauer, and Grassberger Citation2004; Reshef et al. Citation2011), coefficients based on copulas (Sklar Citation1959; Schweizer and Wolff Citation1981; Dette, Siburg, and Stoimenov Citation2013; Lopez-Paz, Hennig, and Schölkopf Citation2013; Zhang Citation2019), and coefficients based on pairwise distances (Friedman and Rafsky Citation1983; Székely, Rizzo, and Bakirov Citation2007; Székely and Rizzo Citation2009; Heller, Heller, and Gorfine Citation2013; Lyons Citation2013).

Some of these coefficients are popular among practitioners. But there are two common problems. First, most of these coefficients are designed for testing independence, and not for measuring the strength of the relationship between the variables. Ideally, one would like a coefficient that approaches its maximum value if and only if one variable looks more and more like a noiseless function of the other, just as Pearson correlation is close to its maximum value if and only if one variable is close to being a noiseless linear function of the other. It is sometimes believed that the maximal information coefficient (Reshef et al. Citation2011) and the maximal correlation coefficient (Rényi Citation1959) measure the strength of the relationship in the above sense, but we will see later in Section 6 that this is not necessarily correct. Although they are maximized when one variable is a function of the other, the converse is not true. They may be equal to 1 even if the relationship is very noisy.

Second, most of these coefficients do not have simple asymptotic theories under the hypothesis of independence that facilitate the quick computation of p-values for testing independence. In the absence of such theories, the only recourse is to use computationally expensive permutation tests or other kinds of bootstrap.

In this situation, one may wonder if it is at all possible to define a coefficient that is (a) as simple as the classical coefficients, and yet (b) is a consistent estimator of some measure of dependence which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and (c) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients.

Such a coefficient is presented below. The formula is so simple that it is likely that there are many such coefficients, some of them possibly having better properties than the one presented below.

Let (X, Y) be a pair of random variables, where Y is not a constant. Let (X1, Y1), …, (Xn, Yn) be iid pairs with the same law as (X, Y), where n ≥ 2. The new coefficient has a simpler formula if the Xi's and the Yi's have no ties. This simpler formula is presented first, and then the general case is given. Suppose that the Xi's and the Yi's have no ties. Rearrange the data as (X(1), Y(1)), …, (X(n), Y(n)) such that X(1) ≤ ⋯ ≤ X(n). Since the Xi's have no ties, there is a unique way of doing this. Let ri be the rank of Y(i), that is, the number of j such that Y(j) ≤ Y(i). The new correlation coefficient is defined as
\[
\xi_n(X, Y) := 1 - \frac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}. \tag{1}
\]

In the presence of ties, ξn is defined as follows. If there are ties among the Xi's, then choose an increasing rearrangement as above by breaking ties uniformly at random. Let ri be as before, and additionally define li to be the number of j such that Y(j) ≥ Y(i). Then define
\[
\xi_n(X, Y) := 1 - \frac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^{n} l_i (n - l_i)}.
\]

When there are no ties among the Yi's, l1, …, ln is just a permutation of 1, …, n, and so the denominator in the above expression is just n(n² − 1)/3, which reduces this definition to the earlier expression (1).
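To make the definition concrete, here is a minimal R sketch of the tie-aware formula above (which reduces to (1) when there are no ties). The function name xi_n and the use of runif() to break ties among the Xi's uniformly at random are my own choices; this is not the XICOR package implementation mentioned in Remark 10 below.

```r
# Minimal sketch of xi_n (not the XICOR package implementation).
xi_n <- function(x, y) {
  n <- length(x)
  # Sort the data by x, breaking ties among the x's uniformly at random.
  ord <- order(x, runif(n))
  y <- y[ord]
  # r_i = #{j : y_(j) <= y_(i)},  l_i = #{j : y_(j) >= y_(i)}.
  r <- rank(y, ties.method = "max")
  l <- rank(-y, ties.method = "max")
  1 - n * sum(abs(diff(r))) / (2 * sum(l * (n - l)))
}
```

For example, xi_n(runif(100), runif(100)) fluctuates around 0, while xi_n(x, x^2) for x <- runif(100) is close to 1.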

The following theorem shows that ξn is a consistent estimator of a certain measure of dependence between the random variables X and Y.

Theorem 1.1.

If Y is not almost surely a constant, then as n → ∞, ξn(X, Y) converges almost surely to the deterministic limit
\[
\xi(X, Y) := \frac{\int \mathrm{var}(E(1_{\{Y \ge t\}} \mid X)) \, d\mu(t)}{\int \mathrm{var}(1_{\{Y \ge t\}}) \, d\mu(t)}, \tag{2}
\]
where μ is the law of Y. This limit belongs to the interval [0, 1]. It is 0 if and only if X and Y are independent, and it is 1 if and only if there is a measurable function f: ℝ → ℝ such that Y = f(X) almost surely.

Remarks

  1. Unlike most coefficients, ξn is not symmetric in X and Y. But that is intentional. We would like to keep it that way because we may want to understand if Y is a function of X, and not just if one of the variables is a function of the other. If we want to understand whether X is a function of Y, we should use ξn(Y,X) instead of ξn(X,Y). A symmetric measure of dependence, if required, can be easily obtained by taking the maximum of ξn(X,Y) and ξn(Y,X). By Theorem 1.1, this symmetrized coefficient converges in probability to max{ξ(X,Y), ξ(Y,X)}, which is 0 if and only if X and Y are independent, and 1 if and only if at least one of X and Y is a measurable function of the other.

  2. It is clear that ξ(X,Y) ∈ [0,1], since var(1{Y ≥ t}) ≥ var(E(1{Y ≥ t} | X)) for every t. If X and Y are independent, then E(1{Y ≥ t} | X) is a constant, and therefore, ξ(X,Y) = 0. If Y is a measurable function of X, then E(1{Y ≥ t} | X) = 1{Y ≥ t}, and so ξ(X,Y) = 1. The converse implications are proved in the supplementary materials. The most nonobvious part of Theorem 1.1 is the convergence of ξn(X,Y) to ξ(X,Y). The proof of this, given in the supplementary materials, is quite lengthy. For the convenience of the reader (and to facilitate possible future improvements), a brief sketch of the proof is given in Section 8.

  3. In Theorem 1.1, there are no restrictions on the law of (X, Y) other than that Y is not a constant. In particular, X and Y can be discrete, continuous, light-tailed or heavy-tailed.

  4. The coefficient ξn(X,Y) remains unchanged if we apply strictly increasing transformations to X and Y, because it is based on ranks. For the same reason, it can be computed in time O(n log n). We will see later that the actual computation on a computer is also very fast. The cost that we have to pay for fast computability, as we will see in Section 4.3, is that the test of independence based on ξn is sometimes less powerful than tests based on statistics whose computational times are quadratic in the sample size.

  5. The limiting value ξ(X,Y) has appeared earlier in the literature (Dette, Siburg, and Stoimenov Citation2013; Gamboa, Klein, and Lagnoux Citation2018). The paper (Dette, Siburg, and Stoimenov Citation2013) gives a copula-based estimator for ξ(X,Y) when X and Y are continuous, that is consistent under smoothness assumptions on the copula and appears to be computable in time n^{5/3} for an optimal choice of tuning parameters.

  6. The coefficient ξn looks similar to some coefficients defined earlier (Friedman and Rafsky Citation1983; Sarkar and Ghosh Citation2018), but in spite of its simple form, it seems to be genuinely new.

  7. Multivariate measures of dependence and conditional dependence inspired by ξn are now available in the preprint (Azadkia and Chatterjee Citation2019).

  8. If the Xi’s have ties, then ξn(X,Y) is a randomized estimate of ξ(X,Y), because of the randomness coming from the breaking of ties. This can be ignored if n is large, because ξn is guaranteed to be close to ξ by Theorem 1.1. Alternatively, one can consider taking the average of ξn over all possible increasing rearrangements of the Xi’s.

  9. If there are no ties among the Yi's, the maximum possible value of ξn(X,Y) is (n − 2)/(n + 1), which is attained if Yi = Xi for all i. This can be noticeably less than 1 for small n. For example, for n = 20, this value is approximately 0.86. Users should be aware of this fact about ξn. On the other hand, it is not very hard to prove that the minimum possible value of ξn(X,Y) is −1/2 + O(1/n), and the minimum is attained when the top n/2 values of Yi are placed alternately with the bottom n/2 values (a short numerical check of both extremes appears after this list). This seems to be paradoxical, since Theorem 1.1 says that the limiting value is in [0,1]. The resolution is that Theorem 1.1 only applies to iid samples. Therefore, a large negative value of ξn has only one possible interpretation: the data does not resemble an iid sample.

  10. An R package for calculating ξn and p-values for testing independence (based on the theory presented in the next section), named XICOR, is now available on CRAN (Chatterjee and Holmes Citation2020).
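As a quick illustration of the extremes mentioned in Remark 9, the following snippet (my own check, reusing the xi_n sketch from Section 1) reproduces the maximum (n − 2)/(n + 1) and a value close to −1/2 for n = 20.

```r
# Numerical check of the extremes noted in Remark 9 (illustration only).
n <- 20
x <- 1:n
xi_n(x, x)                                        # maximum: (n - 2)/(n + 1) = 18/21, about 0.857
y_alt <- as.vector(rbind((n/2 + 1):n, 1:(n/2)))   # top half interleaved with bottom half
xi_n(x, y_alt)                                    # close to -1/2 (about -0.496 for n = 20)
```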

2 Testing Independence

The main purpose of ξn is to provide a measure of the strength of the relationship between X and Y, and not to serve as a test statistic for testing independence. However, one can use it for testing independence if so desired. In fact, it has a nice and simple asymptotic theory under independence. The next theorem gives the asymptotic distribution of √n ξn under the hypothesis of independence and the assumption that Y is continuous. The more general asymptotic theory in the absence of continuity is presented after that.

Theorem 2.1.

Suppose that X and Y are independent and Y is continuous. Then √n ξn(X,Y) → N(0, 2/5) in distribution as n → ∞.

The above result is essentially a restatement of the main theorem of Chao, Bai, and Liang (Citation1993), where a similar statistic for measuring the “presortedness” of a permutation was studied. We will see later in numerical examples that the convergence in Theorem 2.1 happens quite fast; it is roughly valid even for n as small as 20.

If X and Y are independent but Y is not continuous, then √n ξn still converges in distribution to a centered Gaussian law, but the variance has a more complicated expression, and may depend on the law of Y. For each t ∈ ℝ, let F(t) := P(Y ≤ t) and G(t) := P(Y ≥ t). Let ϕ(y, y′) := min{F(y), F(y′)}. Define
\[
\tau^2 = \frac{E\,\phi(Y_1, Y_2)^2 - 2\,E(\phi(Y_1, Y_2)\phi(Y_1, Y_3)) + (E\,\phi(Y_1, Y_2))^2}{(E\,G(Y)(1 - G(Y)))^2}, \tag{3}
\]
where Y1, Y2, Y3 are independent copies of Y. The following theorem generalizes Theorem 2.1.

Theorem 2.2.

Suppose that X and Y are independent. Then √n ξn(X,Y) converges to N(0, τ²) in distribution as n → ∞, where τ² is given by the formula (3) stated above. The number τ² is strictly positive if Y is not a constant, and equals 2/5 if Y is continuous.

The simple reason why τ² does not depend on the law of Y if Y is continuous is that in this case F(Y) and G(Y) are Uniform[0,1] random variables, which implies that the expectations in (3) do not depend on the law of Y. If Y is not continuous, then τ² may depend on the law of Y. For example, it is not hard to show that if Y is a Bernoulli(1/2) random variable, then τ² = 1. Fortunately, if Y is not continuous, there is a simple way to estimate τ² from the data using the estimator
\[
\hat{\tau}_n^2 = \frac{a_n - 2 b_n + c_n^2}{d_n^2},
\]
where an, bn, cn, and dn are defined as follows. For each i, let
\[
R(i) := \#\{j : Y_j \le Y_i\}, \qquad L(i) := \#\{j : Y_j \ge Y_i\}. \tag{4}
\]

Let u1 ≤ u2 ≤ ⋯ ≤ un be an increasing rearrangement of R(1), …, R(n). Let vi := ∑_{j=1}^{i} uj for i = 1, …, n. Define
\[
a_n := \frac{1}{n^4} \sum_{i=1}^{n} (2n - 2i + 1)\, u_i^2, \qquad
b_n := \frac{1}{n^5} \sum_{i=1}^{n} (v_i + (n - i) u_i)^2,
\]
\[
c_n := \frac{1}{n^3} \sum_{i=1}^{n} (2n - 2i + 1)\, u_i, \qquad
d_n := \frac{1}{n^3} \sum_{i=1}^{n} L(i)(n - L(i)).
\]

Then we have the following result.

Theorem 2.3.

The estimator τ̂n² can be computed in time O(n log n), and converges to τ² almost surely as n → ∞.
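A small R sketch of this estimator and of the resulting one-sided test may be helpful. It follows the formulas above, but the function names (tau2_hat, xi_pvalue) and the exact interface are my own, and it reuses the xi_n sketch from Section 1 rather than the XICOR package.

```r
# Sketch of the variance estimator of Theorem 2.3 and the asymptotic test.
tau2_hat <- function(y) {
  n <- length(y)
  R <- rank(y, ties.method = "max")     # R(i) = #{j : y_j <= y_i}
  L <- rank(-y, ties.method = "max")    # L(i) = #{j : y_j >= y_i}
  u <- sort(R)                          # increasing rearrangement of the R(i)'s
  v <- cumsum(u)
  i <- seq_len(n)
  an <- sum((2 * n - 2 * i + 1) * u^2) / n^4
  bn <- sum((v + (n - i) * u)^2) / n^5
  cn <- sum((2 * n - 2 * i + 1) * u) / n^3
  dn <- sum(L * (n - L)) / n^3
  (an - 2 * bn + cn^2) / dn^2
}

# Right-tailed p-value for H0: X and Y independent, based on sqrt(n) * xi_n.
xi_pvalue <- function(x, y) {
  n <- length(x)
  pnorm(sqrt(n) * xi_n(x, y), sd = sqrt(tau2_hat(y)), lower.tail = FALSE)
}
```

For continuous Y, tau2_hat(y) is close to 2/5, in line with Theorem 2.2.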

I do not have the asymptotic theory for ξn(X,Y) when X and Y are dependent. Simulation results presented in Section 4.2 indicate that even under dependence, √n(ξn − ξ) is asymptotically normal.

One may also ask about the asymptotic null distribution of the symmetrized statistic max{ξn(X,Y),ξn(Y,X)}. It is likely that under independence, this behaves like the maximum of a pair of correlated normal random variables. At this time I do not have a proof of this claim, nor a conjecture about the parameters of this distribution. Of course, it is easy to carry out a permutation test for independence using the symmetrized statistic.

The rest of the article is organized as follows. We begin with an amusing application of ξn to Galton’s peas data in Section 3. Various simulation results are presented in Section 4. An application to a famous gene expression dataset is given in Section 5. The inadequacy of MIC and maximal correlation for measuring the strength of relationship between X and Y is proved in Section 6. A summary of the advantages and disadvantages of using ξn is given in Section 7. A sketch of the proof of Theorem 1.1 is given in Section 8. Complete proofs of all the theorems of this section and the previous one are available in the supplementary materials, and also at https://arxiv.org/abs/1909.10140.

3 Example: Galton’s Peas Revisited

Sir Francis Galton’s peas data, collected in 1875, is one of the earliest and most famous datasets in the history of statistics. The data consists of 700 observations of mean diameters of sweet peas in mother plants and daughter plants. The exact process of data collection was not properly recorded; all we know is that Galton sent out packets of seeds to friends, who planted the seeds, grew the plants, and sent the seeds from the new plants back to Galton (see Stigler Citation1986, p. 296 for further details). The dataset is freely available as the “peas” data frame in the psych package in R.

Let X be the mean diameter of peas in a mother plant, and Y be the mean diameter of peas in the daughter plant. As already observed by Pearson long ago, the correlation between X and Y is around 0.35. The Xi’s have many ties in this data, which means that ξn(X,Y) is random due to the random breaking of ties. Averaging over 10,000 simulations gave a value close to 0.11 for ξn(X,Y). The p-value for the test of independence using Theorems 2.2 and 2.3 came out to be less than 0.0001, so ξn(X,Y) succeeded in the task of detecting dependence between X and Y.
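For readers who wish to reproduce these numbers, a rough sketch in R follows, reusing the xi_n function from Section 1. I am assuming that the psych::peas data frame has columns named parent and child (this should be checked with str(peas)), and the averaging over tie-breaking is done here with 1000 repetitions rather than 10,000.

```r
# Sketch of the Galton peas computation (column names are assumed; verify with str(peas)).
library(psych)
data(peas)
x <- peas$parent   # mean diameter in the mother plant
y <- peas$child    # mean diameter in the daughter plant
mean(replicate(1000, xi_n(x, y)))   # averages over random tie-breaking; roughly 0.11
xi_n(y, x)                          # roughly 0.92: X is nearly a noiseless function of Y
```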

Thus far, there is nothing surprising. The real surprise, however, was that the value of ξn(Y,X) (instead of ξn(X,Y)) turned out to be approximately 0.92 (and it appeared to be independent of the tie-breaking process). By Theorem 1.1, this means that X is close to being a noiseless function of Y. From the scatterplot of the data (Figure 1), it is not clear how this can be possible. The mystery is resolved by looking at the contingency table of the data (Table 1). Each row of the table corresponds to a value of Y, and each column corresponds to a value of X. We notice that each column has multiple cells with nonzero counts, meaning that for each value of X there are many different values of Y in the data. On the other hand, each row in the table contains exactly one cell with a nonzero (and often quite large) count. That is, for each value of Y there is only one corresponding value of X in the data.

Fig. 1 Scatterplot of Galton’s peas data. Thickness of a dot represents the number of data points at that location. (Figure courtesy of Susan Holmes.)


Table 1 Contingency table for Galton’s peas data.

For example, among all mother plants with mean diameter 15, there were 46 cases where the daughter plant had diameter 13.77, 14 had diameter 14.77, 11 had diameter 16.77, 14 had diameter 17.77, and 4 had diameter 18.77. On the other hand, for all 46 daughter plants in the data with diameter 13.77, the mother plants had diameter 15. Similarly, for all 34 daughter plants with diameter 14.28, the mother plants had diameter 16.

Common sense suggests that the reason behind this strange phenomenon is surely some quirk of the data collection or recording method, and not some profound biological fact. (It is probably not a simple rounding effect, though; for instance, in all 46 cases where Y = 13.77, we have X = 15, but in all 37 cases where Y = 13.92, which is only slightly different from 13.77, we have X = 17.) However, if we imagine that the values recorded in the data are the exact values that were measured and that the observations were iid (neither of which is exactly true, as I learned from Steve Stigler), then looking at Table 1 there is no way to escape the conclusion that the mean diameter of peas in the mother plant can be predicted with considerable certainty from the mean diameter of the peas in the daughter plant (but not the other way around). The coefficient ξn(Y,X) discovers this fact numerically by attaining a value close to 1. It is probable that this feature of Galton's peas data has been noted before, but if so, it is certainly hard to find; I could not find any reference where this is mentioned, in spite of much effort.

4 Simulation Results

The goal of this section is to investigate the performance of ξn using numerical simulations, and compare it to other methods. We compare general performance, run times, and powers for testing independence.

4.1 General Performance, Equitability, and Generality

Figure 2 gives a glimpse of the general performance of ξn as a measure of association. The figure has three rows. Each row starts with a scatterplot where Y is a noiseless function of X, and X is generated from the uniform distribution on [−1, 1]. As we move to the right, more and more noise is added. The sample size n is taken to be 100 in each case, to show that ξn performs well in relatively small samples. In each row, we see that ξn(X,Y) is very close to 1 for the leftmost graph, and progressively deteriorates as we add more noise. By Theorem 2.1, the 95th percentile of ξn(X,Y) under the hypothesis of independence, for n = 100, is approximately 0.066. The values in Figure 2 are all much higher than that.

Fig. 2 Values of ξn(X,Y) for various kinds of scatterplots, with n = 100. Noise increases from left to right. The 95th percentile of ξn(X,Y) under the hypothesis of independence is approximately 0.066.


An interesting observation from Figure 2 is that ξn appears to be an equitable coefficient, as defined in Reshef et al. (Citation2011). The definition of equitability is not mathematically precise but intuitively clear. Roughly, an equitable measure of correlation “gives similar scores to equally noisy relationships of different types.” Figure 2 indicates that ξn has this property as long as the relationship is “functional.” It is not equitable for relationships that are not functional, although that is expected because ξn measures how well Y can be predicted by X.

The other criterion for a good measure of correlation, according to Reshef et al. (Citation2011), is that the coefficient should be “general,” in that it should be able to detect any kind of pattern in the scatterplot. In statistical terms, this means that the test of independence based on the coefficient should be consistent against all alternatives. By Theorem 1.1, this is clearly true for ξn, with no restrictions on the law of (X, Y). Among other coefficients, only maximal correlation has this property in full generality at the population level, but there is no estimator of maximal correlation that is known to be consistent for all possible distributions of (X, Y).

4.2 Validity of the Asymptotic Theory

Next, let us numerically investigate the distribution of ξn(X,Y) when X and Y are independent. Taking the Xi's and Yi's to be independent Uniform[0,1] random variables, and n = 20, 10,000 values of ξn(X,Y) were generated. The histogram of √n ξn(X,Y) is displayed in Figure 3, superimposed with the asymptotic density function predicted by Theorem 2.1. We see that already for n = 20, the agreement is striking. A much better agreement is obtained with n = 1000, also shown in Figure 3. Next, the Xi's and Yi's were drawn as independent Binomial(3, 0.5) random variables. The value of τ² was estimated using Theorem 2.3, and was plugged into Theorem 2.2 to obtain the asymptotic distribution of √n ξn. Again, the true distributions are shown to be in good agreement with the asymptotic distributions, for n = 20 and n = 1000, in Figure 3.
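The continuous-case check described above can be reproduced with a few lines of R (my own sketch, using the xi_n function from Section 1):

```r
# Null distribution of sqrt(n) * xi_n for independent Uniform[0,1] data, n = 20.
set.seed(1)
n <- 20
stats <- replicate(10000, sqrt(n) * xi_n(runif(n), runif(n)))
hist(stats, freq = FALSE, breaks = 40, main = "sqrt(n) * xi_n under independence")
curve(dnorm(x, mean = 0, sd = sqrt(2 / 5)), add = TRUE)   # asymptotic density (Theorem 2.1)
```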

Fig. 3 Histogram of 10,000 simulations of √n ξn, superimposed with the asymptotic density function.


Some simulation analysis was also carried out to investigate the convergence of ξn under dependence. For that, the following simple model was chosen. Let X ~ Bernoulli(p) and Z ~ Bernoulli(p′) be independent random variables, and let Y := XZ. Then X and Y are dependent Bernoulli random variables. An easy calculation shows that
\[
\xi(X, Y) = \frac{p'(1 - p)}{1 - p p'}.
\]

With p = 0.4 and p′ = 0.5, we get ξ(X,Y) = 0.375. To test the convergence of ξn to ξ, 10,000 simulations were carried out with n = 1000. In this sample, the mean value of ξn was approximately 0.374 and the standard deviation was approximately 0.040 (which means that the standard deviation of √n ξn was approximately 1.254). The histogram given in Figure 4 shows an excellent fit with a normal distribution with the above mean and standard deviation.
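The limit 0.375 is easy to check by simulation. The following sketch is my own code, again reusing the xi_n function from Section 1 (which handles the ties in the binary data by random tie-breaking):

```r
# Check of the limit xi(X, Y) = p'(1 - p)/(1 - p p') = 0.375 for p = 0.4, p' = 0.5.
set.seed(2)
n <- 1000; p <- 0.4; p_prime <- 0.5
x <- rbinom(n, 1, p)
y <- x * rbinom(n, 1, p_prime)
xi_n(x, y)   # typically within a few hundredths of 0.375
```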

Fig. 4 Histogram of 10,000 simulations of ξn(X,Y) when X and Y are dependent Bernoulli random variables (see Section 4.2), superimposed with the normal density function of suitable mean and variance. Here, ξ(X,Y)=0.375 and n = 1000.


4.3 Power and Run Time Comparisons

In this section, we compare the power of the test of independence based on ξn against a number of powerful tests proposed in recent years, and we also compare the run times of these tests. The main finding is that ξn is less powerful than some of the other tests if the signal is relatively smooth, and more powerful if the signal is wiggly. In terms of run time, ξn has a big advantage since it is computable in time O(n log n), whereas its competitors require time of order n². This is further validated through numerical examples, which show that ξn is essentially the only statistic that can be computed in a reasonable amount of time when the sample size is on the order of several thousand.

Comparisons are carried out with the following popular test statistics for testing independence. I excluded statistics that are either too new (because they are not time-tested, and software is not available in many cases) or too old (because they are superseded by newer ones). In the following, (X1, Y1), …, (Xn, Yn) is an iid sample of points from some distribution on ℝ².

  1. Maximal information coefficient (MIC) (Reshef et al. Citation2011): Recall that the mutual information of a bivariate probability distribution is the Kullback–Leibler divergence between that distribution and the product of its marginals. Given any scatterplot of n points, suppose we divide it into an x × y array of rectangles. The proportions of points falling into these rectangles define a bivariate probability distribution. Let I be the mutual information of this probability distribution. The maximum of I / log min{x, y} over all subdivisions into rectangles, under the constraint xy < n^0.6, is called the maximal information coefficient of the scatterplot.

  2. Distance correlation (Székely, Rizzo, and Bakirov Citation2007): Let aij := |Xi − Xj| and bij := |Yi − Yj|. Center these numbers by defining Aij := aij − ai· − a·j + a·· and Bij := bij − bi· − b·j + b··, where ai· is the average of aij over all j, etc. The distance correlation between the two samples is simply the Pearson correlation between the Aij's and the Bij's.

  3. The HHG test (Heller, Heller, and Gorfine Citation2013): Take any i and j. Divide the Xk's into two groups depending on whether |Xi − Xk| < |Xi − Xj| or not. Similarly classify the Yk's into two groups depending on whether |Yi − Yk| < |Yi − Yj| or not. These classifications partition the scatterplot into four compartments, and the numbers of points in these compartments define a 2 × 2 contingency table. The HHG test statistic is a linear combination of the Pearson χ² statistics for testing independence in these contingency tables over all choices of i and j.

  4. The Hilbert–Schmidt independence criterion (HSIC) (Gretton et al. Citation2005, Citation2008): Let k and l be symmetric positive definite kernels on ℝ². For example, we may take the Gaussian kernel k(x, y) = l(x, y) = e^{−|x−y|²/(2σ²)} for some σ > 0. Let kij := k(Xi, Xj) and lij := l(Yi, Yj). Then the HSIC statistic is
\[
\frac{1}{n^2} \sum_{i,j} k_{ij} l_{ij} + \frac{1}{n^4} \sum_{i,j,q,r} k_{ij} l_{qr} - \frac{2}{n^3} \sum_{i,j,q} k_{ij} l_{iq}.
\]

All of the above test statistics are consistent for testing independence under mild conditions. Moreover, the HSIC test has been proved to be minimax rate-optimal against uniformly smooth alternatives (Li and Yuan Citation2019).

Power comparisons were carried out with sample size n = 100. In each case, 500 simulations were used to estimate the power. The R packages energy, minerva, HHG, and dHSIC were used for calculating the distance correlation, MIC, HHG, and HSIC statistics, respectively. Since the HHG test is very slow for large samples, a fast univariate version of the HHG test (Heller et al. Citation2016) was used. Generating X from the uniform distribution on [−1, 1], the following six alternatives were considered:

  1. Linear: Y = 0.5X + 3λε, where λ is a noise parameter ranging from 0 to 1, and ε ~ N(0, 1) is independent of X.

  2. Step function: Y = f(X) + 10λε, where f takes the values −3, 2, −4, and −3 in the intervals [−1, −0.5), [−0.5, 0), [0, 0.5), and [0.5, 1], respectively.

  3. W-shaped: Y = |X + 0.5| 1{X < 0} + |X − 0.5| 1{X ≥ 0} + 0.75λε.

  4. Sinusoid: Y = cos(8πX) + 3λε.

  5. Circular: Y = Z√(1 − X²) + 0.9λε, where Z is 1 or −1 with equal probability, independent of X.

  6. Heteroscedastic: Y = 3(σ(X)(1 − λ) + λ)ε, where σ(X) = 1 if |X| ≤ 0.5 and 0 otherwise. As λ increases from 0 to 1, the relationship becomes more and more homoscedastic.

The coefficients in all of the above were chosen to ensure that a full range of powers was observed as λ was varied from 0 to 1. The results are presented in Figure 5. The main observation from this figure is that ξn is more powerful than the other tests when the signal has an oscillatory nature, such as for the W-shaped scatterplot and the sinusoid. For the step function, too, it performs reasonably well. However, ξn has inferior performance for smoother alternatives, namely, the linear, circular, and heteroscedastic scatterplots.
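As an indication of how such power estimates can be obtained for ξn (the other tests require their respective packages), here is a rough sketch in R of the asymptotic test's power for the sinusoid alternative. It mirrors the setup described above, but it is my own code, not the code used for Figure 5; power_xi and sinusoid are names I chose for illustration.

```r
# Rough power estimate for the xi_n-based asymptotic test (continuous Y).
power_xi <- function(gen, n = 100, nsim = 500, level = 0.05) {
  crit <- qnorm(1 - level, sd = sqrt(2 / 5))          # one-sided cutoff from Theorem 2.1
  mean(replicate(nsim, sqrt(n) * do.call(xi_n, gen(n)) > crit))
}

sinusoid <- function(lambda) function(n) {
  x <- runif(n, -1, 1)
  list(x = x, y = cos(8 * pi * x) + 3 * lambda * rnorm(n))
}

power_xi(sinusoid(0.2))   # estimated power at noise level lambda = 0.2
```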

Fig. 5 Comparison of powers of several tests of independence. The titles describe the shapes of the scatterplots. The level of the noise increases from left to right. In each case, the sample size is 100, and 500 simulations were used to estimate the power.


Next, let us turn to the comparison of run times for tests of independence based on the five competing test statistics. For all except ξn, the only way to test for independence is to run a permutation test. (There is a theoretical test for HSIC, but it is only a crude approximation.) The number of permutations was taken to be 200, the smallest respectable number. Usually 200 is too small for a permutation test, but I chose such a small number so that the programs terminate in a manageable amount of time for the larger values of n. For ξn, the asymptotic test was used, because it performs as well as the permutation test even in very small samples, as we saw in Section 4.2.

For distance correlation, HSIC, and HHG, the permutation tests are directly available from the corresponding R packages. For MIC, I had to write the code myself because a permutation test is not automatically available from the package, so its run time can probably be improved somewhat with better code. For the HHG test, the function requires the distance matrices of X and Y to be supplied as arguments. For the sake of fairness, the time required for computing the distance matrices was included in the total time for carrying out the permutation tests.

The results are presented in Table 2. Every test was hundreds or even thousands of times slower than the test based on ξn for all sample sizes 500 and above. For sample size 10,000, the HHG test was terminated after it failed to finish within 30 min.

Table 2 Run times (in sec) for permutation tests of independence, with 200 permutations.

5 Example: Yeast Gene Expression Data

In a landmark paper in gene expression studies (Spellman et al. Citation1998), the authors studied the expressions of 6223 yeast genes with the goal of identifying genes whose transcript levels oscillate during the cell cycle. In lay terms, this means that the expressions were studied over a number of successive time points (23, to be precise), and the goal was to identify the genes for which the transcript levels follow an oscillatory pattern. This example illustrates the utility of correlation coefficients in detecting patterns, because the number of genes is so large that identifying patterns by visual inspection is out of the question.

This dataset was used in the paper (Reshef et al. Citation2011) to demonstrate the efficacy of MIC for identifying patterns in scatterplots. The authors of Reshef et al. (Citation2011) used a curated version of the dataset, where they excluded all genes for which there were missing observations, and made several other modifications. The revised dataset has 4381 genes. I used this curated dataset (available through the R package minerva) to study the power of ξn in discovering genes with oscillating transcript levels, and compare its performance with the competing tests from Section 4.3.

There are literally hundreds of papers analyzing this particular dataset. I will not attempt to go deep into this territory in any way, because that will take us too far afield. The sole purpose of the analysis that follows is to compare the performance of ξn with the competing tests.

For each test, p-values were obtained and a set of significant genes were selected using the Benjamini–Hochberg FDR procedure (Benjamini and Hochberg Citation1995), with the expected proportion of false discoveries set at 0.05.

It turned out that there are 215 genes (out of 4381) that were selected by ξn but by none of the other tests. This is surprising in itself, but what is more surprising is the nature of these genes. Figure 6 shows the transcript levels of the top 6 of these genes (that is, those with the smallest p-values). There is no question that these genes exhibit almost perfect oscillatory behavior, and yet they were not selected by any of the other tests.

Fig. 6 Transcript levels of the top 6 among the 215 genes selected by ξn but by no other test. The dashed lines are fitted by k-nearest neighbor regression with k = 3. The name of the gene is displayed below each plot.


One may wonder whether this is true only for the top 6 genes, or typical of all 215. To investigate that, I took a random sample of 6 genes from the 215 and looked at their transcript levels. The results are shown in Figure 7. Even for a random sample, we see strong oscillatory behavior. This behavior was consistently observed in other random samples.

Fig. 7 Transcript levels of a random sample of 6 genes from the 215 genes that were selected by ξn but by no other test.


How about the genes that were selected by at least one of the other tests, but not by ξn? Figure 8 shows the transcript levels of a random sample of 6 genes from this set. I think it is reasonable to say that these plots show slight increasing or decreasing trends, or heteroscedasticity, but no definite oscillatory patterns. Repeated samplings showed similar results.

Fig. 8 Transcript levels of 6 randomly sampled genes from the set of genes that were not selected by ξn but were selected by at least one other test.


Thus, we arrive at the following conclusion. The genes selected by ξn are much more likely than the genes selected by the other tests to be the ones that really exhibit oscillatory patterns in their transcript levels during the cell cycle. This is because the other tests prioritize monotone trends over cyclical patterns. Most of the 215 genes that were selected by ξn but not by any of the other tests show pronounced oscillatory patterns. The fact that ξn is particularly powerful for detecting oscillatory behavior turns out to be very useful in this example. Of course, ξn also selects genes that show other kinds of patterns (it selects a total of 586 genes), but those are selected by at least one of the other tests and therefore do not appear in this set of 215 genes that are selected exclusively by ξn.

6 MIC and Maximal Correlation May Not Correctly Measure the Strength of the Relationship

It is sometimes mistakenly believed that MIC and maximal correlation measure the strength of the relationship between X and Y; in particular, that they attain their maximum value, 1, if and only if the relationship between X and Y is perfectly noiseless. In this section we show that this is not true: MIC and maximal correlation can attain the value 1 even if the actual relationship between X and Y is very noisy.

In the example shown in Figure 9, 200 samples of (X, Y) are generated from a mixture of bivariate normal distributions. With probability 1/2, (X, Y) is drawn from the standard bivariate normal distribution, and with probability 1/2, (X, Y) is drawn from the bivariate normal distribution with mean (5, 5) and identity covariance matrix. The data form two clusters of roughly equal size that are close but nearly disjoint. Clearly, there is a lot of noise in the relationship between X and Y: given X, we can only tell whether Y comes from N(0, 1) or N(5, 1), but nothing else. Yet, rounded off to two decimal places, MIC is 1.00 and maximal correlation (as computed by the ACE algorithm, Breiman and Friedman Citation1985) is 0.99 for this scatterplot. The coefficient ξn, on the other hand, is well behaved; it turns out to be 0.48, indicating the presence of a significant relationship between X and Y, but not a noiseless one. Common sense suggests that the value 0.48 reflects the strength of the relationship in Figure 9 much better than 0.99 or 1.00.
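A sketch of this example in R (my own code, again using the xi_n function from Section 1):

```r
# Mixture of two bivariate normals: half the points near (0, 0), half near (5, 5).
set.seed(3)
n <- 200
shift <- 5 * rbinom(n, 1, 0.5)      # cluster label: 0 or 5 with equal probability
x <- rnorm(n) + shift
y <- rnorm(n) + shift
xi_n(x, y)                          # roughly 0.5, far from 1
plot(x, y)                          # two nearly disjoint clusters, as in Figure 9
```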

Fig. 9 Scatterplot of a mixture of bivariate normals, with n = 200. For this plot, maximal correlation =0.99, MIC =1.00, and ξn=0.48.


In the supplementary materials of Reshef et al. (Citation2011), it is shown that MIC = 1 when Y = f(X) for a large class of functions f. However, it is not shown that the converse is true, that is, that MIC = 1 implies that X and Y have a noiseless relationship. Figure 9 indicates that the converse is in fact probably not true. The phenomenon is not an artifact of the sample size; it remains consistently true for larger sample sizes. Moreover, scatterplots such as the one in Figure 9 are not uncommon in real datasets.

The following mathematical result uses the intuition gained from the above example to confirm that there indeed exist very noisy relationships which are declared to be perfectly noiseless by maximal correlation and MIC.

Proposition 6.1.

Let I1, I2, J1, and J2 be bounded intervals such that I1 and I2 are disjoint, and J1 and J2 are disjoint. Suppose that the law of a random vector (X, Y) is supported on the union of the two rectangles I1 × J1 and I2 × J2, giving equal mass to each. Then the maximal correlation between X and Y is 1, and the MIC between X and Y in an iid sample of size n tends to 1 in probability as n → ∞.

Proof.

Recall that the maximal correlation between two random variables X and Y is defined as the maximum possible correlation between f(X) and g(Y) over all measurable f and g such that f(X) and g(Y) are square-integrable and nondegenerate. In the setting of this proposition, let f be the indicator of the interval I1 and g be the indicator of the interval J1. Then f(X) = 1 if and only if g(Y) = 1, because the nature of (X, Y) implies that X ∈ I1 if and only if Y ∈ J1. Thus, f(X) = g(Y), and so the maximal correlation between X and Y is equal to 1.

Next, recall the definition of MIC from Section 4.3. The support of (X, Y) can be partitioned into the 2 × 2 array of rectangles I1 × J1, I1 × J2, I2 × J1, and I2 × J2. The first and fourth rectangles carry mass 1/2 each, and the other two carry mass 0. Therefore, when n is large, the first and fourth rectangles receive approximately n/2 points each, and the other two receive no points. A simple calculation shows that the mutual information of the corresponding contingency table is approximately log 2. Thus, the contribution of this array of rectangles to the definition of MIC is approximately 1, which shows that the MIC itself is approximately 1 (since it cannot exceed 1 and is defined to be the maximum of the contributions from all rectangular partitions with xy < n^0.6). □
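To spell out the “simple calculation” in the last step of the proof: in the limiting 2 × 2 table the cell proportions are 1/2, 0, 0, 1/2, and all row and column marginals equal 1/2, so
\[
I = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = \tfrac{1}{2} \log \frac{1/2}{1/4} + \tfrac{1}{2} \log \frac{1/2}{1/4} = \log 2,
\]
and the normalization gives I / log min{x, y} = log 2 / log 2 = 1 for this 2 × 2 partition.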

7 Summary

Let us now briefly summarize what we learned. The new correlation coefficient offers many advantages over its competitors. The following is a partial list:

  1. It has a very simple formula. The formula is as simple as those for the classical coefficients, like Pearson’s correlation, Spearman’s ρ, or Kendall’s τ.

  2. Due to its simple formula, it is (a) easy to understand conceptually, and (b) computable very quickly, not only in theory but also in practice. Most of its competitors are hundreds of times slower to compute even in samples of moderately large size, such as 500.

  3. It is a function of ranks, which makes it robust to outliers and invariant under monotone transformations of the data.

  4. It converges to a limit which has an easy interpretation as a measure of dependence. The limit ranges from 0 to 1. It is 1 if and only if Y is a measurable function of X and 0 if and only if X and Y are independent. Thus, ξn gives an actual measure of the strength of the relationship.

  5. It has a very simple asymptotic theory under the hypothesis of independence, which is roughly valid even for samples of size as small as 20. This allows theoretical tests of independence, bypassing computationally expensive permutation tests that are necessary for other tests.

  6. The test of independence based on ξn is consistent against all alternatives, with no exceptions. No other test has this property.

  7. None of the results mentioned above require any assumptions about the law of (X, Y) except that Y is not a constant. One can even apply ξn to categorical data, by converting the categorical variables to integer-valued variables in any arbitrary way.

  8. In simulations and real data, ξn seems to be more powerful than other tests for detecting oscillatory signals.

Against all of the above advantages, ξn has only one disadvantage: It seems to have less power than several popular tests of independence when the signal is smooth and nonoscillatory. Although such signals comprise the majority of types observed in practice, this is a matter of concern only when the sample size is small. In large samples, all tests are powerful, and computational time becomes a much bigger concern.

8 Proof Sketch

This section contains a brief sketch of the proof of convergence of ξn to ξ. For simplicity, let us only consider the case where X and Y are continuous. First, note that by the Glivenko–Cantelli theorem, ri/n ≈ F(Y(i)), where F is the cumulative distribution function of Y. Thus,
\[
\xi_n(X, Y) \approx 1 - \frac{3}{n} \sum_{i=1}^{n} |F(Y_i) - F(Y_{N(i)})|, \tag{5}
\]
where N(i) is the unique index j such that Xj is immediately to the right of Xi when the X's are arranged in increasing order. If Xi is the rightmost value, define N(i) arbitrarily; it does not matter, since the contribution of a single term in the above sum is O(1/n).

The first important observation is that for any x, y ∈ ℝ,
\[
|F(x) - F(y)| = \int (1_{\{t \le x\}} - 1_{\{t \le y\}})^2 \, d\mu(t), \tag{6}
\]
where μ is the law of Y. This is true because the integrand is 1 for t between x and y, and 0 outside.

Now suppose that we condition on X1, …, Xn. Since Xi is likely to be very close to X_{N(i)}, the random variables Yi and Y_{N(i)} are likely to be approximately iid after this conditioning. This is the second key observation (which is tricky to make rigorous in the absence of any assumptions on the law of (X, Y)), and it leads to the approximation
\[
E\bigl[(1_{\{t \le Y_i\}} - 1_{\{t \le Y_{N(i)}\}})^2 \mid X_1, \ldots, X_n\bigr] \approx 2\,\mathrm{var}(1_{\{t \le Y_i\}} \mid X_1, \ldots, X_n) = 2\,\mathrm{var}(1_{\{t \le Y_i\}} \mid X_i).
\]

This gives
\[
E(1_{\{t \le Y_i\}} - 1_{\{t \le Y_{N(i)}\}})^2 \approx 2\,E[\mathrm{var}(1_{\{t \le Y\}} \mid X)] = 2\,\mathrm{var}(1_{\{t \le Y\}}) - 2\,\mathrm{var}(E(1_{\{t \le Y\}} \mid X)).
\]

Combining this with (6), we get
\[
E|F(Y_i) - F(Y_{N(i)})| \approx 2 \int \bigl[\mathrm{var}(1_{\{t \le Y\}}) - \mathrm{var}(E(1_{\{t \le Y\}} \mid X))\bigr] \, d\mu(t).
\]

But note that var(1{t ≤ Y}) = F(t)(1 − F(t)), and F(Y) ~ Uniform[0, 1]. Thus,
\[
\int \mathrm{var}(1_{\{t \le Y\}}) \, d\mu(t) = \int F(t)(1 - F(t)) \, d\mu(t) = \int_0^1 x(1 - x)\, dx = \frac{1}{6}.
\]

Therefore, by (5),
\[
E(\xi_n(X, Y)) \to 6 \int \mathrm{var}(E(1_{\{t \le Y\}} \mid X)) \, d\mu(t) = \xi(X, Y),
\]
where the last identity holds because ∫ var(1{t ≤ Y}) dμ(t) = 1/6, as shown above. This establishes the convergence of E(ξn(X,Y)) to ξ(X,Y). Concentration inequalities are then used to show that ξn(X,Y) − E(ξn(X,Y)) → 0 almost surely.

Supplementary Materials

The supplementary material consists of a single pdf file containing the proofs of Theorems 1.1, 2.2 and 2.3. (Theorem 2.1 is a special case of Theorem 2.2, so it does not have a separate proof.)


Acknowledgments

I thank Mona Azadkia, Peter Bickel, Holger Dette, Mathias Drton, Lihua Lei, Bodhisattva Sen, Rik Sen, and Steve Stigler for a number of useful comments and references. I am especially grateful to Persi Diaconis and Susan Holmes for many suggestions that greatly improved the article. Lastly, I thank the anonymous referees for a number of suggestions that helped improve the presentation.

Additional information

Funding

Research partially supported by NSF grant DMS-1855484.

References
