A new exact p-value approach for testing variance homogeneity

Juan Wang, Xinmin Li & Hua Liang
Pages 81-86 | Received 29 Feb 2020, Accepted 20 Mar 2021, Published online: 22 Apr 2021

Abstract

To test variance homogeneity, various likelihood-ratio based tests such as Bartlett's test have been proposed. The null distributions of these tests were generally derived asymptotically or approximately. We re-examine the restrictive maximum likelihood ratio (RELR) statistic and suggest a Monte Carlo algorithm to compute its exact null distribution, and hence its p-value. The algorithm is much easier to implement than most existing methods. Simulation studies indicate that the proposed procedure is also superior to its competitors in terms of type I error and power. We analyse an environmental dataset as an illustration.

1. Introduction

Homogeneity of variances among populations or factor levels plays a fundamental role in analysis of variance (ANOVA) and many other statistical approaches. For example, ANOVA inferences are generally only slightly affected by unequal variances if the model contains only fixed factors and has equal or almost equal sample sizes. On the other hand, inference based on ANOVA models with random effects or unequal sample sizes can be substantially affected by unequal variances. Bartlett (1937) developed a modified likelihood-ratio test and derived its asymptotic distribution, which controls the type I error under the normality assumption. However, its performance in small samples is not attractive, as pointed out by Bishop and Nair (1939) and Hartley (1940). Since then, various efforts have been made to improve Bartlett's test. Representative work includes Boos and Brownie (1989), Box (1953), Brown and Forsythe (1974), Cochran (1951), Hartley (1950), Levene (1960) and Pardo et al. (1997). More recently, it has been recognized that variability itself can be a major issue. For instance, Teschendorff and Widschwendter (2012) argued that in cancer genomics, differential variability can be as important as differential means for predicting disease phenotypes, which indicates that understanding heterogeneity can be crucial.

Because the usual critical values come from a chi-squared approximation, many variants based on large-sample or numerical approximations have been proposed. These tests generally work well in large samples, but they are not exact tests in the frequentist sense.

To obtain an exact (or nearly exact) test of homogeneity of variances under the normal distribution, further efforts have been made in several directions. For example, Wu and Wong (2003) provided a critical value approximation through the saddle point approximation. Bhandary and Dai (2009) proposed a test (BDT) based on a Bonferroni-type adjustment of the ordered p-values. Liu and Xu (2010) proposed a generalized p-value test (GPT) by employing generalized inference (Tian, 2005, 2007; Weerahandi, 2004). Ma et al. (2015) suggested an adjusted Bartlett's test (ABT) on the basis of the equal mean principle. Gokpinar and Gokpinar (2017) re-examined the computational approach test (CAT), which was originally introduced by Pal et al. (2007). Each of these methods has its own merits under certain favourable circumstances. Gokpinar and Gokpinar (2017) compared four tests, Bartlett's test (BAR), BDT, GPT and CAT, in terms of type I error rate and power. They concluded that CAT appears to be more powerful than the other three tests when the group size is small or moderate, and further confirmed that BAR could fail to maintain the type I error rate and could be conservative in small samples.

In this paper, we develop a practically useful procedure to calculate the null distribution, and hence the p-value, of the restrictive maximum likelihood-ratio (RELR) statistic. In large samples the procedure shares the statistical properties of the Bartlett-type tests mentioned above. Its small-sample performance is attractive and superior to that of its competitors in most situations. Most importantly, it is easy to implement and computationally inexpensive.

The paper is organized as follows. Section 2 briefly describes the framework and introduces Bartlett's test. In Section 3, we re-examine the RELR statistic and suggest a Monte Carlo algorithm for computing its p-value. Section 4 presents simulation results that evaluate the small-sample performance of the proposed test and compare it with some existing methods. In Section 5, we analyse a real dataset with the six tests to illustrate the utility of the proposed test, and we conclude the paper with a discussion in Section 6.

2. Framework and Bartlett test

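Let $\{X_{i1}, X_{i2}, \ldots, X_{in_i};\ i = 1, \ldots, k\}$ be $k$ groups of independent random samples from the normal populations $N(\mu_i, \sigma_i^2)$ for $i = 1, \ldots, k$. The test of variance homogeneity can be formulated as

$$H_0: \sigma_1^2 = \cdots = \sigma_k^2 \quad \text{vs.} \quad H_1: \sigma_i^2 \neq \sigma_j^2 \ \text{for some } i \neq j. \tag{1}$$

Let $\bar{X}_i = \sum_{j=1}^{n_i} X_{ij} / n_i$ and $S_i^2 = \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 / (n_i - 1)$ be the sample mean and variance of the $i$th population, $i = 1, \ldots, k$, and let $N = \sum_{i=1}^{k} n_i$ be the total sample size. It is well known that the restrictive maximum likelihood-ratio (RELR) test statistic for the hypothesis (1) is

$$T_n = (N - k) \log\left\{ \frac{\sum_{i=1}^{k} (n_i - 1) S_i^2}{N - k} \right\} - \sum_{i=1}^{k} (n_i - 1) \log S_i^2, \tag{2}$$

and the p-value of the test is given by $p = P_{H_0}\{T_n \ge t\}$, where

$$t = (N - k) \log\left\{ \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{N - k} \right\} - \sum_{i=1}^{k} (n_i - 1) \log s_i^2 \tag{3}$$

with $s_i^2$ being the observed value of $S_i^2$ based on the data. Since it is generally impossible to derive the exact distribution of $T_n$, Bartlett (1937) modified $T_n$ to

$$T_{n,B} = \left\{ 1 + \frac{1}{3(k - 1)} \left( \sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{N - k} \right) \right\}^{-1} T_n,$$

and showed that $T_{n,B}$ is asymptotically chi-squared with $k - 1$ degrees of freedom as $\min_i n_i \to \infty$. This approximation is not necessary when $k = 2$, because the corresponding RELR statistic depends on the data only through the F ratio $S_1^2 / S_2^2$, whose null distribution is known exactly. Consequently, the null hypothesis is rejected if $T_{n,B} > \chi^2_{k-1,\,1-\alpha}$ at the significance level $\alpha$.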
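As a minimal sketch of how the statistic in (2) and Bartlett's corrected version can be evaluated from group summaries, the Python code below uses only the standard library. The function names `relr_statistic` and `bartlett_statistic` and the summary data are illustrative assumptions, not part of the paper.

```python
import math

def relr_statistic(variances, sizes):
    """T_n in (2): (N-k) * log(pooled variance) - sum_i (n_i - 1) * log(S_i^2)."""
    dfs = [n - 1 for n in sizes]
    total_df = sum(dfs)  # equals N - k
    pooled = sum(d * s2 for d, s2 in zip(dfs, variances)) / total_df
    return total_df * math.log(pooled) - sum(
        d * math.log(s2) for d, s2 in zip(dfs, variances)
    )

def bartlett_statistic(variances, sizes):
    """T_{n,B}: T_n divided by Bartlett's correction factor."""
    k = len(sizes)
    dfs = [n - 1 for n in sizes]
    total_df = sum(dfs)
    correction = 1.0 + (sum(1.0 / d for d in dfs) - 1.0 / total_df) / (3.0 * (k - 1))
    return relr_statistic(variances, sizes) / correction

# Hypothetical summaries: three groups with these sample variances and sizes.
s2 = [1.0, 2.0, 4.0]
n = [5, 6, 7]
t_n = relr_statistic(s2, n)
t_nb = bartlett_statistic(s2, n)
```

Both statistics are zero when all sample variances coincide and are unchanged if every variance is multiplied by the same constant, which is a quick sanity check on any implementation.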

3. The proposed procedure

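Under the null hypothesis $H_0$, let $\sigma_1^2 = \cdots = \sigma_k^2 = \sigma_0^2$, and note that the RELR statistic given in (2) can be expressed as

$$T_n = (N - k) \log\left\{ \frac{\sum_{i=1}^{k} (n_i - 1) S_i^2 / \sigma_0^2}{N - k} \right\} - \sum_{i=1}^{k} (n_i - 1) \log\left( S_i^2 / \sigma_0^2 \right).$$

This expression motivates us to introduce a new quantity

$$T_{n,\mathrm{NEW}} = (N - k) \log\left\{ \frac{\sum_{i=1}^{k} (n_i - 1) S_i^2 / \sigma_i^2}{N - k} \right\} - \sum_{i=1}^{k} (n_i - 1) \log\left( S_i^2 / \sigma_i^2 \right).$$

$T_{n,\mathrm{NEW}}$ is no longer a statistic because it contains the parameters $\sigma_i^2$. However, the quantities $(n_i - 1) S_i^2 / \sigma_i^2$ are independent chi-squared variables with $n_i - 1$ degrees of freedom, $i = 1, \ldots, k$. Writing $R_i = (n_i - 1) S_i^2 / \sigma_i^2$, we can rewrite $T_{n,\mathrm{NEW}}$ as

$$T_{n,\mathrm{NEW}} = (N - k) \log \frac{\sum_{i=1}^{k} R_i}{N - k} - \sum_{i=1}^{k} (n_i - 1) \log \frac{R_i}{n_i - 1}, \tag{4}$$

whose distribution does not depend on the unknown $\sigma_i^2$. Therefore, we may derive the distribution of $T_{n,\mathrm{NEW}}$, which under $H_0$ is the distribution of $T_n$. Consequently, we can calculate the p-value of the test (1) as $p = P_{H_0}\{T_{n,\mathrm{NEW}} \ge t\}$. Hence, the power function of the test could be given by $p(t) = P\{T_{n,\mathrm{NEW}} \ge t\}$. Because it may not be easy to derive the distribution of $T_{n,\mathrm{NEW}}$ analytically, we instead calculate the p-value by Monte Carlo simulation: we simulate $T_{n,\mathrm{NEW}}$ from independent chi-squared draws $R_i$ and take the proportion of simulated values that are at least $t$.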
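The Monte Carlo calculation of the p-value can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' own implementation: the function names `t_new` and `relr_pvalue`, the default `m=5000`, and the seed handling are hypothetical, and the chi-squared draws are generated with the standard-library identity that a chi-squared variable with $d$ degrees of freedom is a Gamma variable with shape $d/2$ and scale 2.

```python
import math
import random

def t_new(r, dfs):
    """Evaluate (4) at R_i = r[i], where dfs[i] = n_i - 1."""
    total_df = sum(dfs)  # equals N - k
    return (total_df * math.log(sum(r) / total_df)
            - sum(d * math.log(ri / d) for ri, d in zip(r, dfs)))

def relr_pvalue(variances, sizes, m=5000, seed=None):
    """Monte Carlo estimate of P{T_{n,NEW} >= t} from m simulated draws."""
    rng = random.Random(seed)
    dfs = [n - 1 for n in sizes]
    # Observed t in (3): plugging R_i = (n_i - 1) * s_i^2 into (4) gives (3).
    t_obs = t_new([d * s2 for d, s2 in zip(dfs, variances)], dfs)
    exceed = 0
    for _ in range(m):
        # chi-squared(d) draw as Gamma(shape=d/2, scale=2)
        r = [rng.gammavariate(d / 2.0, 2.0) for d in dfs]
        if t_new(r, dfs) >= t_obs:
            exceed += 1
    return exceed / m

# Hypothetical summaries: equal versus strongly unequal sample variances.
p_equal = relr_pvalue([1.0, 1.0, 1.0], [5, 6, 7], m=2000, seed=1)
p_unequal = relr_pvalue([1.0, 1.0, 1000.0], [20, 20, 20], m=2000, seed=1)
```

Only the group sizes and sample variances enter the calculation, so the raw observations are not needed once those summaries are available.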

4. Simulation studies

In this section, we report simulation results to evaluate the performance of the proposed testing procedure. For comparison, we examine the following tests: Bartlett's test (BAR; Bartlett, 1937), the adjusted Bartlett's test (ABT; Ma et al., 2015), the generalized p-value test (GPT; Liu & Xu, 2010), Bhandary and Dai's test (BDT; Bhandary & Dai, 2009), the computational approach test (CAT; Pal et al., 2007), and the RELR test. The methods are compared in terms of their type I error and power.

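In what follows, we set $\mu_i = 0$, $i = 1, \ldots, k$, and denote $\sigma^2 = (\sigma_1^2, \ldots, \sigma_k^2)$. The notation $\langle a; r \rangle$ stands for the vector in which $a$ is replicated $r$ times, and $\langle a; r \rangle_K$ means keeping the first $K$ elements of $\langle a; r \rangle$ when it contains more than $K$ elements. For example, $\langle (2, 5, 10); 3 \rangle = (2, 5, 10, 2, 5, 10, 2, 5, 10)$, $\langle 6; 4 \rangle = (6, 6, 6, 6)$, and $\langle (2, 5, 10); 3 \rangle_7 = (2, 5, 10, 2, 5, 10, 2)$.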
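The replicate-and-truncate notation can be mirrored by a small helper, shown here as an illustrative sketch. The name `replicate` and its `keep` argument are hypothetical and introduced only for this example.

```python
def replicate(a, r, keep=None):
    """Repeat the sequence a r times; if keep is given, keep only the first keep elements."""
    out = list(a) * r
    return out if keep is None else out[:keep]

# The three examples from the text.
ex1 = replicate((2, 5, 10), 3)
ex2 = replicate([6], 4)
ex3 = replicate((2, 5, 10), 3, keep=7)
```

A scalar such as 6 is passed as a one-element list so that the same function covers both cases.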

To examine the performance of these tests, the parameter settings of the simulation studies are as follows: (1) the number of groups $k$ equals 2, 5, 10, 15, 30, 50; (2) different combinations of the number of groups $k$ and the sample sizes $n$ are given in the first two columns of Table 1; (3) we set $\sigma_i^2 \equiv 1$ for $i = 1, \ldots, k$ when calculating the type I errors, and consider the various degrees of variance heterogeneity listed in the following box for the power comparison.

Table 1. Simulated type I errors.

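(a1) k = 2, $\sigma^2 = (1, 3)$
(a2) k = 2, $\sigma^2 = (1, 5)$
(b1) k = 5, $\sigma^2 = (0.5, 1.25, 2, 2.75, 3.5)$
(b2) k = 5, $\sigma^2 = (1, 4, 6, 8, 10)$
(c1) k = 10, $\sigma^2 = \langle (0.5, 1.5, 3, 4.5, 6); 2 \rangle$
(c2) k = 10, $\sigma^2 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)$
(d1) k = 15, $\sigma^2 = \langle (0.5, 1.5, 3, 4.5, 3, 1.5); 3 \rangle_{15}$
(d2) k = 15, $\sigma^2 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)$
(e1) k = 30, $\sigma^2 = \langle (1, 3, 6, 9, 12); 6 \rangle$
(e2) k = 30, $\sigma^2 = \langle (1, 2, 4, 6, 4, 2, 1); 5 \rangle_{30}$
(f1) k = 50, $\sigma^2 = \langle (1, 2, 4, 6, 4, 2, 1); 8 \rangle_{50}$
(f2) k = 50, $\sigma^2 = \langle (1, 3, 6, 9, 12); 10 \rangle$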

For each pattern and parameter setting, we generated N = 5000 simulated data sets. For each simulated data set, and for the real data set in the next section, we used M = 5000 Monte Carlo draws to obtain the p-values of the GPT, CAT and RELR tests. The empirical type I error or power is the proportion of rejections of the null hypothesis among the N = 5000 simulation runs. We used the nominal significance level $\alpha = 0.05$ in our simulation studies.
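The empirical rejection rate of the RELR test can be sketched along the following lines. This is an illustration under stated assumptions, not the authors' simulation code: the function names, the reduced defaults `n_rep=400` and `m=2000`, and the example configurations are hypothetical. As a shortcut that departs from the description above, the sketch simulates the null reference sample once per sample-size configuration and reuses it for every data set. This is possible because the distribution of $T_{n,\mathrm{NEW}}$ depends only on the group sizes.

```python
import bisect
import math
import random

def relr_from_r(r, dfs):
    """Formula (4), evaluated at R_i = r[i] with dfs[i] = n_i - 1."""
    total_df = sum(dfs)  # equals N - k
    return (total_df * math.log(sum(r) / total_df)
            - sum(d * math.log(ri / d) for ri, d in zip(r, dfs)))

def sample_variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / (len(x) - 1)

def rejection_rate(sigma2, sizes, n_rep=400, m=2000, alpha=0.05, seed=0):
    """Share of simulated normal data sets on which the RELR test rejects."""
    rng = random.Random(seed)
    dfs = [n - 1 for n in sizes]
    # Sorted null reference sample of T_{n,NEW}; reused for all data sets.
    reference = sorted(
        relr_from_r([rng.gammavariate(d / 2.0, 2.0) for d in dfs], dfs)
        for _ in range(m)
    )
    rejections = 0
    for _ in range(n_rep):
        # One data set: group i has n_i normal draws with mean 0, variance sigma2[i].
        s2 = [
            sample_variance([rng.gauss(0.0, math.sqrt(v)) for _ in range(n)])
            for v, n in zip(sigma2, sizes)
        ]
        t_obs = relr_from_r([d * s for d, s in zip(dfs, s2)], dfs)
        # Monte Carlo p-value: proportion of reference values >= t_obs.
        p_value = (m - bisect.bisect_left(reference, t_obs)) / m
        if p_value < alpha:
            rejections += 1
    return rejections / n_rep

# Hypothetical configurations: equal variances (size) and one inflated variance (power).
size = rejection_rate([1.0, 1.0, 1.0], [10, 10, 10])
power = rejection_rate([1.0, 1.0, 25.0], [10, 10, 10])
```

Sorting the reference sample once and locating each observed value with `bisect` keeps the cost of each p-value logarithmic in `m`.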

Table 1 reports the type I errors of the six tests under different parameter configurations. As can be seen, Bartlett's test is generally conservative in all configurations, while CAT often fails to control the type I error rate. The type I errors of the other competitors are generally smaller than that of CAT but larger than that of BAR, except when the type I errors are all close to the nominal level. These numerical results indicate that ABT, GPT, BDT and RELR have good type I error control in almost all situations.

Figures 1 and 2 present the powers of the six tests for the 12 situations, (a1) to (f2), against the cases specified in the third column of Table 1. From the results we can conclude that when the group size k is small ($\le 5$), all tests yield a similar power pattern, which indicates that their performance is very close. When the group size k increases to a moderate size such as 10 or 15, the powers of BAR, ABT and RELR still show a similar pattern and are higher than those of GPT, BDT and CAT. This indicates that BAR, ABT and RELR are superior to GPT, BDT and CAT. This superiority becomes more distinct when the group size k increases to 30 or 50. For example, when k = 50, $\sigma^2 = \langle (1, 3, 6, 9, 12); 10 \rangle$ and $n = \langle (3, 4, 5, 4, 3); 10 \rangle$, the powers of CAT, BDT and GPT are 0.70, 0.517 and 0.309, respectively, while the powers of BAR, ABT and RELR are 0.827, 0.830 and 0.844, respectively (corresponding to case 4 of (f2) in Figure 2). Overall, RELR effectively controls the type I error, and its power is at least as high as that of the other five tests for almost all configurations.

Figure 1. Simulation results for Settings (k = 2, 5, 10) corresponding to various cases. BAR: Bartlett's test (red line with filled square); ABT: Ma et al.'s test; GPT: Liu & Xu's test; BDT: Bhandary & Dai's test; CAT: Gokpinar & Gokpinar's test; and LRT: the proposed RELR test.

Figure 2. Simulation results for Settings (k = 15, 30, 50) corresponding to various cases. The caption is the same as in Figure 1.

5. Real data example

In this section, we analyse the dataset of detrended particulate matter (pm10) in Maryland in 1990, using the six tests to investigate the seasonal effect on pm10 variability. After removing missing observations, we have 88, 88, 97 and 74 observations for Spring, Summer, Fall and Winter, respectively. Let $\sigma_i^2$, $i = 1, 2, 3, 4$, be the corresponding variances. The question of interest can then be formulated as the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2$.

We compute the p-values of the six tests with M = 10000. The p-values for BAR, ABT, GPT, CAT and RELR are $p_{\mathrm{BAR}} = 0.0505$, $p_{\mathrm{ABT}} = 0.0505$, $p_{\mathrm{GPT}} = 0.0552$, $p_{\mathrm{CAT}} = 0.0567$ and $p_{\mathrm{RELR}} = 0.0495$, and BDT indicates that we fail to reject the null hypothesis at the 5% significance level. Thus all tests except the proposed RELR fail to reject the null hypothesis at that level, and only RELR suggests a rejection, although the p-values differ only slightly. Recalling our simulation results, we prefer the result based on RELR and conclude that the variances among the four seasons are not homogeneous.

6. Concluding remarks

In this paper, we have proposed a procedure for calculating the p-value of the restrictive likelihood ratio test for variance homogeneity. The procedure is very easy to implement and shows promising performance. Given the optimality of the likelihood ratio principle, we conjecture that the test could be the most efficient one, which warrants further investigation. This paper provides a means to calculate the p-value when it is difficult, if not impossible, to derive the (asymptotic) distribution of the test statistic. However, there is no general guideline for reformulating $T_n$ in (2), so a quantity similar to $T_{n,\mathrm{NEW}}$ may have to be derived case by case. Whether the proposed procedure can be applied to high-dimensional situations (in the sense that the number of groups diverges with the sample size) is unclear and also warrants further research.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The research of Li was supported by Grant 11871294 from National Natural Science Foundation of China. Liang's research was partially supported by NSF grant DMS-1620898.

Notes on contributors

Juan Wang

Juan Wang is an assistant professor of School of Mathematics and Statistics at Qingdao University.

Xinmin Li

Xinmin Li is a professor of School of Mathematics and Statistics at Qingdao University.

Hua Liang

Hua Liang is a professor of Department of Statistics at George Washington University.

References