Research Article

Testing homogeneity of effect sizes in pooling 2x2 contingency tables from multiple studies: a comparison of methods

Article: 1478698 | Received 03 Jan 2018, Accepted 13 May 2018, Published online: 14 Jun 2018

Abstract

Meta-analysis is a statistical methodology that combines the outcomes of several independent studies. Either a fixed-effects or a random-effects model is used to combine the results of the independent studies; the choice depends on whether the outcomes can be considered homogeneous or not. Several methods have been developed to test the homogeneity of effect sizes. Although many of them have already been compared, they have not been studied in more realistic practical settings, where individuals each have their own risk of an event, and a complete overview is lacking. In this article, we investigate the performance of 10 statistical methods to test homogeneity of the effect of a binary exposure on a binary clinical outcome, using an extensive simulation study and a real-life meta-analysis. We evaluated the Type I error and the statistical power. The fixed-effects regression model for treatment and study interaction was found to be a slightly liberal test, while the Q-statistic and the Bliss statistic can be considered conservative tests. The random-effects regression model, the Peto statistic and the I² statistic perform rather poorly compared to the other methods. All chi-square tests that are based on the calculation of a specific odds ratio and the fixed-effects logistic regression analysis perform best. Among these chi-square tests, we recommend the Breslow–Day test, but only for convenience, since it is available in most statistical software packages and the other tests are not.

PUBLIC INTEREST STATEMENT

Testing homogeneity of effect sizes (treatment effects, risk factor associations) is an essential part of meta-analysis, since it helps us interpret the consistency of these effects across groups of participants. The testing can be done with various statistical tools, which can lead to different conclusions. For this reason we thought it important to study the performance of the existing statistical tests for homogeneity. Specific comparisons have been made before, but we also explored the performance of statistical tests that have never been applied to this specific problem, and adjusted existing statistical tests. We studied the performance of 10 different methods to test the homogeneity of effect sizes using a case study and an extensive simulation study. Based on our study, we recommend one specific statistical test for meta-analysts to use when testing homogeneity, since it outperformed the others in most of the settings we studied.

1. Introduction

Pooling analysis results (such as odds ratios) from multiple studies is common practice in the medical, social and biological sciences. It is the main goal in an aggregate data meta-analysis or in an experimental study where some kind of stratification is applied (such as multi-center randomized controlled trials). Researchers would typically be interested mainly in the overall or pooled estimate of a binary exposure or treatment effect, regardless of study differences, since it may better represent an estimate for the full target population than the individual studies. For a binary outcome, this means pooling 2 × 2 contingency tables from several studies. A traditional approach for pooling 2 × 2 contingency tables is the Cochran–Mantel–Haenszel statistic (Breslow, Citation1994; Breslow & Day, Citation1980; Mantel & Haenszel, Citation1959). However, this approach requires that the odds ratios across studies are homogeneous. This emphasizes the need for a test of homogeneity of the odds ratios or, more generally, of the homogeneity of treatment effects on the binary clinical outcome. This topic is not new and many different approaches have been developed.

A classic test for homogeneity of effect sizes is the Q-statistic (Cochran, Citation1954). It creates a weighted effect size using the squared inverse within-study standard error and evaluates whether the variation around this pooled effect is larger than can be expected from the within-study standard errors. It was initially developed for pooling potency estimates from bioassay experiments (Mzolo, Hendriks, & van den Heuvel, Citation2013). A similar approach used in this area is the method of Bliss (Citation1952), which uses the Q-statistic but in a different way. Related measures used in systematic reviews are the H², I² and R² statistics (Higgins & Thompson, Citation2002). The I² statistic is also fully determined by the Q-statistic; it was initially developed to quantify the amount of heterogeneity rather than to test for homogeneity. The H² statistic is just an alternative measure for I², while the R² statistic describes the inflation from the random effects to the fixed effects, and it is viewed more as a measure than as a test statistic. However, all three measures of heterogeneity can be used as test statistics with appropriate confidence intervals.

Alternatively, fixed- and random-effects logistic regression models can be applied to test homogeneity. This is done by including an interaction term between study and treatment, and then applying either a Wald test, a score test, or the likelihood ratio test for its significance (Agresti, Citation2002).

A method that studies the heterogeneity of the odds ratio directly is the Breslow–Day test adjusted by Tarone (Citation1985), which is a form of Pearson's chi-square statistic. Other chi-square test statistics, developed for specific situations, are the Peto statistic (Yusuf, Peto, Lewis, Collins, & Sleight, Citation1986), the Liang and Self statistics (Liang & Self, Citation1989) and the Zelen statistic (Zelen, Citation1971). Yusuf et al. (Citation1986) originally presented the Peto method for pooling odds ratios and accompanied it with a homogeneity test. Liang and Self (Citation1989) focused on developing test statistics for sparse data, i.e. when some cells in the study-specific contingency tables contain only a few participants. Zelen developed his statistic to test the homogeneity of relative risks (Zelen, Citation1971).

Several of these statistics have been discussed, evaluated and compared on their performance in detecting heterogeneity but, as Sutton and Higgins (Citation2008) point out, there is no overview of the results. The Q-statistic has been shown to have low statistical power when the number of studies is small (Higgins & Thompson, Citation2002; Higgins, Thompson, Deeks, & Altman, Citation2003; Huedo-Medina, Sánchez-Meca, Marín-Martínez, & Botella, Citation2006) and to be conservative for a large number of studies (Higgins & Thompson, Citation2002). Moreover, the application of the Q-statistic to test the homogeneity of treatment effects has recently come under criticism (Hoaglin, Citation2016). Hoaglin (Citation2016) points out that Cochran did not use the Q-statistic itself as a test statistic and that the Q-statistic only approaches a chi-square distribution under the null hypothesis asymptotically as the within-study sample size increases (since the individual terms then become approximately normally distributed). In addition, the assumptions that the effect sizes are normally distributed and that the within-study variances are known do not apply to many meta-analyses. This is especially the case when the effect sizes are risk differences, the logarithm of the risk differences or the logarithm of the odds ratios. Hoaglin's concerns apply directly to the widely used I² as well (Hoaglin, Citation2016). Jackson (Citation2006) derived a formula to calculate the power of the homogeneity test when using the Q-statistic. The author only considered the case where the number of studies is odd and the studies have the same within-study variance, and concluded that the homogeneity test based on the Q-statistic indeed has low power.
Kulinskaya, Dollinger and Bjørkestøl (Citation2011) improved the performance of the Q-statistic using two approximations, namely a Gamma distribution based on the expanded first two moments of the Q-statistic and a chi-square distribution with fractional degrees of freedom matching the first moment, and recommended the latter approximation (Kulinskaya et al., Citation2011). Sánchez-Meca and Marín-Martínez (Citation1997) compared the performance of the Q-statistic with the Schmidt and Hunter rule for homogeneity on a number of continuous outcomes. The outcomes investigated were the standardized mean difference, the point-biserial correlation coefficient and the standardized correlation coefficient. The Schmidt and Hunter rule rejects the homogeneity of treatment effects if the sampling error variance is less than a specific threshold of the total variance (e.g. 75% or 90%). Sánchez-Meca and Marín-Martínez concluded that all tests had low statistical power; the Q-statistic resulted in nominal Type I errors, while the Schmidt and Hunter rule overestimated the Type I errors (Sánchez-Meca & Marín-Martínez, Citation1997). In a Monte Carlo study, Jones, O'Gorman, Lemke and Woolson (Citation1989) examined the performances of the following tests: the likelihood ratio test, Pearson's chi-square test, the Breslow–Day test (Jones et al., Citation1989) and Tarone's adjustment to it (Tarone, Citation1985), a conditional score test, and Liang and Self's normal approximation to the score test (Liang & Self, Citation1989). They found that the likelihood ratio test was too liberal. They recommended using the Breslow–Day statistic when the number of studies is large, and Liang and Self's normal approximation to the score test when the data are sparse, i.e. when there are few participants in some studies. Similar results were reached by Paul and Donner (Citation1992). Jones et al. (Citation1989) concluded that all tests considered had low power when the number of studies is small. This was supported by Gavaghan, Moore, and McQuay (Citation2000), who examined the Peto statistic (Yusuf et al., Citation1986), the Woolf statistic (Woolf, Citation1955), the Q-statistic (applied to estimates of the risk difference) (DerSimonian & Laird, Citation1986), Liang and Self's normal approximation to the score test, and the Breslow–Day test. These tests were also found to have incorrect Type I error rates for truly homogeneous sets. The authors concluded that the Q-statistic was too liberal and the Woolf statistic conservative, and they recommended the Breslow–Day test for routine use (Gavaghan et al., Citation2000). Recently, the Cochrane Reviews have started including I² in their reviews (Breslow, Citation1994). However, the I² statistic has been shown to be conservative and to have low power for continuous outcomes when the number of studies is small (Huedo-Medina et al., Citation2006). Bagheri, Ayatollahi and Jafari (Citation2011) compared the performance of the random-effects logistic regression model to Cochran's Q-statistic and the Breslow–Day test, and found that the Breslow–Day test had the highest power. The authors nevertheless recommended the random-effects logistic regression model, citing practical advantages, although these models are known for their computational difficulties (Wolfinger & O'Connell, Citation1993).

This overview demonstrates that a lot of work has already been done. However, some measures have never been investigated (e.g. the Bliss statistic), and not all measures have been investigated in all settings. More importantly, all studies assumed that the binomial distribution for the treatment outcomes in the stratified contingency tables is satisfied and used this in their simulations, while in practical situations this assumption would typically be violated. The aggregate data in a contingency table would be generated by individuals who all have a different probability of obtaining the clinical outcome, due to other variables like age and gender. It is unclear whether the above-mentioned statistical methods remain robust in more realistic practical settings. We use an extensive simulation study that tries to account for hidden variations that can occur in practice, and we study the performance of 10 methods, including statistical methods that have not been applied before for this purpose.

This article is organized as follows. In Section 2 the statistical approaches for testing the homogeneity of odds ratios are presented, assuming no individual-specific risk. Section 3 describes the simulation model, in which individual risk scores are simulated. In Section 4 we present the simulation results and a case study on tuberculosis to illustrate the methods. The final section provides a discussion of the results, recommendations for use of the methods and possible further research.

2. Statistical models

Before we introduce the methods, we first introduce the notation used throughout this section. The binary exposure or treatment will be referred to simply as treatment, and the clinical outcome is considered binary. Let Xhi be the number of successes for treatment h (= 0, 1) in study i (= 1, 2, ..., m) out of nhi trials. The control group is indexed by h = 0 and the treatment group by h = 1. The random variables Xh1, Xh2, ..., Xhm are assumed to be mutually independent (independent studies). For the methods that test homogeneity of effect sizes, it is assumed that the expected value of Xhi equals E[Xhi] = nhi·phi, with phi the probability of a success for treatment h in study i. All methods for testing homogeneity assume that the proportions phi satisfy the form phi = F(αi + (β + γi)·thi), with F some kind of function (typically logistic), αi an intercept for study i, β the (mean) treatment effect, thi a treatment indicator variable for study i with value 1 when h = 1 and 0 otherwise, and γi an interaction effect for study i. Homogeneity then means that

(1) H0: γ1 = γ2 = ⋯ = γm = 0

when the parameters γi are considered fixed, or

(2) H0: var(γi) = 0

when they are being assumed random.

Note that in practice the analysis of meta-analysis could also include one or more available covariates, e.g. average age per study or percentage males or females. In this setting, however, we assume that the only available information is the number of successes and the number of units in the treatment and control groups. This way the statistical methods discussed below can be applied to data with the least available information. Many of these methods can be easily generalized to accommodate extra covariates.

2.1. Logistic regression approaches

For the fixed-effects logistic regression analysis, the function F is the distribution function F(x) = exp(x)/(1 + exp(x)). We also need to impose a constraint to obtain identifiability of the model; a natural choice is to take γ1 = 0 or γm = 0. Additionally, logistic regression implies that Xhi is binomially distributed and that X0i and X1i are independent (this is often not true due to different risks for individuals, but without additional information Xhi is assumed binomial).

To test (1) we apply the likelihood ratio test. The likelihood of the full model for F being the logistic distribution function is

l(αi, β, γi) = ∑_{i=1}^m [Xi·αi + X1i·βi + n0i·log(1 − F(αi)) + n1i·log(1 − F(αi + βi))]

with Xi = X0i + X1i being the total number of events and βi = β + γi. The maximum likelihood estimates are denoted by α̂iF, β̂F and γ̂iF for the full model, and by α̂i0 and β̂0 under the null hypothesis. The likelihood ratio test is now defined by

LRT = −2·[l(α̂i0, β̂0, 0) − l(α̂iF, β̂F, γ̂iF)]

The LRT has an approximate χ² distribution with m − 1 degrees of freedom [28]. The LRT statistic can be determined with standard software packages (we applied the GENMOD procedure of SAS).

Instead of a fixed-effects model, the interaction effects γi can be taken random for the logistic link function. Under this assumption, it is common to also assume that the intercepts αi are random. Thus, we will assume that (αi, γi)ᵀ is bivariate normal with mean (α, 0)ᵀ and variance–covariance matrix Σ given by

Σ = [σC², ρσCσCT; ρσCσCT, σCT²],

with σC² the variance of the random intercepts, σCT² the variance of the random interaction term, and ρ the correlation between αi and γi.

The number of events Xhi, conditionally on αi and γi, is considered binomial with parameters nhi and phi = F(αi + (β + γi)·thi), and X0i and X1i are independent given αi and γi. The log likelihood of the full model can be written as:

l(α, β, Σ) = ∑_{i=1}^m log ∫∫_{R²} ∏_{h=0}^{1} (nhi choose Xhi) F(z0 + z1·thi)^{Xhi} [1 − F(z0 + z1·thi)]^{nhi − Xhi} φ2(z0 − α, z1 − β | Σ) dz0 dz1

with φ2(z0, z1 | Σ) the bivariate normal density given by

φ2(z0, z1 | Σ) = [2π·σC·σCT·√(1 − ρ²)]^{−1} · exp{ −[2(1 − ρ²)]^{−1} · [z0²/σC² − 2ρ·z0·z1/(σC·σCT) + z1²/σCT²] }

The homogeneity hypothesis in (2) can again be evaluated using a likelihood ratio test. The LRT becomes

LRT = −2·[l(α̂R0, β̂R0, Σ̂R0) − l(α̂RF, β̂RF, Σ̂RF)]

with αˆRF,βˆRF and ΣˆRF being the maximum likelihood estimators under the full model and αˆR0,βˆR0 and ΣˆR0 being the maximum likelihood estimators under the null hypothesis. Note that the variance–covariance matrix Σ becomes

Σ = [σC², 0; 0, 0]

when the null hypothesis would be true. The approximate asymptotic distribution of the LRT is a mixture of a chi-square distribution with two degrees of freedom and a degenerate distribution in zero. This means that we should compare the LRT statistic with a chi-square value at significance level 2α (McCulloch, Searle, & Neuhaus, Citation2008; Verbeke & Molenberghs, Citation2003). The LRT statistic can be determined with standard software packages (we applied the NLMIXED procedure of SAS).
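The mixture null distribution translates into a simple p-value computation; a minimal sketch (the LRT value below is hypothetical):

```python
from scipy.stats import chi2

lrt = 4.2  # hypothetical LRT value from the random-effects fit
# Null distribution: 0.5 * (point mass at 0) + 0.5 * chi-square with 2 df,
# so for x > 0, P(LRT > x) = 0.5 * P(chi2_2 > x)
p_value = 0.5 * chi2.sf(lrt, df=2)
# Equivalent rule: reject at level alpha when the LRT exceeds the chi-square
# critical value taken at significance level 2 * alpha
alpha = 0.05
reject = lrt > chi2.ppf(1 - 2 * alpha, df=2)
```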

2.2. Approaches based on the Q-statistic

The Q-statistic was developed by Cochran (Citation1954) and can be applied to test the homogeneity of any type of effect size. The Q-statistic is given by Q = ∑_{i=1}^m (β̂i − β̄)²/ŝei², with β̂i the estimated effect size, β̄ a weighted average, and ŝei² the estimated squared standard error of β̂i. For the log odds ratio of study i, we would have β̂i = log[X1i·(n0i − X0i)/((n1i − X1i)·X0i)]; the weighted average β̄ is given by

β̄ = [∑_{i=1}^m β̂i/ŝei²] / [∑_{i=1}^m 1/ŝei²]

with sˆei2 being given by

ŝei² = 1/X1i + 1/(n1i − X1i) + 1/X0i + 1/(n0i − X0i)

Cochran showed that the Q-statistic is approximately χ² distributed with m − 1 degrees of freedom, although he did not apply or discuss log odds ratios as effect sizes. This was done by Woolf (Citation1955).

Bliss (Citation1952) used the Q-statistic in a slightly different way to test the homogeneity hypothesis. Bliss’ test statistic is given by

B = [(m − 1) + (n̄ − 4)/((n̄ − 1)(n̄ − 2))] · Q/(n̄·(m − 1))

with n̄ = ∑_{i=1}^m (n0i + n1i − 2)/m.

Note that Bliss' method was originally developed for testing the equality of potencies obtained from multiple bioassays, similar to Cochran's setting. In such cases, n̄ is the average of the corresponding degrees of freedom over all bioassays, where the degrees of freedom per bioassay result from applying a parallel-line bioassay model, i.e. a linear regression model (Mzolo et al., Citation2013). In our case we apply a fixed-effects logistic regression per study, so two parameters are estimated per study. Hence, we have chosen the degrees of freedom per study equal to n0i + n1i − 2. Under the null hypothesis, the B-statistic may be approximated by a χ² distributed random variable with m − 1 degrees of freedom.

As mentioned in Section 1, Higgins and Thompson (Citation2002) introduced a few measures for quantifying the heterogeneity of treatment effects that are based on the Q-statistic (Sutton & Higgins, Citation2008). The authors suggested a homogeneity test based on a confidence interval for the I² statistic, given by I² = 100·(H² − 1)/H² with H² = Q/(m − 1). A confidence interval for H is given by exp[log H ± z_{α/2}·ŝe(log H)], with z_{α/2} the upper α/2 quantile of the standard normal distribution and

ŝe(log H) = (1/2)·[log(Q) − log(m − 1)]/[√(2Q) − √(2m − 3)]  if Q > m,
ŝe(log H) = √{[1/(2(m − 2))]·[1 − 1/(3(m − 2)²)]}  if Q ≤ m

This confidence interval for H can be transformed to a confidence interval for I², and the transformed interval can be used to test the hypothesis in (1) by verifying whether the value zero lies in the interval or not.
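Q, H², I² and the Higgins–Thompson interval can be computed directly from the per-study counts. A sketch in Python (the 2 × 2 tables are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Illustrative per-study counts: columns are X1, n1, X0, n0
tables = np.array([
    [15, 50, 10, 50],
    [20, 45,  8, 40],
    [11, 55, 12, 60],
], dtype=float)
X1, n1, X0, n0 = tables.T
m = len(tables)

beta = np.log(X1 * (n0 - X0) / ((n1 - X1) * X0))   # log odds ratios
se2 = 1/X1 + 1/(n1 - X1) + 1/X0 + 1/(n0 - X0)      # Woolf variances

w = 1 / se2
beta_bar = np.sum(w * beta) / np.sum(w)
Q = np.sum((beta - beta_bar) ** 2 / se2)           # Cochran's Q

H2 = Q / (m - 1)
I2 = max(0.0, 100 * (H2 - 1) / H2)

# Higgins-Thompson standard error of log H (piecewise in Q)
if Q > m:
    se_logH = 0.5 * (np.log(Q) - np.log(m - 1)) / (np.sqrt(2 * Q) - np.sqrt(2 * m - 3))
else:
    se_logH = np.sqrt(1 / (2 * (m - 2)) * (1 - 1 / (3 * (m - 2) ** 2)))

H = np.sqrt(max(H2, 1.0))                          # H is truncated at 1
z = norm.ppf(0.975)
ci_H = (np.exp(np.log(H) - z * se_logH), np.exp(np.log(H) + z * se_logH))
```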

2.3. Chi-square approaches

One of the oldest approaches for testing homogeneity of odds ratios is the Breslow–Day test (Breslow & Day, Citation1980). It uses the Cochran–Mantel–Haenszel pooled odds ratio ORCMH, calculated as

ORCMH = [∑_{i=1}^m X1i·(n0i − X0i)/ni] / [∑_{i=1}^m X0i·(n1i − X1i)/ni]

The distributional assumptions for Xhi are the same as for the fixed-effects logistic regression analysis. Now define E1i = E[X1i | Xi, ORCMH] as the expected value of X1i given Xi = X0i + X1i and ORCMH under the assumption that the null hypothesis in (1) is correct. E1i can be obtained by solving the following quadratic equation:

(3) (ORCMH − 1)·E1i² − [(Xi + n1i)·ORCMH + (n0i − Xi)]·E1i + Xi·n1i·ORCMH = 0

For testing the homogeneity, the Tarone test statistic (Tarone, Citation1985) is used. This statistic is given by

(4) ∑_{i=1}^m (X1i − E1i)²/var(X1i | Xi, ORCMH) − (X1 − E1)²/∑_{i=1}^m var(X1i | Xi, ORCMH)

with X1 = ∑_{i=1}^m X1i, E1 = ∑_{i=1}^m E1i and var(X1i | Xi, ORCMH) given by

(5) var(X1i | Xi, ORCMH) = [1/E1i + 1/(Xi − E1i) + 1/(n1i − E1i) + 1/(n0i − Xi + E1i)]^{−1}

Note that we applied the Tarone statistic instead of the Breslow–Day statistic. The Breslow–Day statistic is based on the Cochran–Mantel–Haenszel odds ratio estimator, which has been shown to be consistent but not efficient. For this reason Tarone modified the Breslow–Day statistic (Tarone, Citation1985). If (1) is true, the statistic in (4) approximately follows a χ² distribution with m − 1 degrees of freedom. The Tarone-adjusted Breslow–Day test can be carried out using standard software packages (we applied the procedure FREQ of SAS).
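Besides SAS PROC FREQ, the Tarone-adjusted Breslow–Day test is also available in Python's statsmodels; the per-study tables below are illustrative.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Illustrative per-study 2x2 tables: rows (treated, control), columns (event, no event)
tables = [
    np.array([[15, 35], [10, 40]]),
    np.array([[20, 25], [ 8, 32]]),
    np.array([[11, 44], [12, 48]]),
]

st = StratifiedTable(tables)
or_cmh = st.oddsratio_pooled                 # Cochran-Mantel-Haenszel pooled OR
res = st.test_equal_odds(adjust=True)        # adjust=True applies Tarone's correction
print(or_cmh, res.statistic, res.pvalue)     # compare statistic to chi2 with m - 1 df
```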

Zelen (Citation1971) also presented a statistic to test the homogeneity of odds ratios (Zelen, Citation1971). However, it was found to be biased and inconsistent (Halperin et al., Citation1977). Instead, Halperin et al. (Citation1977) presented a corrected version of the Zelen statistic, which we will use. It also uses an expected value E1i, like Breslow–Day, but now with an alternative estimate of the pooled odds ratio. The odds ratio ORCZ is given by ORCZ = exp(β̂0), with β̂0 the maximum likelihood estimator of the fixed-effects logistic regression analysis under the null hypothesis of homogeneity. The expected value E1i = E[X1i | Xi, ORCZ] can be obtained by solving Equation (3) with ORCMH replaced by ORCZ.

Then the corrected Zelen statistic for testing homogeneity is given by

Z = ∑_{i=1}^m (X1i − E1i)²/var(X1i | Xi, ORCZ)

with the variance given in (5) using the solution E1i = E[X1i | Xi, ORCZ]. If the null hypothesis in (1) is true, the corrected Zelen statistic follows an asymptotic χ² distribution with m − 1 degrees of freedom (Halperin et al., Citation1977).

Liang and Self (Citation1989) presented test statistics for sparse data (Liang & Self, Citation1989). (Recall that sparse data have low cell counts, thus not necessarily a low number of studies.) Again, an expected value E1i is used given Xi and a pooled odds ratio ORCL. Here ORCL is given by ORCL = exp(β̂CL0), with β̂CL0 the maximum likelihood estimator of the conditional likelihood function given Xi and γ1 = γ2 = ⋯ = γm = 0. The conditional logistic regression estimate can be obtained using standard software packages. (We applied the procedure LOGISTIC of SAS.) The expected value E1i = E[X1i | Xi, ORCL] can be obtained by solving Equation (3) with ORCMH replaced by ORCL. The Liang and Self test statistic T2 is defined by

T2 = ∑_{i=1}^m (X1i − E1i)²/var(X1i | Xi, ORCL)

with var(X1i | Xi, ORCL) given by (5) and E1i the described solution of (3) using ORCL. The statistic T2 is approximately chi-square distributed with m − 1 degrees of freedom (Liang & Self, Citation1989).
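The Breslow–Day, corrected Zelen, and Liang–Self statistics share the same skeleton: plug a pooled odds ratio into (3), solve the quadratic for E1i, and sum the squared deviations standardized by the variance in (5). A sketch of that shared computation (counts and the pooled odds ratio value are illustrative, the case OR ≠ 1 is assumed so that (3) stays quadratic, and the Tarone correction term of (4) is omitted):

```python
import numpy as np
from scipy.stats import chi2

def homogeneity_chi2(X1, n1, X0, n0, OR):
    """Chi-square homogeneity statistic for a given pooled odds ratio OR
    (the shared form of the Zelen / Liang-Self / uncorrected Breslow-Day sums)."""
    X = X1 + X0
    # Coefficients of (OR - 1) E^2 - [(X + n1) OR + (n0 - X)] E + X n1 OR = 0
    a = OR - 1.0                                   # assumes OR != 1
    b = -((X + n1) * OR + (n0 - X))
    c = X * n1 * OR
    disc = np.sqrt(b ** 2 - 4 * a * c)
    r_small = (-b - disc) / (2 * a)
    r_large = (-b + disc) / (2 * a)
    lo = np.maximum(0.0, X - n0)                   # admissible range for E1i
    hi = np.minimum(X, n1)
    E = np.where((r_small > lo) & (r_small < hi), r_small, r_large)
    var = 1.0 / (1 / E + 1 / (X - E) + 1 / (n1 - E) + 1 / (n0 - X + E))
    stat = float(np.sum((X1 - E) ** 2 / var))
    return stat, chi2.sf(stat, df=len(X1) - 1)

# Illustrative data and pooled odds ratio
X1 = np.array([15., 20., 11.]); n1 = np.array([50., 45., 55.])
X0 = np.array([10.,  8., 12.]); n0 = np.array([50., 40., 60.])
stat, p = homogeneity_chi2(X1, n1, X0, n0, OR=2.0)
```

The three tests differ only in which pooled odds ratio estimate is passed in as `OR`.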

Following Liang and Self's approach, we derived an alternative statistic. Instead of using the conditional logistic regression, we apply a random-effects logistic regression model under the null assumption var(γi) = 0. The odds ratio we apply is ORR = exp(β̂R0), with β̂R0 the maximum likelihood estimator described in Section 2.1. The test statistic TR is equal to T2, but with E1i the solution of (3) with ORCMH replaced by ORR, and with the variance given by (5) using this solution E1i.

Finally, an alternative χ² statistic is the Peto statistic, which Yusuf et al. (Citation1986) presented to test the homogeneity of the odds ratios. The Peto statistic P is given by

P = ∑_{i=1}^m (Oi − Ei)²/Vi − [∑_{i=1}^m (Oi − Ei)]²/∑_{i=1}^m Vi

where Oi = X1i, Ei = Xi·n1i/ni, ni = n1i + n0i and Vi is defined by

Vi = Xi·(ni − Xi)·n1i·n0i/(ni²·(ni − 1))

When the null hypothesis in (1) is true, the Peto statistic has an approximate χ² distribution with m − 1 degrees of freedom (Yusuf et al., Citation1986).
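The Peto statistic itself is simple to compute from the per-study counts; a Python sketch with illustrative numbers:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative per-study counts
X1 = np.array([15., 20., 11.]); n1 = np.array([50., 45., 55.])
X0 = np.array([10.,  8., 12.]); n0 = np.array([50., 40., 60.])

X = X1 + X0                   # total events per study
n = n1 + n0                   # total participants per study
O = X1                        # observed events in the treated arm
E = X * n1 / n                # expected events under homogeneity
V = X * (n - X) * n1 * n0 / (n ** 2 * (n - 1))   # hypergeometric variance

P = np.sum((O - E) ** 2 / V) - np.sum(O - E) ** 2 / np.sum(V)
m = len(X1)
p_value = chi2.sf(P, df=m - 1)
```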

3. Simulation

3.1. Introduction

Although all approaches discussed above assume equal risk probabilities for all participants in one treatment group of one study, this assumption hardly ever holds in practice. In reality, the risk probabilities of individuals are influenced by the individuals' characteristics. Therefore, the data are simulated so that each individual has a unique risk probability. This is done by creating covariates for each individual that affect the risk probability in addition to the influence of the treatment. Since populations vary across studies, the distributions of the covariates are affected by study, already inducing one form of treatment–study interaction. We also added treatment–study interactions independent of the covariates.

3.2. Simulation model and settings

The simulation model to generate m studies can be described as follows. First, the total number of participants ni for study i is drawn using an overdispersed Poisson distribution with parameter λ: a value γi ~ Γ(a0, b0) is drawn to create a study-specific parameter λi = λ·exp(0.5·γi), and ni is then drawn from a Poisson distribution with parameter λi. As mentioned in Section 1, to make the simulation somewhat more realistic we generated covariates, in this case age and gender. The covariate age aij for the jth participant in study i is created by

aij=α0+ηi+εij,

with α0 the average age over all participants in all studies, ηi ~ N(0, τs²) a study effect for age and εij ~ N(0, τa(s)²) the residual age variation within study i. The covariate gender gij for the jth participant in study i is generated with a uniform variable zij: the jth participant is assigned the value gij = 1 if zij > 0.5 and zero otherwise.

Then we need to generate study-specific treatment effects independent of the covariates. We use a bivariate normally distributed vector (u0i, u1i) such that

(u0i, u1i)ᵀ ~ N( (α, α + β)ᵀ ; [σ0², ρσ0σ1; ρσ0σ1, σ1²] )

The parameter α represents an intercept generating the risk of an event for the control group and β is the direct treatment effect. Additionally, a random variable vij ~ N(0, σ²) is drawn to generate heterogeneity across participants in the study due to variables other than age and gender.

A treatment indicator δij is generated with a uniform (0, 1) distribution: if wij is uniformly distributed, then δij = 1 if wij > p and zero otherwise, with p the allocation ratio between control and treatment. The event Xij is Bernoulli(pij) distributed, with pij satisfying

(6) log[pij/(1 − pij)] = (αa + βa·δij)·aij + αg·gij + uij + vij

with αa the effect of age on the risk of an event irrespective of treatment, βa a treatment–age interaction effect, αg the effect of gender and uij = u0i·(1 − δij) + u1i·δij the study treatment effect. Finally, the dichotomous response variable Xij is drawn with a uniform (0, 1) variable yij ~ U(0, 1): Xij = 1 if yij ≤ pij and zero otherwise. This procedure is repeated 1,000 times, resulting in 1,000 simulation runs for each set of parameters. Per simulation run it is determined whether the null hypothesis of homogeneity is rejected, using the approaches in Section 2, which all ignore the individual heterogeneity. The percentage of rejected null hypotheses over the number of simulations is reported.
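One run of the data-generating process above can be sketched as follows. All parameter values are placeholders chosen for illustration, not the settings of Table 1, and ρ is kept below 1 so the bivariate normal draw is non-degenerate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter values (placeholders, not the paper's settings)
m, lam = 5, 100
a0, b0 = 2.0, 0.5                       # Gamma(a0, b0) for study-size overdispersion
alpha0, tau_s, tau_a = 40.0, 3.0, 10.0  # age: overall mean, between/within-study sd
alpha, beta = -2.0, 0.5                 # control intercept and direct treatment effect
s0, s1, rho = 0.3, 0.3, 0.5             # sd's and correlation of (u0, u1)
alpha_a, beta_a, alpha_g, sigma = 0.05, 0.0, 0.1, 0.2
p_alloc = 0.5

cov = np.array([[s0**2, rho*s0*s1], [rho*s0*s1, s1**2]])
studies = []
for i in range(m):
    g = rng.gamma(a0, b0)
    n_i = rng.poisson(lam * np.exp(0.5 * g))            # overdispersed study size
    eta = rng.normal(0.0, tau_s)                        # study effect on age
    age = alpha0 + eta + rng.normal(0.0, tau_a, n_i)
    gender = (rng.uniform(size=n_i) > 0.5).astype(float)
    u0, u1 = rng.multivariate_normal([alpha, alpha + beta], cov)
    v = rng.normal(0.0, sigma, n_i)                     # residual heterogeneity
    delta = (rng.uniform(size=n_i) > p_alloc).astype(float)
    u = u0 * (1 - delta) + u1 * delta
    logit = (alpha_a + beta_a * delta) * age + alpha_g * gender + u + v
    p_ij = 1 / (1 + np.exp(-logit))
    x = (rng.uniform(size=n_i) < p_ij).astype(int)      # Bernoulli(p_ij) events
    studies.append((delta, x))
```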

Note that the final model in (6) has a study-specific latent variable Z0i = αa·ηi + u0i for the control group, and a study-specific latent variable Z1i = (αa + βa)·ηi + u1i for the treatment group. The variance–covariance matrix of these latent variables is given by

[ αa²·τs² + σ0²,  αa·(αa + βa)·τs² + ρσ0σ1 ; αa·(αa + βa)·τs² + ρσ0σ1,  (αa + βa)²·τs² + σ1² ]

When βa = 0, ρ = 1 and σ0 = σ1, the correlation between Z0i and Z1i equals 1 and there is no treatment–study interaction. In case βa ≠ 0 or ρ ≠ 1, there is a treatment–study interaction and thus heterogeneity in treatment effects. The size of the heterogeneity is given by βa²·τs² + 2σ0²·(1 − ρ) when σ0 = σ1: the larger this value, the higher the statistical power of the methods. Thus, the heterogeneity reduces as ρ increases to 1. For the simulation we chose σ0² = σ1². The Type I error was simulated assuming ρ = 1 and βa = 0. The different simulation settings that we studied are presented in Table 1.

4. Results

4.1. Meta-analysis of antituberculosis therapy

We applied the statistical methods presented in Section 2 to a meta-analysis on antituberculosis therapy (Pasipanodya, Srivastava, & Gumbo, Citation2012). This meta-analysis comprised 26 different regimens (treatment plans or regulated courses). In each regimen, the patients were divided into slow acetylators (control) and rapid acetylators (treatment). The objective was to test whether microbiological failure after therapy, relapse after therapy and acquired drug resistance (ADR) were equally likely to occur among slow and rapid acetylators. The authors limited their selection of studies to randomized, controlled clinical trials in which outcomes were reported in table, graph or text format. All patients who were considered microbiological failures, relapses or ADR during treatment were recorded for slow and rapid acetylators. For each of the three outcomes, the authors then calculated risk ratios (RR) with 95% confidence intervals using regimen as the unit of analysis. Random-effects models were used to combine regimens when I² > 50% and the corresponding P-value < 0.05; otherwise fixed-effects models were used to combine the RRs.

For our study we only consider the results of one outcome, namely microbiological failure. The authors performed a test of homogeneity of the relative risks using I² and found I² = 39% with a corresponding P = 0.035 (Pasipanodya et al., Citation2012). They then applied mixed-effects modeling for this outcome, resulting in a combined RR of 2 with a 95% confidence interval of [1.5; 2.7].

However, four of the 26 regimens contained zero cells, and the selected methods cannot deal with this unless a correction to the cell counts is made (Gart, Citation1966). We chose to remove these four regimens from our data and to apply all the selected approaches to the remaining 22 regimens.

The test statistic, degrees of freedom and P-value for rejecting the null hypothesis are presented in Table 2. At a 5% level of significance, the null hypothesis is rejected only by the random-effects logistic regression model. The P-values range from 0.035 to 0.124, so the proposed methods do not all agree.

4.2. Simulation results

The full set of results is presented in the supplementary material; a subset is shown in Tables 3–6. Tables 3–5 present a subset of the simulation results where phi depends on age, gender and other heterogeneous factors vij. They represent the most realistic practical settings, and allow us to contrast the results and compare them with the literature. Table 6 presents a selection of the Type I error and statistical power results under the assumption of the binomial distribution; in that case the probability of a success phi does not depend on age, gender or any other participant variables vij. The random-effects logistic regression analysis occasionally showed numerical problems, in particular when heterogeneity between studies was small or absent. In around 13.7% of all simulation runs the full random-effects logistic regression model failed to converge, and the results of these problematic simulation runs were excluded from the results.

Table 1. Parameter values used for a simulation model

Table 2. Results of meta-analysis on the pharmacokinetic variability hypothesis for ADR and failure of antituberculosis therapy

Table 3. Type I error rate for m=5, aa=0.05, ag=0.1

Table 4. Statistical power for ρ=0.3, m=5, l=100, aa=0.05, ag=0.1

Table 5. Statistical power for ρ=0.6, m=5, l=100, aa=0.05, ag=0.1

Table 6. Type I error and statistical power for aa=ag=σ2=0, m=5, l=100

For the realistic practical settings (Table 3), the fixed-effects logistic regression analysis has a slightly inflated Type I error when the number of participants within studies is relatively small, irrespective of the number of studies, but this is still acceptable. When sample sizes within studies increase, the Type I error moves closer to the nominal level. The Q-statistic, the Bliss statistic and the I2 statistic consistently underestimate the Type I error (Table 3). This is most apparent when only a third of the participants are exposed to the treatment, and it improves slightly as the number of studies increases (supplementary material). The I2 statistic has the smallest Type I error rates of all the methods, except for a few cases where the Peto statistic gives even smaller values. The random-effects logistic regression model mostly underestimates the Type I error rate when σ₀² and σ₁² are small, but overestimates it as σ₀² and σ₁² increase (Table 3); this gets even worse when the number of studies increases (supplementary material). The Type I error rates of the chi-square methods based on estimates of the odds ratio, i.e. the Breslow–Day statistic, the corrected Zelen statistic, Liang and Self's T2 and its adjusted version TR, are almost at the nominal value. However, when one-third of the participants are exposed to treatment, the chi-square methods stay just below the nominal level; this disappears when the number of studies increases. When participants no longer have individual risks, the Type I error rates (Table 6) are mostly similar to those for the realistic practical settings, except for two cases. The fixed-effects logistic regression no longer gives liberal Type I error rates when there is no treatment effect (β = 0).

Investigating the power for the realistic practical settings (Tables 4 and 5) shows, as expected, that the statistical power increases as the number of studies increases. This improved power for testing the interaction effect can be explained by the reduced sampling error due to the increased sample size. Furthermore, an increase in σ₀² and σ₁² results in higher power, since they increase the treatment interaction effects through the term 2σ₀²(1−ρ) in the size of the heterogeneity. The statistical power is negatively correlated with ρ, with no interaction effect when ρ = 1 and βa = 0 (as mentioned earlier). The highest statistical power is obtained by the fixed-effects logistic regression model, followed by the χ2 statistics (the Breslow–Day test, the corrected Zelen test, T2 and TR). The lowest statistical power is obtained by the I2 statistic and the random-effects logistic regression model, respectively. These results hold irrespective of the number of studies. Although Bliss uses the Q-statistic differently from Cochran, the two tests behave almost identically, but they are less powerful than the best test statistics. It should be noted that heterogeneity in treatment effects arising from differences in age effects across studies (βa > 0) does not increase the power much, a result that holds even when the number of studies increases. This is due to the selected settings: the heterogeneity effect in the simulation was only increased by βa²τa² = 0.01² × 10 = 0.001, which is very small.
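The heterogeneity term involving ρ can be derived in one line. Assuming (as a reconstruction of the model behaviour described above) that the control-arm and treatment-arm random study effects μ₀h and μ₁h have variances σ₀² and σ₁² and correlation ρ, the study-specific treatment contrast has variance

```latex
% Variance of the study-specific treatment contrast from correlated
% random study effects in the two arms.
\operatorname{Var}(\mu_{1h} - \mu_{0h})
  = \sigma_0^2 + \sigma_1^2 - 2\rho\,\sigma_0\sigma_1
  \stackrel{\sigma_0 = \sigma_1}{=} 2\sigma_0^2(1-\rho) .
```

This term vanishes at ρ = 1 and grows as ρ decreases, which matches the negative association between ρ and the statistical power observed in the simulations.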

Comparing the power results from the setting in which participants have a common risk (Table 6) with the power results from the setting in which individuals have their own risks (Tables 4 and 5) shows a few interesting differences. First, the power is substantially smaller when individuals are heterogeneous in their risk rates, which makes it more difficult to detect treatment heterogeneity across studies. The power for testing treatment interactions increases slightly in the presence of a treatment effect when individuals have heterogeneous risk rates. The opposite holds when individuals are not heterogeneous, since the power for treatment heterogeneity is substantially smaller for β = 2 than for β = 0 (Table 6 and supplementary material). With respect to choosing the best methods for detecting treatment effect heterogeneity, no real difference could be detected.

5. Discussion

The purpose of this article was to investigate the performance of statistical methods that may test for the homogeneity of odds ratios in meta-analysis. Ten methods were applied to a real-life meta-analysis study and to practically realistic simulated datasets. The simulated datasets were created using different scenarios based on different numbers of studies, participants within studies, unbalanced number of observations for treatment and control groups, whether or not there is an actual treatment effect, individual risks through covariates and the level of heterogeneity.

The Peto statistic is somewhat unreliable with respect to its Type I error rate, a conclusion that has been reached earlier (Gavaghan et al., Citation2000). The fixed-effects logistic regression analysis is a slightly liberal test, although the inflation of the Type I error rate is not too large and it decreases with a larger number of studies. This conclusion has been reached earlier (Paul & Donner, Citation1992), although we could not demonstrate it for settings where all individuals have the same constant risk rate, the settings used in the literature. In our study the inflation occurred when the proportion of success depends on individual covariates and the number of studies is small. This interesting result merits further investigation to find the exact analytical cause, but that is beyond the scope of this article.

The fact that the fixed-effects logistic regression has the highest power of all the methods is partly due to this inflated Type I error rate. The I2 statistic can be considered a very conservative test. This conclusion had been reached earlier (Huedo-Medina et al., Citation2006), but only for continuous responses. It makes the I2 an unsuitable test statistic for homogeneity of odds ratios. The random-effects logistic regression analysis is also conservative, albeit less conservative than the I2 for small numbers of studies, but it is less powerful than the I2 for larger numbers of studies. This low power, combined with the unreliable Type I error rate, makes the random-effects model unsuitable for testing homogeneity; see also Bagheri et al. (Citation2011). The Q-statistic and the Bliss statistic can also be considered conservative tests, a conclusion reached earlier (Higgins & Thompson, Citation2002; Higgins et al., Citation2003; Huedo-Medina et al., Citation2006); heterogeneity in the individuals' risk rates makes this even more apparent. The Type I error rates of the chi-square methods are close to the nominal value, but Paul and Donner (Citation1989) found T2 to have liberal Type I error rates in the setting of a small number of centers and a large number of units. In our study, however, T2 (and all the χ2 methods) hardly overestimated the Type I error in such settings, although we did not study large numbers of participants within studies. The Type I error rate of T2 is not affected by heterogeneity of the individuals' risk rates. Since the allocation of participants to treatment and control groups seems to affect the statistical power, it should be taken into consideration. Additionally, heterogeneity in the individuals' risk rates reduces the power, in particular when no treatment effect is present.

Overall, the chi-square tests based on odds ratios and the fixed-effects logistic regression analysis perform best. We recommend the Breslow–Day test adjusted by Tarone, for convenience purposes, since it is available in most statistical software packages while the other tests are not. The Breslow–Day test had been recommended before by Jones et al. (Citation1989), but only when the number of studies is large; we also recommend it when the number of studies is small. However, when a treatment effect is suspected to be present and only a small number of studies is available (m ≤ 10), we recommend the fixed-effects logistic regression over the χ2 tests, since it has slightly higher power than the chi-square test statistics with an almost nominal Type I error rate. We advise against using the Q-statistic, the Bliss statistic and the I2 statistic because of their poor performance, and due to the recent theoretical concerns expressed by Hoaglin (Citation2016). The Peto statistic and the random-effects logistic regression are not recommended either, due to their poor performance.

This article has its limitations. First, the conclusions presented hold only for dichotomous outcomes. Second, although the simulation method creates data at the unit (subject) level, thereby adding more variation, it uses normally distributed latent variables, an assumption that may not always be realistic in practice. Finite mixture models for the latent variables could be a useful alternative; more specifically, the beta-binomial distribution (Skellam, Citation1948) could be contemplated to perform the interaction test. Another suggestion is to use methods that estimate the between-study variance; see e.g. Brockwell and Gordon (Citation2001) and DerSimonian and Laird (Citation1986) for an overview. The estimated confidence intervals of the between-study variance could then be used to test the homogeneity of effect sizes. Finally, Bayesian approaches to testing the interaction between study and treatment effects are another option for future research.
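The beta-binomial alternative mentioned above replaces the normal latent variable with a Beta-distributed success probability, which induces overdispersion relative to the binomial. A minimal sketch (illustrative parameters only, not a proposal from the paper):

```python
import random

def beta_binomial(n, alpha, beta_, rng=random):
    """Draw the number of successes out of n trials when the success
    probability is itself Beta(alpha, beta_) distributed."""
    p = rng.betavariate(alpha, beta_)
    return sum(rng.random() < p for _ in range(n))

random.seed(7)
n, alpha, beta_ = 50, 2.0, 6.0
draws = [beta_binomial(n, alpha, beta_) for _ in range(2000)]

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)

# Theoretical moments: mean = n a/(a+b); the variance exceeds the binomial
# variance n p (1-p) by the factor (a + b + n) / (a + b + 1).
p = alpha / (alpha + beta_)
print(round(mean, 2))         # near n * p = 12.5
print(var > n * p * (1 - p))  # overdispersed relative to the binomial
```

Plugging such draws into the event counts of simulated 2x2 tables would give a latent-variable structure that is skewed rather than normal, which is the kind of robustness check suggested above.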

Acknowledgments

The authors are very grateful to the reviewer whose comments helped improve the results section and thereby expanded its relevance.

Supplementary material

Supplemental material for this article can be accessed at https://doi.org/10.1080/25742558.2018.1478698.

Additional information

Funding

This work was funded by the Netherlands Organization for Scientific Research [grant number: 023.005.087].

Notes on contributors

Osama Almalik

Edwin van den Heuvel is a full-time professor at the Department of Mathematics and Computer Sciences in the Eindhoven University of Technology and he is the head of the Statistics group within the Stochastics section. The Statistics group develops and compares data-analytical methods for analyzing and sampling complex structured correlated datasets. It includes parameter estimation, model fitting, latent variable models, mixed models, missing data, statistical process control, survival and reliability theory, time series analysis, and statistical learning methods. One of the central themes is the analysis of high-dimensional temporal datasets and other large datasets. The group actively explores new research lines in data science. This current research article is part of a PhD project of Osama Almalik (the principal author) on meta-analysis for non-standard or complex situations under supervision of van den Heuvel. It is considered beneficial to medical researchers, as well as statisticians, when carrying out meta-analysis for combining effect sizes from cohort studies and clinical trials.

References

  • Agresti, A. (2002). Categorical data analysis (2nd ed.). New York, NY: John Wiley & Sons.
  • Bagheri, Z., Ayatollahi, S. M. T., & Jafari, P. (2011). Comparison of three tests of homogeneity of odds ratios in multicenter trials with unequal sample sizes within and among centers. BMC Medical Research Methodology, 11, 58.
  • Breslow, N. E., & Day, N. E. (1980). Statistical methods in cancer research, Volume I: The analysis of case-control studies (IARC Scientific Publications No. 32). Lyon: International Agency for Research on Cancer.
  • Bliss, C. L. (1952). The statistics of Bioassay: With special reference to the vitamins. New York, NY: Academic Press Inc.
  • Breslow, N. E. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association, 91(433), 14–28. doi:10.1080/01621459.1996.10476660
  • Brockwell, S. E., & Gordon, I. R. (2001). A comparison of statistical methods for meta-analysis. Statistics in Medicine, 20(6), 825–840. doi:10.1002/(ISSN)1097-0258
  • Cochran, W. G. (1954). The combination of estimates from different experiments. Biometrics, 10, 101–129. doi:10.2307/3001666
  • DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Control Clinical Trials, 7(3), 177–188. doi:10.1016/0197-2456(86)90046-2
  • Gart, J. J. (1966). Alternative analyses of contingency tables. Journal of the Royal Statistical Society Series B (Methodological), 28(1), 164–179.
  • Gavaghan, D. J., Moore, R. A., & McQuay, H. J. (2000). An evaluation of homogeneity tests in meta-analyses in pain using simulations of individual patient data. Pain, 85(3), 415–424. doi:10.1016/S0304-3959(99)00302-4
  • Halperin, M., Ware, J. H., Byar, D. P., Mantel, N., Brown, C. C., Koziol, J., … Green, S. B. (1977). Testing for interaction in an I×J×K contingency table. Biometrika, 64, 271–275.
  • Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11), 1539–1558. doi:10.1002/sim.1186
  • Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327(7414), 557–560. doi:10.1136/bmj.327.7414.557
  • Hoaglin, D. C. (2016). Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis. Statistics in Medicine, 35, 485–495. doi:10.1002/sim.v35.4
  • Huedo-Medina, T. B., Sánchez-Meca, J., Marín-Martínez, F., & Botella, J. (2006). Assessing heterogeneity in meta-analysis: Q statistic or I2 index. Psychological Methods, 11(2), 193–206. doi:10.1037/1082-989X.11.2.193
  • Jackson, D. (2006). The power of the standard test for the presence of heterogeneity in meta-analysis. Statistics in Medicine, 25, 2688–2699. doi:10.1002/(ISSN)1097-0258
  • Jones, M. P., O’Gorman, T. W., Lemke, J. H., & Woolson, R. F. (1989). A Monte Carlo investigation of homogeneity tests of the odds ratio under various sample size configurations. Biometrics, 45(1), 171–181. doi:10.2307/2532043
  • Kulinskaya, E., Dollinger, M. B., & Bjørkestøl, K. (2011). Testing for homogeneity in meta-analysis. I. The one parameter case: Standardized mean difference. Biometrics, 67, 203–212. doi:10.1111/j.1541-0420.2010.01442.x
  • Liang, K. Y., & Self, S. G. (1985). Tests of homogeneity of odds ratio when the data are sparse. Biometrika, 72, 353–358. doi:10.1093/biomet/72.2.353
  • Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
  • McCulloch, C. E., Searle, S. R., & Neuhaus, J. M. (2008). Generalized, linear, and mixed models (2nd ed.). New Jersey: Wiley.
  • Mzolo, T., Hendriks, M., & van den Heuvel, E. (2013). A comparison of statistical methods for combining relative bioactivities from parallel line bioassays. Pharmaceutical Statistics, 12(6), 375–384.
  • Pasipanodya, J. G., Srivastava, S., & Gumbo, T. (2012). Meta-analysis of clinical studies supports the pharmacokinetic variability hypothesis for acquired drug resistance and failure of antituberculosis therapy. Clinical Infectious Diseases, 55(2), 169–177. doi:10.1093/cid/cis353
  • Paul, S. R., & Donner, A. (1992). Small sample performance of tests of homogeneity of odds ratios in K 2 × 2 tables. Statistics in Medicine, 11(2), 159–165. doi:10.1002/sim.4780110203
  • Sánchez-Meca, J., & Marín-Martínez, F. (1997). Homogeneity tests in meta-analysis: A Monte Carlo comparison of statistical power and Type I error. Quality and Quantity, 31, 385–399. doi:10.1023/A:1004298118485
  • Skellam, J. G. (1948). A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. Journal of the Royal Statistical Society: Series B, 10(2), 257–261.
  • Sutton, A. J., & Higgins, J. P. (2008). Recent developments in meta-analysis. Statistics in Medicine, 27(5), 625–650. doi:10.1002/sim.2934
  • Tarone, R. E. (1985). On heterogeneity tests based on efficient scores. Biometrika, 72, 91–95. doi:10.1093/biomet/72.1.91
  • Verbeke, G., & Molenberghs, G. (2003). The use of score tests for inference on variance components. Biometrics, 59, 254–262. doi:10.1111/1541-0420.00032
  • Wolfinger, R., & O’Connell, M. (1993). Generalized linear mixed models: A pseudo-likelihood approach. Journal of Statistical Computation and Simulation, 48, 233–243. doi:10.1080/00949659308811554
  • Woolf, B. (1955). On estimating the relationship between blood group and disease. Annals of Human Genetics, 19, 251–253. doi:10.1111/j.1469-1809.1955.tb01348.x
  • Yusuf, S., Peto, R., Lewis, J., Collins, R., & Sleight, P. (1986). Beta blockade during and after myocardial infarction: An overview of the randomized trials. Progress in Cardiovascular Diseases, 27(5), 335–371. doi:10.1016/S0033-0620(85)80003-7
  • Zelen, M. (1971). The analysis of several 2 × 2 contingency tables. Biometrika, 58, 129–137.