Abstract
Meta-analysis is a statistical methodology that combines the outcomes of several independent studies. Either a fixed-effects or a random-effects model is used to combine the results of the independent studies. The choice depends on whether the outcomes can be considered homogeneous or not. Several methods have been developed to test the homogeneity of effect sizes. Although many of them have been compared already, they have not been studied in more realistic practical settings where individuals all have their own risk of an event, and a complete overview is lacking. In this article, we investigate the performance of 10 statistical methods to test homogeneity of the effect of a binary exposure on a binary clinical outcome, using an extensive simulation study and a real-life meta-analysis. We evaluated the Type I error and the statistical power. The fixed-effects regression model for treatment and study interaction was found to be a slightly liberal test, while the Q-statistic and the Bliss statistic can be considered conservative tests. The random-effects regression model, the Peto statistic and the I² statistic perform rather poorly compared to the other methods. All chi-square tests that are based on the calculation of a specific odds ratio and the fixed-effects logistic regression analysis perform best. Among these chi-square tests, we recommend the Breslow–Day test, but only for convenience purposes, since it is available in most statistical software packages and the other tests are not.
PUBLIC INTEREST STATEMENT
Testing homogeneity of effect sizes (treatment effects, risk factor associations) is an essential part of meta-analysis since it helps us to interpret the consistency of these effects across groups of participants. The testing can be done using various statistical tools, which can result in different conclusions. For this reason we thought it was important to study the performance of the existing statistical tests for homogeneity. Specific comparisons have been done earlier, but we also tried to explore the performance of statistical tests that have never been applied to this specific problem, and to adjust existing statistical tests. We studied the performance of 10 different methods to test the homogeneity of effect sizes using a case study and an extensive simulation study. Based on our study, we recommend one specific statistical test for meta-analysts to use when testing homogeneity, since it outperformed the others in most of the settings we studied.
1. Introduction
Pooling analysis results (such as odds ratios) from multiple studies is common practice in the medical, social and biological sciences. It is the main goal in an aggregate data meta-analysis or in an experimental study where some kind of stratification is applied (like multi-center randomized controlled trials). Researchers would typically be interested mainly in the overall or pooled estimate of a binary exposure or treatment effect, regardless of study differences, since it may better represent an estimate for the full target population than the individual studies. For a binary outcome, this means pooling 2 × 2 contingency tables from several studies. A traditional approach for pooling 2 × 2 contingency tables is the Cochran–Mantel–Haenszel statistic (Breslow, Citation1994; Breslow & Day, Citation1980; Mantel & Haenszel, Citation1959). However, this approach requires that the odds ratios across studies are homogeneous. This emphasizes the need for a test on homogeneity of the odds ratios or, more generally, the homogeneity of treatment effects on the binary clinical outcome. This topic is not new and many different approaches have been developed.
A classic test for homogeneity of effect sizes is the Q-statistic (Cochran, Citation1954). It creates a weighted pooled effect size using the squared inverse within-study standard errors and evaluates whether the variation around this pooled effect is larger than can be expected from the within-study standard errors. It was initially developed for pooling potency estimates from bioassay experiments (Mzolo, Hendriks, & van den Heuvel, Citation2013). A similar approach used in this area is the method of Bliss (Citation1952), which uses the Q-statistic but in a different way. Related measures that are used in systematic reviews are the H², R² and I² statistics (Higgins & Thompson, Citation2002). The I² statistic is also fully determined by the Q-statistic and it was initially developed to quantify the amount of heterogeneity instead of testing for homogeneity. The H² statistic is just an alternative measure for I², but the R² statistic describes the inflation from the random effects to the fixed effects, and it is viewed more as a measure than as a test statistic. However, the three measures of heterogeneity can be used as test statistics using appropriate confidence intervals.
Alternatively, fixed- and random-effects logistic regression models can be applied to test homogeneity. This is performed by including an interaction term between study and treatment, and then applying either a Wald test, a score test, or the Likelihood ratio test for its significance (Agresti, Citation2002).
A method that studies the heterogeneity of the odds ratio directly is the Breslow–Day test adjusted by Tarone (Citation1985). It is some form of Pearson’s chi-square statistic. Other types of chi-square test statistics, which have been developed for specific situations, are Peto statistic (Yusuf, Peto, Lewis, Collins, & Sleight, Citation1986), Liang and Self statistics (Liang & Self, Citation1989) and Zelen statistic (Zelen, Citation1971). Yusuf et al. (Citation1986) originally presented the Peto method for pooling odds ratios, and accompanied it with a homogeneity test (Yusuf et al., Citation1986). Liang and Self (Citation1989) focused on developing test statistics in case data are sparse, i.e. some cells in the study-specific contingency tables contain only a few participants. Zelen developed the Zelen statistic to test the homogeneity of the relative risks (Zelen, Citation1971).
Several of these statistics have been discussed, evaluated and compared on their performance in detecting heterogeneity but, as Sutton and Higgins (Citation2008) point out, there is no overview of the results. The Q-statistic has been shown to have low statistical power when the number of studies is small (Higgins & Thompson, Citation2002; Higgins, Thompson, Deeks, & Altman, Citation2003; Huedo-Medina, Sánchez-Meca, Marín-Martínez, & Botella, Citation2006) and to be conservative for a large number of studies (Higgins & Thompson, Citation2002). Moreover, the application of the Q-statistic to test the homogeneity of treatment effects has come under criticism recently (Hoaglin, Citation2016). Hoaglin (Citation2016) points out that Cochran did not use the Q-statistic itself as a test statistic and that the Q-statistic only approaches a chi-square distribution under the null hypothesis asymptotically when the within-study sample size increases (since the individual terms will then become approximately normally distributed). In addition, the assumptions that the effect sizes are normally distributed and that the within-study variances are known do not apply to many meta-analyses. This is especially the case when the effect sizes are risk differences, the logarithm of the risk differences or the logarithm of the odds ratios. Hoaglin's concerns also apply directly to the widely used I² statistic as well (Hoaglin, Citation2016). Jackson (Citation2006) derived a formula to calculate the power of the homogeneity test when using the Q-statistic. The author only considered the case where the number of studies is odd and the studies have the same within-study variance, and concluded that the homogeneity test based on the Q-statistic indeed had low power. Kulinskaya, Dollinger and Bjørkestøl (Citation2011) improved the performance of the Q-statistic using two approximations, namely a Gamma distribution based on expansions of the first two moments of the Q-statistic and a chi-square distribution with fractional degrees of freedom matching the first moment, and recommended using the latter approximation (Kulinskaya et al., Citation2011). Sánchez-Meca and Marín-Martínez (Citation1997) compared the performance of the Q-statistic with the Schmidt and Hunter rule for homogeneity on a number of continuous outcomes. The outcomes investigated were the standardized mean difference, the point-biserial correlation coefficient and the standardized correlation coefficient. The Schmidt and Hunter rule rejects the homogeneity of treatment effects if the sampling error variance is less than a specific threshold of the total variance (e.g. 75% or 90%). Sánchez-Meca and Marín-Martínez concluded that all tests had low statistical power; the Q-statistic resulted in nominal Type I errors, while the Schmidt and Hunter rule overestimated the Type I errors (Sánchez-Meca & Marín-Martínez, Citation1997). In a Monte Carlo study, Jones, O'Gorman, Lemke and Woolson (Citation1989) examined the performance of the following tests: the likelihood ratio test, Pearson's chi-square test, the Breslow–Day test (Jones et al., Citation1989) and Tarone's adjustment to it (Tarone, Citation1985), a conditional score test and Liang and Self's normal approximation to the score test (Liang & Self, Citation1989). They found that the likelihood ratio test was too liberal. They recommended using the Breslow–Day statistic when the number of studies is large, and Liang and Self's normal approximation to the score test when the data are sparse, i.e. when there are few participants in some studies. Similar results were reached by Paul and Donner (Citation1992). Jones et al. (Citation1989) concluded that all tests considered had low power when the number of studies is small.
This was supported by Gavaghan, Moore, and McQuay (Citation2000), who examined the Peto statistic (Yusuf et al., Citation1986), the Woolf statistic (Woolf, Citation1955), the Q-statistic applied to estimates of the risk difference (DerSimonian & Laird, Citation1986), Liang and Self's normal approximation to the score test, and the Breslow–Day test. These tests were also found to have incorrect Type I error rates in the case of truly homogeneous sets. The authors concluded that the Q-statistic was too liberal and that the Woolf statistic was conservative, and they recommended the Breslow–Day test for routine use (Gavaghan et al., Citation2000). Recently, the Cochrane Reviews have started including the I² statistic in their reviews (Breslow, Citation1994). However, the I² statistic has been shown to be conservative and to have low power for continuous outcomes when the number of studies is small (Huedo-Medina et al., Citation2006). Bagheri, Ayatollahi and Jafari (Citation2011) compared the performance of the random-effects logistic regression model to Cochran's Q-statistic and the Breslow–Day test, and found that the Breslow–Day test had the highest power. The authors nevertheless recommended using the random-effects logistic regression model, citing practical advantages, although these models are known for their computational difficulties (Wolfinger & O'Connell, Citation1993).
This overview demonstrates that a lot of work has already been done. However, some measures have never been investigated (e.g. the Bliss statistic), and not all measures have been investigated in all settings. More importantly, all studies assumed that the underlying binomial distributions for the treatments in the stratified contingency tables would hold and used this in their simulations, while in practical situations this assumption would typically be violated. The aggregate data in a contingency table would be generated by individuals who all have a different probability of obtaining the clinical outcome, due to other variables like age and gender. It is unclear whether the above-mentioned statistical methods remain robust in such more realistic practical settings. We use an extensive simulation study that tries to account for hidden variations that can occur in practice, and we study the performance of 10 methods, including statistical methods that have not been applied before for this purpose.
This article is organized as follows. In Section 2 the statistical approaches for testing the homogeneity of odds ratios are presented, assuming that individual risks do not exist. Section 3 describes the simulation model, in which individual risk scores are simulated. In Section 4 we present the simulation results and a case study on tuberculosis to illustrate the methods. The final section provides a discussion of the results, recommendations for use of the methods and possible further research.
2. Statistical models
Before we introduce the methods, we first introduce notation that will be used throughout this section. The binary exposure or treatment will be referred to as just treatment and the clinical outcome is considered binary. Then let X_ti be the number of successes for treatment t (t = 0, 1) in study i (i = 1, 2, …, k) out of n_ti trials. The control group is indexed by t = 0 while the treatment group is indexed by t = 1. The random variables X_ti are assumed to be mutually independent (independent studies). For the methods that test homogeneity of effect sizes, it is assumed that the expected value of X_ti is equal to n_ti π_ti, with π_ti being the proportion of a success for treatment t in study i. All methods for testing homogeneity will assume that the proportions π_ti satisfy the following form

π_ti = F(α_i + β t + γ_i t),

with F being some kind of a function (typically logistic), α_i an intercept for study i, β the (mean) treatment effect, t a treatment indicator variable with value 1 for the treatment group and 0 otherwise, and γ_i an interaction effect for study i. Then homogeneity means that

γ_1 = γ_2 = ⋯ = γ_k = 0 (1)

when the parameters are considered fixed, or

σ_γ = 0, with γ_i ∼ N(0, σ_γ²), (2)

when they are being assumed random.
Note that in practice the analysis of meta-analysis could also include one or more available covariates, e.g. average age per study or percentage males or females. In this setting, however, we assume that the only available information is the number of successes and the number of units in the treatment and control groups. This way the statistical methods discussed below can be applied to data with the least available information. Many of these methods can be easily generalized to accommodate extra covariates.
2.1. Logistic regression approaches
For the fixed-effects logistic regression analysis, the function F is the logistic distribution function of the form

F(x) = exp(x) / (1 + exp(x)).

We also need to impose a constraint to obtain identifiability of the model. A natural choice is to take γ_k = 0 or Σ_i γ_i = 0. Additionally, logistic regression implies that X_ti is binomial and that X_0i and X_1i are independent (this is often not true due to different risks for individuals, but without additional information X_ti is assumed binomial).
To test (1) we apply the likelihood ratio test. The likelihood of the full model for F being the logistic distribution function is

L(α, β, γ) = ∏_{i=1}^{k} ∏_{t=0}^{1} (n_ti choose X_ti) π_ti^{X_ti} (1 − π_ti)^{n_ti − X_ti},

with X_i = X_0i + X_1i being the total number of events in study i and π_ti = F(α_i + β t + γ_i t). The maximum likelihood estimates are denoted by α̂_i, β̂ and γ̂_i for the full model and α̃_i and β̃ under the null hypothesis. The likelihood ratio test is now defined by

LRT = 2 [log L(α̂, β̂, γ̂) − log L(α̃, β̃, 0)].

The LRT has an approximate χ² distribution with k − 1 degrees of freedom [28]. The LRT statistic can be determined with standard software packages (we applied the GENMOD procedure of SAS).
Instead of a fixed-effects model, the interaction effects can be taken random for the logistic link function. Under this assumption, it is common to also assume that the intercepts α_i are random. Thus, we will assume that (α_i, γ_i) are bivariate normal with mean (α, 0) and variance–covariance matrix Σ given by

Σ = ((σ_α², ρ σ_α σ_γ), (ρ σ_α σ_γ, σ_γ²)),

with σ_α being the standard deviation for the random intercepts, σ_γ the standard deviation for the random interaction term and ρ the correlation between α_i and γ_i.
The number of events X_ti, conditionally on α_i and γ_i, is considered binomial with parameters n_ti and π_ti = F(α_i + β t + γ_i t), and X_0i and X_1i are independent given α_i and γ_i. The log likelihood of the full model can be written as:

ℓ(α, β, Σ) = Σ_{i=1}^{k} log ∫∫ ∏_{t=0}^{1} (n_ti choose X_ti) π_ti^{X_ti} (1 − π_ti)^{n_ti − X_ti} φ(a, g) da dg,

with φ being the bivariate normal density given by the mean (α, 0) and variance–covariance matrix Σ above. The homogeneity hypothesis in (2) can be evaluated using a likelihood ratio test again. The LRT becomes

LRT = 2 [ℓ(α̂, β̂, Σ̂) − ℓ(α̃, β̃, Σ̃)],

with α̂, β̂ and Σ̂ being the maximum likelihood estimators under the full model and α̃, β̃ and Σ̃ being the maximum likelihood estimators under the null hypothesis. Note that the variance–covariance matrix Σ reduces to the single variance σ_α² of the random intercepts when the null hypothesis would be true. The approximate asymptotic distribution of the LRT is a mixture of a chi-square distribution with two degrees of freedom and a degenerate distribution in zero. This means that we should compare the LRT statistic with a chi-square value at twice the chosen significance level (McCulloch, Searle, & Neuhaus, Citation2008; Verbeke & Molenberghs, Citation2003). The LRT statistic can be determined with standard software packages (we applied the NLMIXED procedure of SAS).
2.2. Approaches based on the Q-statistic
The Q-statistic was developed by Cochran (Citation1954) and it can be applied to test the homogeneity of any type of effect size. The Q-statistic is given by

Q = Σ_{i=1}^{k} (θ̂_i − θ̄)² / ŝ_i²,

with θ̂_i being the estimated effect size of study i, θ̄ a weighted average, and ŝ_i an estimated standard error for θ̂_i. For the log odds ratio of study i, we would have

θ̂_i = log[ X_1i (n_0i − X_0i) / (X_0i (n_1i − X_1i)) ].

θ̄ is the weighted average which is given by

θ̄ = Σ_{i=1}^{k} w_i θ̂_i / Σ_{i=1}^{k} w_i,

with the weights w_i = 1/ŝ_i² being given by

ŝ_i² = 1/X_1i + 1/(n_1i − X_1i) + 1/X_0i + 1/(n_0i − X_0i).

Cochran showed that the Q-statistic is approximately χ² distributed with k − 1 degrees of freedom, although he did not apply or discuss log odds ratios as effect sizes. This was done by Woolf (Citation1955).
Bliss (Citation1952) used the Q-statistic in a slightly different way to test the homogeneity hypothesis. Bliss' test statistic is given by

B = Q / (k − 1),

with Q as defined above. Note that Bliss' method was originally developed for testing the equality of potencies obtained from multiple bioassays, similar to that of Cochran. In such cases, f is the average of the corresponding degrees of freedom over all bioassays. The corresponding degrees of freedom per bioassay are obtained as a result of applying a parallel line bioassay model, i.e. a linear regression model (Mzolo et al., Citation2013). In our case we apply a fixed-effects logistic regression per study, so two parameters are estimated per study. Hence, we have chosen the degrees of freedom per study equal to f_i = n_i − 2. Under the null hypothesis, the B-statistic may be approximated by an F distributed random variable with k − 1 and f degrees of freedom.
As mentioned in Section 1, Higgins and Thompson (Citation2002) introduced a few measures for quantifying the heterogeneity of treatment effects that are based on the Q-statistic (Sutton & Higgins, Citation2008). The authors suggested a homogeneity test based on a confidence interval of the I² statistic, given by

I² = (H² − 1) / H²,

with H² = Q / (k − 1). A confidence interval for ln H is given by

ln H ± z_{1−α/2} SE(ln H),

with z_{1−α/2} being the 1 − α/2 quantile of the standard normal distribution and

SE(ln H) = (1/2) [ln Q − ln(k − 1)] / [√(2Q) − √(2k − 3)]   if Q > k,
SE(ln H) = √{ [1/(2(k − 2))] [1 − 1/(3(k − 2)²)] }          otherwise.

This confidence interval for H can be transformed to a confidence interval for I², and the transformed confidence interval can be used to test the hypothesis in (1) by verifying whether the value zero lies in the interval or not.
2.3. Chi-square approaches
One of the oldest approaches for testing homogeneity of odds ratios is the Breslow–Day test (Breslow & Day, Citation1980). They used the Cochran–Mantel–Haenszel pooled odds ratio calculated by

ψ̂ = [Σ_{i=1}^{k} X_1i (n_0i − X_0i)/n_i] / [Σ_{i=1}^{k} X_0i (n_1i − X_1i)/n_i],

with n_i = n_0i + n_1i, and assumed the same distributional assumptions for X_ti as for the fixed-effects logistic regression analysis. Now define A_i as the expected value of X_1i given X_i and ψ̂ under the assumption that the null hypothesis in (1) is correct. A_i can be obtained by solving the following quadratic equation:

A_i (n_0i − X_i + A_i) / [(n_1i − A_i)(X_i − A_i)] = ψ̂. (3)

For testing the homogeneity, the Tarone test statistic (Tarone, Citation1985) is used. This statistic is given by

T = Σ_{i=1}^{k} (X_1i − A_i)² / V_i − [Σ_{i=1}^{k} (X_1i − A_i)]² / Σ_{i=1}^{k} V_i, (4)

with A_i the described solution of (3), and V_i is given by

V_i = [1/A_i + 1/(n_1i − A_i) + 1/(X_i − A_i) + 1/(n_0i − X_i + A_i)]^{−1}. (5)

Note that we applied the Tarone statistic instead of the Breslow–Day statistic. The Breslow–Day statistic is based on the Cochran–Mantel–Haenszel odds ratio estimator. This estimator has been shown to be consistent but not efficient. For this reason Tarone modified the Breslow–Day statistic (Tarone, Citation1985). If (1) is true, the statistic in (4) approximately follows a χ² distribution with k − 1 degrees of freedom. The Tarone-adjusted Breslow–Day test can be carried out using standard software packages (we applied the procedure FREQ of SAS).
Zelen (Citation1971) also presented a statistic to test the homogeneity of odds ratios. However, it was found to be biased and inconsistent (Halperin et al., Citation1977). Instead, Halperin et al. (Citation1977) presented a corrected version of the Zelen statistic, which we will use. It also uses an expected value, like Breslow–Day, but now with an alternative estimate of the pooled odds ratio. The odds ratio ψ̃ is given by

ψ̃ = exp(β̃),

with β̃ being the maximum likelihood estimator of the fixed-effects logistic regression analysis under the null hypothesis of homogeneity. The expected value Ã_i can be obtained by solving Equation (3) with ψ̂ being replaced by ψ̃.
Then the corrected Zelen statistic for testing homogeneity is given by

Z = Σ_{i=1}^{k} (X_1i − Ã_i)² / Ṽ_i,

with the variance Ṽ_i given in (5) using the solution Ã_i. If the null hypothesis in (1) is true, the corrected Zelen statistic will follow an asymptotic χ² distribution with k − 1 degrees of freedom (Halperin et al., Citation1977).
Liang and Self (Citation1989) presented test statistics for sparse data. (Recall that sparse data have low cell counts, thus not necessarily a low number of studies.) Again, an expected value A*_i is used, given X_i and a pooled odds ratio ψ*. Here ψ* is given by

ψ* = exp(β*),

with β* being the maximum likelihood estimator of the conditional likelihood function given X_1, X_2, …, X_k. The conditional logistic regression can be obtained using standard software packages. (We applied the procedure LOGISTIC of SAS.) The expected value A*_i can be obtained by solving Equation (3) with ψ̂ being replaced by ψ*. The Liang and Self test statistic LS is defined by

LS = Σ_{i=1}^{k} (X_1i − A*_i)² / V*_i,

with V*_i being given by (5) and A*_i being the described solution of (3) using ψ*. The statistic LS is approximately chi-square distributed with k − 1 degrees of freedom (Liang & Self, Citation1989).
Following Liang and Self's approach, we derived an alternative statistic. Instead of using the conditional logistic regression, we apply a random-effects logistic regression model, assuming that the intercepts α_i are random. The odds ratio we apply is

ψ** = exp(β**),

with β** being the maximum likelihood estimator under the null hypothesis of the random-effects model described in Section 2.1. The test statistic LS* is equal to LS, but with A**_i the solution of (3) with ψ̂ replaced by ψ** and the variance given by (5), with A**_i the described solution of (3) using ψ**.
Finally, a complete alternative statistic is the Peto statistic. Yusuf et al. (Citation1986) presented the Peto statistic to test the homogeneity of the odds ratios. The Peto statistic P is given by

P = Σ_{i=1}^{k} (O_i − E_i)² / V_i − [Σ_{i=1}^{k} (O_i − E_i)]² / Σ_{i=1}^{k} V_i,

where O_i = X_1i, and E_i and V_i are defined by

E_i = n_1i X_i / n_i,   V_i = n_1i n_0i X_i (n_i − X_i) / [n_i² (n_i − 1)].

When the null hypothesis in (1) is true, the Peto statistic will have an approximate χ² distribution with k − 1 degrees of freedom (Yusuf et al., Citation1986).
3. Simulation
3.1. Introduction
Although all approaches discussed above assume equal risk probabilities for all participants within one treatment group of a study, this assumption is hardly ever true in practice. In reality, the risk probabilities of individuals will be influenced by individual characteristics. Therefore, the data are simulated so that each individual has a unique risk probability. This is done by creating covariates for each individual that affect the risk probability in addition to the influence of the treatment. Since populations vary across studies, the distributions of the covariates are affected by study, which already causes one form of treatment–study interaction. We also added treatment–study interactions independent of the covariates.
3.2. Simulation model and settings
The simulation model to generate studies can be described as follows. First, the total number of participants n_i for study i is drawn. We used an overdispersed Poisson distribution with parameter λ. A positive value u_i is drawn to make a study-specific parameter λ_i = λ u_i. Then n_i is drawn from a Poisson distribution with parameter λ_i. As mentioned in Section 1, to make the simulation somewhat more realistic we generated covariates, in this case age and gender. The covariate age A_ij for the jth participant in study i is created by

A_ij = a + b_i + e_ij,

with a being the average age over all participants in all studies, b_i a study effect for age and e_ij the age effect within study i. The covariate gender G_ij for the jth participant in study i is generated with a uniform variable U_ij. The jth participant will be assigned the value G_ij = 1 if U_ij falls below the selected proportion of males and zero otherwise.
Then we need to generate study-specific treatment effects independent of the covariates. We use a bivariate normally distributed variable (α_0i, α_1i) with mean (α, α + β), standard deviations σ_0 and σ_1 and correlation ρ. The parameter α represents an intercept to generate the risk of an event for the control group and β is the direct treatment effect. Additionally, the random variable ε_ij is drawn from a normal distribution N(0, σ_ε²) to generate heterogeneity across participants in the study due to variables other than age and gender.
A treatment indicator T_ij is generated by a uniform distribution. If V_ij is uniformly distributed, then T_ij = 1 if V_ij ≤ r/(1 + r) and zero otherwise, with r being the ratio of allocation of treatment to control. The event is Bernoulli distributed with probability p_ij satisfying

logit(p_ij) = α_0i + (α_1i − α_0i) T_ij + δ_A A_ij + δ_AT A_ij T_ij + δ_G G_ij + ε_ij,

with δ_A being the effect of age on the risk of an event irrespective of treatment, δ_AT a treatment–age interaction effect, δ_G the effect of gender and α_1i − α_0i the study treatment effect. Finally, the dichotomous response variable Y_ij is drawn with a uniform variable W_ij, i.e. Y_ij = 1 if W_ij ≤ p_ij and zero otherwise. This procedure is repeated 1,000 times, resulting in 1,000 simulation runs for each set of parameters. Per simulation run, it is determined whether the null hypothesis of homogeneity is rejected or not using the approaches in Section 2, which all ignore the individual heterogeneities. The percentage of rejected null hypotheses over the number of simulations is reported.
Note that the final model has a study-specific latent variable α_0i for the control group and a study-specific latent variable α_1i for the treatment group. The variance–covariance matrix for these latent variables is given by ((σ_0², ρ σ_0 σ_1), (ρ σ_0 σ_1, σ_1²)). When ρ = 1 and σ_0 = σ_1, the correlation between α_0i and α_1i is equal to 1 and there is no treatment–study interaction. In case ρ < 1 or σ_0 ≠ σ_1, there is a treatment–study interaction and thus heterogeneity in treatment effects. The size of the heterogeneity is given by

Var(α_1i − α_0i) = σ_0² + σ_1² − 2ρ σ_0 σ_1 = 2σ²(1 − ρ)

when σ_0 = σ_1 = σ. The larger this value, the higher the statistical power of the methods. Thus, the heterogeneity reduces when ρ increases to 1. For the simulation, a choice was made to take σ_0 = σ_1 = σ. The Type I error was simulated assuming ρ = 1 and σ_0 = σ_1. The different simulation settings that we studied are presented in Table 1.
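The size of the heterogeneity stated above can be verified numerically with a quick Monte Carlo sketch (the values of σ and ρ are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, rho = 0.8, 0.4
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
# draw correlated (control, treatment) latent effects and compare the empirical
# variance of their difference with the closed form 2 * sigma^2 * (1 - rho)
draws = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
var_diff = np.var(draws[:, 1] - draws[:, 0])
print(var_diff, 2 * sigma**2 * (1 - rho))
```

The two printed numbers should agree to about two decimal places for this sample size.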
4. Results
4.1. Meta-analysis of antituberculosis therapy
We applied the statistical methods presented in Section 2 to a meta-analysis on antituberculosis therapy (Pasipanodya, Srivastava, & Gumbo, Citation2012). This meta-analysis consisted of 26 different regimens (plans or regulated courses of therapy). In each regimen, patients were classified as slow acetylators (control) or rapid acetylators (treatment). The objective was to test whether microbiological failure after therapy, relapse after therapy and acquired drug resistance (ADR) were equally likely to occur among slow and rapid acetylators. The authors limited their selection of studies to randomized, controlled clinical trials in which outcomes were reported in table, graph or text format. All patients who were considered microbiological failures, relapses or ADR during treatment were recorded for slow and rapid acetylators. For each of the three outcomes, the authors then calculated risk ratios (RR) with 95% confidence intervals using regimen as the unit of analysis. Random-effects models were used to combine regimens when the heterogeneity test indicated heterogeneity with a corresponding P-value <0.05; otherwise fixed-effects models were used to combine the RR.
For our study we only consider the results of one outcome, namely microbiological failure. The authors performed a test of homogeneity of the relative risks for this outcome (Pasipanodya et al., Citation2012) and subsequently applied mixed-effects modeling, resulting in a combined RR of 2, with a confidence interval of [1.5; 2.7] for the RR.
However, four of the 26 regimens contained zero cells, and the selected methods cannot deal with this unless a correction to the cell counts is made (Gart, Citation1966). We chose to remove these four regimens from our data and to apply all the selected approaches to the remaining dataset.
The test statistics, degrees of freedom and P-values for rejecting the null hypothesis are presented in Table 2. Using a 5% level of significance, the null hypothesis is only rejected by the random-effects logistic regression model. Moreover, the P-values range from 0.035 to 0.124. Thus, the proposed methods do not all agree.
4.2. Simulation results
The full set of results is presented in the supplementary material. A subset of these results is shown in Tables 3–6. Tables 3–5 present a subset of the simulation results where the success probability depends on age, gender and other heterogeneous factors. They represent the results for the most realistic practical settings, to contrast these results and to be able to compare our results with the literature. Table 6 presents a selection of the Type I error and statistical power results of the test methods under the assumption of the binomial distribution. In that case, the proportion of a success, π_ti, does not depend on age, gender or any other participant variables. The random-effects logistic regression analysis occasionally showed numerical problems, in particular when the heterogeneity between studies was small or absent. Note that in around 13.7% of all simulation runs the full random-effects logistic regression model did not converge; the results of these problematic simulation runs were eliminated from the results.
Table 1. Parameter values used for a simulation model
Table 2. Results of meta-analysis on the pharmacokinetic variability hypothesis for ADR and failure of antituberculosis therapy
Table 3. Type I error rate for selected parameter settings
Table 4. Statistical power for selected parameter settings
Table 5. Statistical power for selected parameter settings
Table 6. Type I error and statistical power for selected parameter settings
For the realistic practical settings (Table 3), the fixed-effects logistic regression analysis has a slightly increased Type I error when the number of participants within studies is relatively small, irrespective of the number of studies, but this is still acceptable. When the sample sizes within studies increase, the Type I error is closer to nominal. The Q-statistic, the Bliss statistic and the I² statistic constantly underestimate the Type I error (Table 3). This is most apparent when only a third of the participants are exposed to the treatment. It gets slightly better when the number of studies increases (supplementary material). The I² statistic has the smallest Type I error rates of all the methods, except for a few cases where the Peto statistic gives even smaller values. The random-effects logistic regression model mostly underestimates the Type I error rate when the within-study sample sizes n_0i and n_1i are small. However, as n_0i and n_1i increase, it overestimates the Type I error rate (Table 3). This gets even worse when the number of studies increases (supplementary material). The Type I error rates of the chi-square methods using estimates of the odds ratio, i.e. the Breslow–Day statistic, the corrected Zelen statistic, Liang and Self's LS and its adjusted version LS*, are almost at the nominal value. However, when one-third of the participants is exposed to treatment, the chi-square methods stay just below the nominal level, but this disappears when the number of studies increases. When participants no longer have their individual risk, the Type I error rates (Table 6) are mostly similar to the Type I error rates for the realistic practical settings, except for two cases. The fixed-effects logistic regression no longer gives liberal Type I error rates when there is no treatment effect (β = 0).
Investigating the power for the realistic practical settings (Tables 4 and 5), it is obvious that the statistical power increases as the number of studies increases. This improved power for testing the interaction effect can be explained by the reduced sampling error due to the increased sample size. Furthermore, an increase in σ and in 1 − ρ results in higher power, since they increase the treatment interaction effects. This is due to the term 2σ²(1 − ρ) in the size of the heterogeneity. The statistical power is negatively correlated with ρ, with no interaction effect when ρ = 1 and σ_0 = σ_1 (as mentioned earlier). The highest statistical power is obtained by the fixed-effects logistic regression model, followed by the chi-square statistics (the Breslow–Day test, the corrected Zelen test, LS and LS*). The lowest statistical power is obtained by the I² statistic and the random-effects logistic regression model, respectively. These results hold true irrespective of the number of studies. Although Bliss uses the Q-statistic differently from Cochran, the two tests behave almost identically, but they are less powerful than the best test statistics. It should be noted that heterogeneity in treatment effects from differences in age effects across studies does not increase the power a lot, a result that holds even when the number of studies increases. This is due to the selected settings: the heterogeneity effect in the simulation was only increased marginally by the treatment–age interaction, which is really small.
Comparing the power results for the setting in which participants have a common risk (Table ) with those for the setting in which individuals have their own individual risks (Tables and ) shows a few interesting differences. First, the power is substantially smaller when individuals are heterogeneous in their risk rates, which makes it more difficult to detect treatment heterogeneity across studies. The power values for testing treatment interactions increase slightly when there is a treatment effect and individuals have heterogeneous risk rates. The opposite holds when individuals are not heterogeneous, since the power for treatment heterogeneity is then substantially smaller in the presence of a treatment effect than in its absence (Table and supplementary material). With respect to choosing the best methods for detecting treatment effect heterogeneity, no real difference could be detected.
5. Discussion
The purpose of this article was to investigate the performance of statistical methods for testing the homogeneity of odds ratios in meta-analysis. Ten methods were applied to a real-life meta-analysis study and to practically realistic simulated datasets. The simulated datasets were created using scenarios that varied the number of studies, the number of participants within studies, the balance of observations between treatment and control groups, the presence or absence of an actual treatment effect, individual risks through covariates, and the level of heterogeneity.
The Peto statistic is somewhat unreliable with respect to its Type I error rate, a conclusion that has been reached earlier (Gavaghan et al., Citation2000). The fixed-effects logistic regression analysis is a slightly liberal test, although the inflated Type I error rate is not too large, and it decreases with a larger number of studies. This conclusion has been reached earlier (Paul & Donner, Citation1992), although we could not demonstrate it for settings where all individuals have the same constant risk rate, the setting used in the literature. In our study the inflation occurred when the proportion of success depends on individual covariates and the number of studies is small. This interesting result merits further investigation to find the exact analytical cause, but that is beyond the scope of this article.
The fact that the fixed-effects logistic regression has the highest power among all methods is partly due to this inflated Type I error rate. The I² statistic can be considered a very conservative test. This conclusion had been reached earlier (Huedo-Medina et al., Citation2006), but only for continuous responses. It makes the I² an unsuitable test statistic for homogeneity of odds ratios. The random-effects logistic regression analysis is also conservative, albeit less conservative than the I² for small numbers of studies, but it is less powerful than the I² for a larger number of studies. This low power combined with the unreliable Type I error rate makes the random-effects model unsuitable for testing homogeneity; see also Bagheri et al. (Citation2011). The Q-statistic and the Bliss statistic can also be considered conservative tests, a conclusion reached earlier (Higgins & Thompson, Citation2002; Higgins et al., Citation2003; Huedo-Medina et al., Citation2006); heterogeneity in the individuals' risk rates makes this more apparent. The Type I error rates of the chi-square methods are close to the nominal value, although Paul and Donner (Citation1989) found some of these statistics to have liberal Type I error rates in the setting of a small number of centers and a large number of units. In our study, however, the chi-square methods hardly overestimate the Type I error in such a setting, but we did not study large numbers of participants within studies. The Type I error rate of the chi-square methods is not affected by heterogeneity of individuals' risk rates. Since the allocation of participants to treatment and control groups seems to affect the statistical power, it should be taken into consideration. Additionally, heterogeneity in individuals' risk rates reduces the power, in particular when no treatment effects are present.
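The Q-statistic and the I² index discussed above can be computed directly from per-study 2×2 tables. A minimal sketch, assuming Woolf log odds ratios with inverse-variance weights; the table layout and function name are illustrative, not taken from the article:

```python
import math

def q_and_i2(tables):
    """Cochran's Q and the I^2 index from per-study 2x2 tables ((a, b), (c, d)),
    using Woolf log odds ratios with inverse-variance weights."""
    y = [math.log(a * d / (b * c)) for (a, b), (c, d) in tables]
    w = [1.0 / (1/a + 1/b + 1/c + 1/d) for (a, b), (c, d) in tables]
    theta = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)  # pooled log OR
    q = sum(wi * (yi - theta) ** 2 for wi, yi in zip(w, y))
    df = len(tables) - 1
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0  # in percent
    return q, i2
```

Q is referred to the chi-square distribution with K − 1 degrees of freedom; I² is a descriptive percentage, which is one reason why using it as a formal test behaves conservatively.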
Overall, the chi-square tests based on odds ratios and the fixed-effects logistic regression analysis perform best. We recommend the Breslow–Day test adjusted by Tarone for convenience purposes, since it is available in most statistical software packages and the other tests are not. The Breslow–Day test had been recommended before by Jones et al. (Citation1989), but only when the number of studies is large; we also recommend it when the number of studies is small. However, when a treatment effect is suspected to be present and only a small number of studies is available, we recommend using the fixed-effects logistic regression over the chi-square tests, since it has a slightly higher power than the chi-square test statistics, with an almost nominal Type I error rate. We advise against using the Q-statistic, the Bliss statistic and the I² statistic because of their poor performance, and because of the recent theoretical worries expressed by Hoaglin (Hoaglin, Citation2016). The Peto statistic and the random-effects logistic regression are not recommended either, due to their poor performance.
This article has its limitations. First, the conclusions presented hold only for dichotomous outcomes. Second, although the simulation method creates data on a unit (subject) level, hence adding more variation, it uses normally distributed latent variables. This assumption may not always be realistic in practice; finite mixture models for the latent variables could be a useful alternative. More specifically, the beta-binomial distribution (Skellam, Citation1948) could be contemplated for performing the interaction test. Another suggestion is to use methods that estimate the between-study variance; see e.g. Brockwell and Gordon (Citation2001) and DerSimonian and Laird (Citation1986) for an overview. The estimated confidence intervals of the between-study variance can then be used to test the homogeneity of effect sizes. Finally, Bayesian approaches to test the interaction effects between study and treatment effect are another option for future research.
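The between-study-variance route mentioned above can be sketched with the DerSimonian–Laird moment estimator. This is a minimal sketch under our own conventions (the function name and the choice of log odds ratios as study effects are illustrative), not the procedure evaluated in the article:

```python
def dl_tau2(y, v):
    """DerSimonian–Laird moment estimate of the between-study variance tau^2,
    given study effect estimates y (e.g. log odds ratios) and their
    within-study variances v."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    theta = sum(wi * yi for wi, yi in zip(w, y)) / sw  # fixed-effects pooled mean
    q = sum(wi * (yi - theta) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sw - sum(wi * wi for wi in w) / sw
    return max(0.0, (q - (len(y) - 1)) / c)  # truncated at zero
```

A confidence interval for τ² (e.g. via the Q-profile method) that excludes zero would then indicate heterogeneity of effect sizes.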
Acknowledgments
The authors are very grateful to the reviewer, whose comments helped improve the results section and thereby expanded its relevance.
Supplementary material
Supplemental material for this article can be accessed at https://doi.org/10.1080/25742558.2018.1478698.
Additional information
Notes on contributors
Osama Almalik
Edwin van den Heuvel is a full-time professor at the Department of Mathematics and Computer Sciences in the Eindhoven University of Technology and he is the head of the Statistics group within the Stochastics section. The Statistics group develops and compares data-analytical methods for analyzing and sampling complex structured correlated datasets. It includes parameter estimation, model fitting, latent variable models, mixed models, missing data, statistical process control, survival and reliability theory, time series analysis, and statistical learning methods. One of the central themes is the analysis of high-dimensional temporal datasets and other large datasets. The group actively explores new research lines in data science. This current research article is part of a PhD project of Osama Almalik (the principal author) on meta-analysis for non-standard or complex situations under supervision of van den Heuvel. It is considered beneficial to medical researchers, as well as statisticians, when carrying out meta-analysis for combining effect sizes from cohort studies and clinical trials.
References
- Agresti, A. (2002). Categorical data analysis (2nd ed.). New York, NY: John Wiley & Sons.
- Bagheri, Z., Ayatollahi, S. M. T., & Jafari, P. (2011). Comparison of three tests of homogeneity of odds ratios in multicenter trials with unequal sample sizes within and among centers. BMC Medical Research Methodology, 11, 58.
- Bliss, C. L. (1952). The statistics of bioassay: With special reference to the vitamins. New York, NY: Academic Press Inc.
- Breslow, N. E. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association, 91(433), 14–28. doi:10.1080/01621459.1996.10476660
- Breslow, N. E., & Day, N. E. (1980). Statistical methods in cancer research, Volume I: The design and analysis of case-control studies (IARC Scientific Publications Vol. 32). Lyon: International Agency for Research on Cancer.
- Brockwell, S. E., & Gordon, I. R. (2001). A comparison of statistical methods for meta-analysis. Statistics in Medicine, 20(6), 825–840. doi:10.1002/(ISSN)1097-0258
- Cochran, W. G. (1954). The combination of estimates from different experiments. Biometrics, 10, 101–129. doi:10.2307/3001666
- DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177–188. doi:10.1016/0197-2456(86)90046-2
- Gart, J. J. (1966). Alternative analyses of contingency tables. Journal of the Royal Statistical Society Series B (Methodological), 28(1), 164–179.
- Gavaghan, D. J., Moore, R. A., & McQuay, H. J. (2000). An evaluation of homogeneity tests in meta-analyses in pain using simulations of individual patient data. Pain, 85(3), 415–424. doi:10.1016/S0304-3959(99)00302-4
- Halperin, M., Ware, J. H., Byar, D. P., Mantel, N., Brown, C. C., Koziol, J., … Green, S. B. (1977). Testing for interaction in an I×J×K contingency table. Biometrika, 64, 271–275.
- Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11), 1539–1558. doi:10.1002/sim.1186
- Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327(7414), 557–560. doi:10.1136/bmj.327.7414.557
- Hoaglin, D. C. (2016). Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis. Statistics in Medicine, 35, 485–495. doi:10.1002/sim.v35.4
- Huedo-Medina, T. B., Sánchez-Meca, J., Marín-Martínez, F., & Botella, J. (2006). Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychological Methods, 11(2), 193–206. doi:10.1037/1082-989X.11.2.193
- Jackson, D. (2006). The power of the standard test for the presence of heterogeneity in meta-analysis. Statistics in Medicine, 25, 2688–2699. doi:10.1002/(ISSN)1097-0258
- Jones, M. P., O’Gorman, T. W., Lemke, J. H., & Woolson, R. F. (1989). A Monte Carlo investigation of homogeneity tests of the odds ratio under various sample size configurations. Biometrics, 45(1), 171–181. doi:10.2307/2532043
- Kulinskaya, E., Dollinger, M. B., & Bjørkestøl, K. (2011). Testing for homogeneity in meta-analysis. I. The one parameter case: Standardized mean difference. Biometrics, 67, 203–212. doi:10.1111/j.1541-0420.2010.01442.x
- Liang, K. Y., & Self, S. G. (1985). Tests of homogeneity of odds ratio when the data are sparse. Biometrika, 72, 353–358. doi:10.1093/biomet/72.2.353
- Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
- McCulloch, C. E., Searle, S. R., & Neuhaus, J. M. (2008). Generalized, linear, and mixed models (2nd ed.). New Jersey: Wiley.
- Mzolo, T., Hendriks, M., & van den Heuvel, E. (2013). A comparison of statistical methods for combining relative bioactivities from parallel line bioassays. Pharmaceutical Statistics, 12(6), 375–384.
- Pasipanodya, J. G., Srivastava, S., & Gumbo, T. (2012). Meta-analysis of clinical studies supports the pharmacokinetic variability hypothesis for acquired drug resistance and failure of antituberculosis therapy. Clinical Infectious Diseases, 55(2), 169–177. doi:10.1093/cid/cis353
- Paul, S. R., & Donner, A. (1992). Small sample performance of tests of homogeneity of odds ratios in K 2 × 2 tables. Statistics in Medicine, 11(2), 159–165. doi:10.1002/sim.4780110203
- Sánchez-Meca, J., & Marín-Martínez, F. (1997). Homogeneity tests in meta-analysis: A Monte Carlo comparison of statistical power and Type I error. Quality and Quantity, 31, 385–399. doi:10.1023/A:1004298118485
- Skellam, J. G. (1948). A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. Journal of the Royal Statistical Society Series B (Methodological), 10(2), 257–261.
- Sutton, A. J., & Higgins, J. P. (2008). Recent developments in meta-analysis. Statistics in Medicine, 27(5), 625–650. doi:10.1002/sim.2934
- Tarone, R. E. (1985). On heterogeneity tests based on efficient scores. Biometrika, 72, 91–95. doi:10.1093/biomet/72.1.91
- Verbeke, G., & Molenberghs, G. (2003). The use of score tests for inference on variance components. Biometrics, 59, 254–262. doi:10.1111/1541-0420.00032
- Wolfinger, R., & O’Connell, M. (1993). Generalized linear mixed models: A pseudo-likelihood approach. Journal of Statistical Computation and Simulation, 48, 233–243. doi:10.1080/00949659308811554
- Woolf, B. (1955). On estimating the relationship between blood group and disease. Annals of Human Genetics, 19, 251–253. doi:10.1111/j.1469-1809.1955.tb01348.x
- Yusuf, S., Peto, R., Lewis, J., Collins, R., & Sleight, P. (1986). Beta blockade during and after myocardial infarction: An overview of the randomized trials. Progress in Cardiovascular Diseases, 27(5), 335–371. doi:10.1016/S0033-0620(85)80003-7
- Zelen, M. (1971). The analysis of several 2 × 2 contingency tables. Biometrika, 58, 129–137.