Commentary

Imbalance p values for baseline covariates in randomized controlled trials: a last resort for the use of p values? A pro and contra debate

Pages 531-535 | Published online: 08 May 2018

Abstract

Background

Results of randomized controlled trials (RCTs) are usually accompanied by a table that compares covariates between the study groups at baseline. Sometimes, the investigators report p values for imbalanced covariates. The aim of this debate is to illustrate the pro and contra of the use of these p values in RCTs.

Pro

Low p values can be a sign of biased or fraudulent randomization and can be used as a warning sign. They can be considered as a screening tool with low positive-predictive value. Low p values should prompt us to ask for the reasons and for potential consequences, especially in combination with hints of methodological problems.

Contra

A fair randomization produces the expectation that the distribution of p values follows a flat distribution. It does not produce an expectation related to a single p value. The distribution of p values in RCTs can be influenced by the correlation among covariates, differential misclassification or differential mismeasurement of baseline covariates. Given only a small number of reported p values in the reports of RCTs, judging whether the realized p value distribution is, indeed, a flat distribution becomes difficult. If p values ≤0.005 or ≥0.995 were used as a sign of alarm, the false-positive rate would be 5.0% if randomization was done correctly, and five p values per RCT were reported.

Conclusion

Use of a low p value as a warning sign that randomization is potentially biased can be considered a vague heuristic. The authors of this debate differ in their enthusiasm for this heuristic and in the consequences they propose.

Introduction

Since its introduction into the biomedical literature, null hypothesis significance testing (NHST) has caused much debate [1–4]. Despite many cautions, NHST remains one of the most prevalent statistical procedures in the biomedical literature [5,6]. In 2016, Greenland et al reviewed 25 misinterpretations of NHST, p values, CIs, and power [7], and recently the American Statistical Association released a policy statement on statistical significance and p values, which includes the following: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p≤0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.” [8]

Given the many warnings and misuses of NHST, it is unclear in which situations NHST can play a relevant role in the biomedical and epidemiologic literature. Here, we focus on the use of p values to assess imbalances of baseline covariates between study groups of a randomized controlled trial (RCT). In 1990, Greenland summarized the advantages of randomization as follows: 1) it makes estimates of effect “statistically unbiased, in that the statistical expectation (average) of the estimate over the possible results equals the true value” and 2) “it provides a known probability distribution for the possible results under a specified hypothesis about the treatment effect” [9].

Table 1 of results from an RCT usually presents baseline characteristics of included patients by treatment groups. These tables are sometimes accompanied by p values that are associated with the statistical null hypothesis of no baseline imbalances of covariates between the treatment groups (called “covariate imbalance p value”, for the remainder abbreviated as CIP). If randomization was done properly, it can be expected that any baseline difference between treatment groups is solely due to chance. Epistemologically, it appears to be a paradox to test the null hypothesis of no imbalance if the mechanism that produced the covariate distributions of the treatment groups was a chance mechanism, that is, randomization. A valid randomization produces a flat distribution of the CIPs with p values ≤0.05 to be expected with a relative frequency of 5%. The aim of this debate is to illustrate the pro and contra of the use of CIPs for baseline covariates in RCTs.
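
The expectation of a flat distribution can be illustrated with a small simulation. The following is a minimal sketch, not part of the original debate: the covariate (a standard normal variable), the arm size of 100, the number of simulated trials, and the large-sample z test are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_cip(n_per_arm, rng):
    """Return one covariate imbalance p value (CIP) from a fairly randomized trial."""
    # Hypothetical continuous baseline covariate, identical in distribution for all patients.
    values = [rng.gauss(0.0, 1.0) for _ in range(2 * n_per_arm)]
    rng.shuffle(values)  # fair randomization: allocation ignores the covariate
    arm_a, arm_b = values[:n_per_arm], values[n_per_arm:]
    mean_a = sum(arm_a) / n_per_arm
    mean_b = sum(arm_b) / n_per_arm
    var_a = sum((x - mean_a) ** 2 for x in arm_a) / (n_per_arm - 1)
    var_b = sum((x - mean_b) ** 2 for x in arm_b) / (n_per_arm - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_per_arm + var_b / n_per_arm)
    return two_sided_p(z)


def share_at_or_below(p_values, threshold):
    """Relative frequency of p values at or below the threshold."""
    return sum(p <= threshold for p in p_values) / len(p_values)


rng = random.Random(2018)
cips = [simulate_cip(n_per_arm=100, rng=rng) for _ in range(2000)]
for threshold in (0.05, 0.25, 0.50, 0.75):
    print(f"share of CIPs <= {threshold:.2f}: {share_at_or_below(cips, threshold):.3f}")
```

Under these assumptions, the printed shares should lie close to the thresholds themselves, which is what a flat distribution means. The statement concerns many CIPs taken together and not any single one.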

Argument for the use of CIPs (Baethge)

For all its futility, criticism of NHST seems to be successful in one respect: it has become the norm not to report p values in table 1 of an RCT paper. A steady stream of literature discouraged NHST for baseline differences [10–13], culminating in the explanation and elaboration document of the Consolidated Standards of Reporting Trials (CONSORT): “Unfortunately significance tests of baseline differences are still common […]. Tests of baseline differences are not necessarily wrong, just illogical. […]. Such hypothesis testing is superfluous and can mislead investigators and their readers.” [14] So, CIP is bad practice and a sign that authors have no understanding of NHST and RCTs. Or have they?

The argument goes that if randomization went correctly, any distribution of variables among groups results from chance. But how can we be sure that randomization was correct? By a meticulous description of trial conduct, as CONSORT requires? Unfortunately, many authors do not comply: allocation concealment was adequately reported in merely 45% of all trials in journals endorsing CONSORT and in 22% of trials in journals not endorsing CONSORT [15]. Even if a trial looks good on paper, systematic error or fraud cannot be ruled out. Fanelli has meta-analytically estimated that 1 in 50 scientists self-reported to have “fabricated, falsified or modified data or results at least once” [16].

In fact, the literature provides many examples where low CIPs were a sign of scientific misconduct. Carlisle showed that in numerous RCTs by Yoshitaka Fujii (with 183 retractions, the frontrunner of Retraction Watch’s “Leaderboard”), the CIPs were below 0.0001 [17]. George et al found a p value of 1.9×10⁻¹⁷ regarding baseline weight in a “randomized” trial that was retracted later [18]. Even p values of almost 1 or exactly 1 can attract attention: they may be indicative of an improbable lack of variance. Kunz et al found p values of 0.997 or 0.988 too good to be true in the COOPERATE study, which was retracted in 2009 [19]. In a historical case, Fisher calculated chi-square statistics from Gregor Mendel’s publication in 1866, arrived at p values above 0.999, and concluded that Mendel had cheated [20], a controversial claim. But it is undisputed that Mendel’s results were biased [20].

The p values alone cannot distinguish the reasons for baseline imbalances: chance or bias, including fraud. This, however, should not lead to discarding the p value as a warning sign. It is a screening tool with low positive-predictive value, the way fecal occult blood testing is a screening tool for colorectal cancer. Here is an example [21]. In a paper reporting an RCT on a modified cesarean section, the authors provided baseline characteristics of intervention and control groups that, when we recalculated the p values, were suggestive of bias, eg, for educational status. Under the assumption that randomization was correct in this trial, one would expect an imbalance between the two groups as large as the one documented for educational status (or an even larger imbalance) with a probability of 0.00016. When we raised this point in a letter to the editor, the authors replied that parents were asked to participate not only before randomization but also after randomization and after they knew which treatment they were to receive. Further, at the same point in the study, staff were asked to participate [22]. This approach is different from the ethical imperative of allowing patients to withdraw their consent at any time. While initially the study used randomization, this approach introduces the strong possibility of allocation bias. In fact, it is a plausible explanation for baseline imbalance.

Low CIPs, therefore, should prompt us to ask for the reasons and for potential consequences: Can the way the trial was conducted explain the imbalance? Is the imbalance of prognostic importance and should it be adjusted for (advisable only when the extent of the imbalance is clinically relevant)? As often in medicine, there cannot be a hard-and-fast rule of when to dig deeper into CIPs, but very low CIPs (eg, below 0.005) and hints of methodological problems should certainly give pause for thought.

With regard to scientific misconduct, low CIPs as a screening tool should eventually give way to better means to detect fraud [23–25]. There are also more sophisticated methods of screening for fraud, eg, Carlisle’s method [17], but they are more difficult to apply. While statistically minded readers themselves can often calculate the p values from table 1, I fear this will not usually happen. Presenting CIPs may create a stronger incentive to discuss bias, or even worse, potential fraud. The way it is now, imbalances are often not discussed (eg, Schramm et al [26]). In contrast to other applications of NHST and under the assumption of sufficient study size, CIPs are what we need in evaluating randomization: the probability of the observed or a stronger baseline covariate imbalance if chance was the only explanation. It seems odd that the p values have become outlawed in precisely one of the few places where they should have a role.
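
How a reader might recalculate a CIP from the counts in a table 1 can be sketched as follows. This is only an illustration: the counts are hypothetical and are not taken from any trial discussed here, the function name is made up, and the Pearson chi-square test without continuity correction is one possible choice of test among several.

```python
import math


def chi_square_2x2_p(a, b, c, d):
    """Pearson chi-square statistic and p value (1 df, no continuity correction)
    for the table [[a, b], [c, d]]: rows are trial arms, columns are
    covariate present / covariate absent. All four margins must be non-zero."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical table 1 row: 30 of 100 patients in one arm and 50 of 100 in the
# other arm have the covariate of interest.
chi2, p = chi_square_2x2_p(30, 70, 50, 50)
print(f"chi-square = {chi2:.2f}, CIP = {p:.4f}")
```

For small cell counts, an exact test would be the safer choice. The sketch only shows that the calculation needs nothing beyond the numbers that a table 1 usually reports.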

Arguments against the use of CIPs in RCTs (Stang)

A fair randomization produces the expectation that the distribution (sic) of CIPs follows a flat distribution. It does not produce an expectation related to a single CIP. In contrast, Baethge does not use the distributional expectation, but uses an expectation related to single CIPs.

Furthermore, the distribution of CIPs in RCTs can be influenced by three factors, so that the expectation of a flat distribution of CIPs is not met anymore. First, the CIP distribution becomes distorted if the baseline covariates for which the p values are calculated are correlated with each other. For example, Flaherty et al presented the baseline characteristics of 322 randomized patients with metastatic melanoma with a BRAF mutation who received either trametinib or chemotherapy. They presented the percentage of “disease at ≥3 sites” and the percentage of “history of brain metastasis” as 57% versus 52% and 4% versus 2%, respectively, for the two treatment groups without p values. These two baseline characteristics are associated with each other. Patients with disease at ≥3 sites have a higher probability of having a history of brain metastasis than patients with disease at <3 sites [27]. Second, unblinded study teams of RCTs can produce differential misclassification or differential mismeasurement of baseline covariates. This differential bias also influences the distribution of CIPs. Third, the median number of CIPs presented in the tables of published RCTs is five, which makes the study of the CIP distribution quite unreliable [12]. What does the reader learn about the distribution of CIPs if table 1 of the published RCT contains only a few CIPs, one of them being below 0.005? A handful of CIPs in table 1 is not sufficient evidence for judging whether the realized CIP distribution in an RCT is, indeed, a flat distribution. At best, investigators would present as many CIPs as possible, graphically illustrated in a supplementary figure.
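
The first point can be made concrete with a simulation. The sketch below rests on assumptions that are not in the original text: two standard normal covariates with a hypothetical correlation `rho`, 50 patients per arm, a large-sample z test, and made-up function names. It estimates how often both CIPs of a fairly randomized trial are ≤0.05.

```python
import math
import random


def cip(arm_a, arm_b):
    """Two-sided p value for a difference in means (large-sample z test)."""
    n_a, n_b = len(arm_a), len(arm_b)
    mean_a, mean_b = sum(arm_a) / n_a, sum(arm_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in arm_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in arm_b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))


def joint_alarm_rate(rho, n_trials, n_per_arm, rng):
    """Share of fairly randomized trials in which BOTH covariates have a CIP <= 0.05."""
    both = 0
    for _ in range(n_trials):
        first, second = [], []
        for _ in range(2 * n_per_arm):
            x = rng.gauss(0.0, 1.0)
            # Second covariate correlated with the first (correlation rho).
            y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
            first.append(x)
            second.append(y)
        # Patients are generated in random order, so splitting the list in half
        # is a fair random allocation to two arms.
        p_first = cip(first[:n_per_arm], first[n_per_arm:])
        p_second = cip(second[:n_per_arm], second[n_per_arm:])
        both += p_first <= 0.05 and p_second <= 0.05
    return both / n_trials


rng = random.Random(27)
independent = joint_alarm_rate(rho=0.0, n_trials=3000, n_per_arm=50, rng=rng)
correlated = joint_alarm_rate(rho=0.9, n_trials=3000, n_per_arm=50, rng=rng)
print(f"both CIPs <= 0.05, uncorrelated covariates: {independent:.4f}")
print(f"both CIPs <= 0.05, correlated covariates:   {correlated:.4f}")
```

With uncorrelated covariates, the joint rate should be near 0.05 × 0.05 = 0.0025. With strongly correlated covariates, low CIPs tend to appear together, so the set of CIPs within one trial no longer behaves like independent draws from a flat distribution, even though randomization was fair.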

In addition, it is unclear to me what Baethge’s approach implies for CIPs between 0.005 and 0.995. Can they be considered as an all-clear signal?

A brief review of the RCTs published in the New England Journal of Medicine, Lancet, and JAMA (Medline search: randomized controlled trial [publication type] AND [“JAMA” {journal} OR “N Engl J Med” {journal} OR “Lancet” {journal}] AND 2017/05:2017/06 [dp]) for the months May and June 2017 identified 57 published RCTs. With the exception of one RCT, all RCTs refrained from providing CIPs in table 1. Overall, 27 (47%) of the RCT papers provided a statement about the presence of any statistically significant imbalance and reported only those CIPs that were significant at α=0.05. Eight out of these 27 RCT papers found statistically significant differences for at least one baseline covariate. Another 18 RCT papers (32%) only gave qualitative statements about imbalances at baseline. Interestingly, 19% did not provide any statement about baseline imbalances. Only one paper actually reported CIPs for all baseline covariates presented in table 1. This mini review shows that CIPs are only rarely presented nowadays in RCT publications of top medical journals. Obviously, for the use of the proportion of CIPs being ≤0.005 as a quality control measure of the randomization, a substantial number of p values should be presented to learn anything about the CIP distribution. If five p values are published per RCT [12] and a p value of ≤0.005 is interpreted as a warning, then the false-positive rate is 2.5% for studies where randomization has been properly performed. This rate increases to 5.0% if one also interprets p values ≥0.995 as a warning.
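
The false-positive arithmetic can be reproduced in a few lines. The sketch assumes that the reported p values are independent and uniformly distributed, which is an idealization given the correlation among covariates discussed earlier. It uses the five reported p values from the abstract's example, and the function name is hypothetical.

```python
def false_alarm_rate(alarm_region, n_cips):
    """Probability that at least one of n_cips independent, uniformly distributed
    p values falls into an alarm region of total size alarm_region."""
    return 1 - (1 - alarm_region) ** n_cips


n_cips = 5  # number of reported CIPs, as in the abstract's example
low_only = false_alarm_rate(0.005, n_cips)            # alarm if p <= 0.005
both_tails = false_alarm_rate(0.005 + 0.005, n_cips)  # alarm if p <= 0.005 or p >= 0.995
print(f"low p values only: {low_only:.1%}")
print(f"low or high p values: {both_tails:.1%}")
```

Changing `n_cips` shows how quickly the rate of false alarms grows with the number of reported p values, even when every randomization is correct.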

Conclusion

The study of CIPs in RCTs to detect potential bias related to the random assignment of the treatments may be called a heuristic (“rule of thumb”). According to the Cambridge Dictionary of Philosophy, a heuristic is defined as “A rule adopted to reduce the complexity of tasks; a heuristic may not reach a solution even if there is one, or may provide an incorrect answer (as opposed to an algorithm, ie, mental short cut).” [28] The study of the distribution of CIPs in RCTs does not reach a solution, or it may even provide an incorrect answer in case of correlated baseline covariates or differential misclassification or differential mismeasurement of baseline covariates. Besides these theoretical objections, the heuristic is hampered by the fact that tables of RCTs usually contain only a few covariates, so that a distribution of CIPs is hard to study. One is left with a kind of “cherry picking” of very low CIPs when they are reported. The authors of this debate are obviously more (Baethge) or less (Stang) enthusiastic, with the former advocating the presentation of CIPs, their careful use as a screening tool, and their interpretation within the context of each study, while the latter emphasizes the dangers of misuse. The aim of this debate was to further trigger the discussion of the role of NHST in biomedical research that uses randomization.

Statistical theory teaches us that randomization produces balance of baseline covariates in the long run, that is, over an infinite series of RCTs, but not necessarily in a single RCT. Therefore, a baseline imbalance of a prognostic factor in a single RCT due to chance is not a sign of bias. However, if chance produces imbalance of covariates, investigators should consequently adjust for baseline imbalances, as imbalances by chance also produce mixing of effects [29].

Our debate is centered on the appropriateness of p values as a screening tool for imbalanced baseline covariates. It is noteworthy that other, more elaborate approaches have been proposed for the investigation of [23–25] and the adjustment for baseline imbalances, for example, propensity scores (PS). The individual PS refers to the probability for a subject in the study of being assigned to intervention arm A rather than intervention arm B, given the patient’s characteristics at baseline. Leyrat et al proposed a c-statistic of the PS model to detect global baseline covariate imbalance in cluster RCTs. In the absence of baseline imbalance, the c-statistic of the PS model is expected to be close to 0.5. In the presence of imbalance, this c-statistic will be larger than 0.5. This procedure is still being tested and there remain unresolved questions in dealing with this procedure. For example, it is not clear how large the c-statistic has to be to decide that a relevant baseline imbalance is present [30].
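
The behavior of the c-statistic can be sketched in a deliberately simplified setting. This is not the procedure of Leyrat et al: it assumes an individually randomized trial with a single hypothetical baseline covariate, and it uses that covariate as a stand-in for the PS, which is reasonable here because the PS from a one-covariate logistic model is a monotone function of the covariate. The c-statistic is therefore folded so that it is never below 0.5. All names and numbers are illustrative.

```python
import random


def c_statistic(scores_a, scores_b):
    """Probability that a randomly chosen arm-A patient has a higher score than a
    randomly chosen arm-B patient, counting ties as one half."""
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


def folded_c(scores_a, scores_b):
    """c-statistic of a fitted one-covariate PS model: the direction of the
    association is estimated from the data, so the value is at least 0.5."""
    c = c_statistic(scores_a, scores_b)
    return max(c, 1 - c)


rng = random.Random(30)
n_per_arm = 200

# Fair randomization: both arms come from the same covariate distribution.
balanced = folded_c(
    [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)],
    [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)],
)

# Hypothetical biased allocation: arm A is shifted by one standard deviation.
imbalanced = folded_c(
    [rng.gauss(1.0, 1.0) for _ in range(n_per_arm)],
    [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)],
)

print(f"c-statistic, balanced arms:   {balanced:.3f}")
print(f"c-statistic, imbalanced arms: {imbalanced:.3f}")
```

Even with perfectly fair allocation, the folded value lies somewhat above 0.5 in a finite sample, which illustrates why a threshold for a relevant imbalance is hard to fix.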

For the detection and judgment of imbalances between the study groups, it remains important that descriptive statistics of the groups (categorical characteristics: percentage values; continuous characteristics: eg, mean values and SDs) are presented. Whether a baseline imbalance is meaningful or not depends on subject matter knowledge. For example, it is clinically relevant in a stroke prevention study if 30% diabetics are in one arm of the study and only 15% are diabetics in the other arm, regardless of the p value, as diabetes mellitus is a very relevant risk factor for stroke.

Disclosure

The authors report no conflicts of interest in this work.

References

  1. Boring EG. Mathematical vs. scientific significance. Psychol Bull. 1919;15:335–338.
  2. Hogben LT. Statistical Theory: An Examination of the Contemporary Crisis in Statistical Theory From a Behaviourist Viewpoint. London: George Allen & Unwin; 1957.
  3. Morrison DE, Henkel RE. The Significance Test Controversy: A Reader. Chicago, IL, USA: Aldine Pub; 1970.
  4. Cohen J. The earth is round (p<0.05). Am Psychol. 1994;49(12):997–1003.
  5. Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of reporting P-values in the Biomedical Literature, 1990–2015. JAMA. 2016;315(11):1141–1148. PMID: 26978209.
  6. Stang A, Deckert M, Poole C, Rothman KJ. Statistical inference in abstracts of major medical and epidemiology journals 1975–2014: a systematic review. Eur J Epidemiol. 2017;32(1):21–29. PMID: 27858205.
  7. Greenland S, Senn SJ, Rothman KJ. Statistical tests, P-values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337–350. PMID: 27209009.
  8. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Statistician. 2016;70(2):129–133.
  9. Greenland S. Randomization, statistics, and causal inference. Epidemiology. 1990;1(6):421–429. PMID: 2090279.
  10. Altman DG, Dore CJ. Baseline comparisons in clinical trials. Lancet. 1990;335(8682):149–153. PMID: 1967441.
  11. Senn S. Testing for baseline balance in clinical trials. Stat Med. 1994;13(17):1715–1726. PMID: 7997705.
  12. Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet. 2000;355(9209):1064–1069. PMID: 10744093.
  13. Knol MJ, Groenwold RH, Grobbee DE. P-values in baseline tables of randomised controlled trials are inappropriate but still common in high impact journals. Eur J Prev Cardiol. 2012;19(2):231–232. PMID: 22512015.
  14. Moher D, Hopewell S, Schulz KF. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c869. PMID: 20332511.
  15. Turner L, Shamseer L, Altman DG. Consolidated standards of reporting trials (CONSORT) and the completeness of reporting of randomised controlled trials (RCTs) published in medical journals. Cochrane Database Syst Rev. 2012;11:MR000030. PMID: 23152285.
  16. Fanelli D. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS One. 2009;4(5):e5738. PMID: 19478950.
  17. Carlisle JB. The analysis of 168 randomised controlled trials to test data integrity. Anaesthesia. 2012;67(5):521–537. PMID: 22404311.
  18. George BJ, Brown AW, Allison DB. Errors in statistical analysis and questionable randomization lead to unreliable conclusions. J Paramed Sci. 2015;6(3):153–154. PMID: 26949506.
  19. Kunz R, Wolbers M, Glass T, Mann JF. The COOPERATE trial: a letter of concern. Lancet. 2008;371(9624):1575–1576. PMID: 18468534.
  20. Pires AM, Branco JA. A statistical model to explain the Mendel-Fisher controversy. Stat Sci. 2010;25(4):545–565.
  21. Baethge C, Blettner M, Friese K. Armbrust2015: randomization questionable. J Matern Fetal Neonatal Med. 2016;29(22):3730–3731. PMID: 26828764.
  22. Armbrust R, Henrich W. Re: the Charite cesarean birth: a family-orientated approach of cesarean section. J Matern Fetal Neonatal Med. 2017;30(1):43–45. PMID: 27161576.
  23. Buyse M, George SL, Evans S. The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Stat Med. 1999;18(24):3435–3451. PMID: 10611617.
  24. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267–270. PMID: 16052019.
  25. van den Bor RM, Vaessen PWJ, Oosterman BJ, Zuithoff NPA, Grobbee DE, Roes KCB. A computationally simple central monitoring procedure, effectively applied to empirical trial data with known fraud. J Clin Epidemiol. 2017;87:59–69. PMID: 28412468.
  26. Schramm E, Kriston L, Zobel I. Effect of disorder-specific vs nonspecific psychotherapy for chronic depression: a randomized clinical trial. JAMA Psychiatry. 2017;74(3):233–242. PMID: 28146251.
  27. Flaherty KT, Robert C, Hersey P; METRIC Study Group. Improved survival with MEK inhibition in BRAF-mutated melanoma. N Engl J Med. 2012;367(2):107–114. PMID: 22663011.
  28. Audi R. The Cambridge Dictionary of Philosophy. 2nd ed. Cambridge: Cambridge University Press; 1999.
  29. Rothman KJ. Epidemiologic methods in clinical trials. Cancer. 1977;39(4 Suppl):1771–1775. PMID: 322841.
  30. Leyrat C, Caille A, Foucher Y, Giraudeau B. Propensity score to detect baseline imbalance in cluster randomized trials: the role of the c-statistic. BMC Med Res Methodol. 2016;16:9. PMID: 26801083.