Special Section on Roles of Hypothesis Testing, p-Values, and Decision-Making in Biopharmaceutical Research

p-Values and Replicability: A Commentary on “The Role of p-Values in Judging the Strength of Evidence and Realistic Replication Expectations”

Pages 36-39 | Received 05 Jul 2020, Accepted 31 Jul 2020, Published online: 19 Feb 2021

We want to congratulate Gibson on a well-articulated and illuminating article. Gibson (2020) addressed the so-called replication crisis in scientific research and argued that the crisis was to a large extent the result of excessive optimism based on unknowingly (and sometimes knowingly) overstated evidence. While some researchers have proposed to solve the perceived crisis by abandoning the concept of statistical significance or eliminating the use of p-values altogether, Gibson (2020) cautioned that such measures would not actually solve the crisis.

In our opinion, a p-value does what it was designed to do. For a single hypothesis test, it indicates how likely one would be, under the null hypothesis, to observe a result as extreme as or more extreme than the one actually observed. It is a single conditional probability that makes no allowance for multiplicity or selection.

Gibson (2020) acknowledged that p-values and other inferential tools (such as confidence intervals and Bayes factors) are all prone to misuse. We feel that statisticians should take some responsibility for the misuse. Generations of researchers have been taught p-values and confidence intervals, many of them by statisticians, in a simple, mechanical way without adequate emphasis on the context in which these summary statistics are created and should be used. There may not have been adequate discussion of how issues such as multiplicity and selection affect the interpretation of these summary statistics. While mathematical complexity may have been an excuse for not doing so in the early days, the effects can nowadays easily be illustrated with simple computer simulations and graphical displays to warn of their impact. In addition, unified calls for, and rigorous practice of, documented transparency about the selection and number of hypotheses being tested could go a long way toward enabling the scientific community to evaluate the worthiness of reported results, particularly if no adjustment is made for the effect of multiplicity and/or selection.
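
As a concrete example of the kind of simulation mentioned above, the following minimal sketch (our own illustration, not part of the original article; all numbers are hypothetical) shows how multiplicity inflates the chance of at least one small p-value, and how selecting the most extreme result biases the reported effect, even when every null hypothesis is true.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)
n_sim, m, n = 10_000, 20, 50          # simulated studies, tests per study, per-arm sample size

any_significant = 0
selected_effects = []
for _ in range(n_sim):
    # m endpoints, all with true effect 0 (the null is true for every test)
    x = rng.normal(0.0, 1.0, size=(m, n))
    y = rng.normal(0.0, 1.0, size=(m, n))
    t, p = stats.ttest_ind(x, y, axis=1)
    any_significant += np.any(p < 0.05)
    # "selection": report only the endpoint with the smallest p-value
    best = np.argmin(p)
    selected_effects.append(x[best].mean() - y[best].mean())

print(f"P(at least one p<0.05 among {m} null tests): {any_significant / n_sim:.2f}")
print(f"Theoretical value 1 - 0.95^{m}: {1 - 0.95**m:.2f}")
print(f"Mean |selected effect| (true effect is 0): {np.mean(np.abs(selected_effects)):.2f}")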

Having said the above, we are also cognizant of the fact that some practitioners knowingly ignore the rules in their desire to publish selected positive findings to advance their own agenda. Indeed, misuse can occur despite the best statistical tutoring and well-publicized sets of guiding principles. The question is: does the fear of misuse justify the abolition of p-values? We think not, especially if transparency and appropriate checks are put in place as conditions of publication. In this commentary, we discuss the relationship of p-values to replicability, which Benjamini (2020) defined as getting essentially the same results when the entire study, from enlisting subjects through collecting data to analyzing the results, is replicated in a similar but not necessarily identical way. We will add some points to the ongoing dialogue. Some of the points have been touched on by Gibson. Others highlight select advances in the statistical literature in recent years.

1 The Diagnostic Test Analogy

For some time now, we have been promoting the idea of regarding a clinical trial as a diagnostic test (Chuang-Stein and Kirby 2017, chap. 4). The condition that a trial is to diagnose is the presence of a treatment effect of a certain magnitude Δ. Under this analogy, what we call “statistical power” for detecting Δ becomes the “sensitivity” of the test. The probability of a true negative under the hypothesis testing paradigm is the specificity of the test, which is 1 minus the Type I error rate.

Why do we want to draw this parallel? The primary reason is to take advantage of the scientific community’s familiarity with issues surrounding diagnostic tests and a general acceptance of the need to use a series of diagnostic tests to make a definitive diagnosis of a serious condition.

In the diagnostic world, we are interested in the positive predictive value (PPV) of a diagnostic test. The PPV is the probability that the condition of interest is present given that the test returns a positive result; it can be derived from the sensitivity and specificity of the test, together with the prevalence rate of the condition in the population, using Bayes’ rule. Similarly, a product developer should be interested in the probability that the product candidate indeed has a certain effect if a trial produces a positive result. The same can be said of regulators.

Suppose we run a small trial (e.g., a proof-of-concept trial) with a Type I error rate of 10% and 80% power to detect an effect of Δ. Suppose also that past experience suggests a 10% success rate among product candidates investigated for the disorder. A Bayes calculation gives a PPV of 47%, that is, a 47% chance that the candidate has the desired effect. This probability is less than 50%! The PPV would be 67% if the trial had a Type I error rate of 5% and 90% power under the same assumption of a 10% success rate. While 67% is higher than 47%, it is nowhere close to a guaranteed replication of the previous positive finding.
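
The following minimal sketch (our own illustration) reproduces the Bayes calculation behind these PPV figures, treating power as sensitivity, 1 minus the Type I error rate as specificity, and the historical success rate as the prevalence.

def ppv(power, alpha, prior):
    """Probability the effect is real given a positive trial, by Bayes' rule."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

# Numbers from the text above
print(f"alpha = 0.10, power = 0.80, prior = 0.10: PPV = {ppv(0.80, 0.10, 0.10):.2f}")  # 0.47
print(f"alpha = 0.05, power = 0.90, prior = 0.10: PPV = {ppv(0.90, 0.05, 0.10):.2f}")  # 0.67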

When a diagnostic test for a condition is applied in a population with a low prevalence rate, the PPV is usually low. An example is using a skin test to check for tuberculosis infection among young students who have been admitted to college. In this case, a second test of higher quality is typically needed to verify an initial positive finding. The need for a second test, and the general expectation of a negative retest, is well understood in the skin test example. Why, then, as a society, do we expect so strongly to replicate the positive finding from a first study, especially when the trial is small with limited performance quality in a field with a traditionally low success rate? Amrhein, Trafimow, and Greenland (2019) put it well in the article referenced by Gibson: “there is no replication crisis if we don’t expect replications.”

We want to point out that when testing the effectiveness of an anti-bacterial agent, a positive first trial is highly predictive of subsequent successes. This is due to the availability of highly predictive animal models and the use of the minimum inhibitory concentration (MIC) to guide dose selection (Chuang-Stein, Heft, and Koury 2005). Unfortunately, many of the anti-bacterials found to be efficacious in earlier trials failed to become commercial products due to safety concerns.

2 Switching to Different Statistics?

Proponents of banning p-values have suggested reporting results using other statistics. A favorite alternative is a confidence interval for the parameter of interest. For a single parameter of interest, many consider confidence intervals more informative than p-values because they convey two pieces of vital information instead of one. Confidence intervals are constructed with coverage probabilities in mind. When dealing with multiple parameters originating from multiple comparisons, what can one say about the joint coverage probability of the multiple confidence intervals if the intervals are constructed individually based on the marginal distributions of the estimates?

As Benjamini (2020) states, it is easy for our eyes to be drawn to the extreme observations. If confidence intervals are constructed in the usual manner, without making any allowance for multiplicity or selective inference, the interval based on the extreme observations will convey the wrong message about the likely range of the corresponding parameter. This suggests that reporting confidence intervals will solve neither the multiplicity issue nor the selective inference issue when these issues exist.
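
A minimal simulation sketch (our own illustration, with hypothetical numbers) of the two problems just described: individually constructed 95% intervals have much lower joint coverage across several parameters, and the interval around the most extreme estimate misses its own parameter far more often than 5% of the time.

import numpy as np

rng = np.random.default_rng(1)
n_sim, k, se = 20_000, 10, 1.0
true_effects = np.zeros(k)            # all true effects equal (here, zero)
z = 1.96                              # half-width multiplier for a 95% interval

joint_cover = 0
extreme_cover = 0
for _ in range(n_sim):
    est = rng.normal(true_effects, se)          # k independent estimates
    covered = np.abs(est - true_effects) <= z * se
    joint_cover += covered.all()
    sel = np.argmax(np.abs(est))                # eyes drawn to the most extreme one
    extreme_cover += covered[sel]

print(f"Joint coverage of {k} individual 95% CIs: {joint_cover / n_sim:.2f}")          # about 0.95^10 = 0.60
print(f"Coverage of the CI around the most extreme estimate: {extreme_cover / n_sim:.2f}")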

3 Selective Inference

It is imperative that the scientific community (especially statisticians) have a deeper understanding of selection bias. Gibson (2020) refers briefly to discounting effects because of selection. In practice, it is not always clear how to do this. In our opinion this is an area that could benefit from much more research (see, e.g., Chuang-Stein and Kirby 2017, chap. 12; Kirby, Li, and Chuang-Stein 2020).

Gibson gave two examples of how the perceived strength of evidence led to subsequent actions. In the first example, not only did the follow-on trial (PRAISE-2) enroll a subgroup of patients in the first trial (PRAISE-1), but the endpoint was also changed. In the second example, the follow-on trial (ELITE-2) was performed to verify an unexpected reduction in all-cause mortality observed in the first trial (ELITE-1).

Working in the innovative pharmaceutical industry requires a heavy dose of optimism. Nevertheless, it is critical to temper that optimism with sound reasoning. Both of the above examples are from the pharmaceutical industry, where one would expect more awareness of the effects of multiplicity and selection, since selective inference has found many victims in drug development. It would be interesting to know whether, in each case, the decision to proceed was made with an awareness of the risk involved, that is, as a calculated risk.

Subgroup selection, in the era of searching for personalized medicine, presents an almost irresistible temptation. Tarenflurbil is a selective Aβ42-lowering agent that has been shown in vitro and in vivo to reduce Aβ42 production in favor of less toxic forms of Aβ. A Phase 2 trial was conducted to investigate the efficacy of two doses of tarenflurbil against placebo during a 12-month double-blind period with a 12-month extension in Alzheimer’s disease (AD) patients. The study was conducted in Canada and the UK and enrolled 210 patients between November 3, 2003 and April 24, 2006 (Wilcock et al. 2008). A Phase 3 trial was initiated in the US before all results from the Phase 2 trial were known. The Phase 3 trial initially enrolled AD patients with mild or moderate symptoms. After analyses of the full Phase 2 data suggested that patients with mild symptoms responded better to tarenflurbil, enrollment in the Phase 3 trial was restricted to patients with mild symptoms only.

Data on the 1684 patients with mild AD symptoms in the Phase 3 trial failed to show a significant effect of tarenflurbil (Green et al. 2009).

An extremely low historical success rate in developing treatments for AD means a low PPV for a positive trial in the first place. When the signal came from a subgroup, selective inference further dampened the chance of success.

4 Subgroup Selection Supported by Science and Confirmed in a Stepwise Manner

There are situations when emerging biology provides a rationale for the selection of a subgroup. Fletcher (2011) described the KRAS journey for panitumumab (Vectibix), an early story in a sponsor’s search for targeted therapy. Panitumumab is a fully human IgG2 monoclonal antibody directed against the epidermal growth factor receptor (EGFR). According to Fletcher (2011), the development program for panitumumab included three Phase 3 trials investigating panitumumab as a 1st-line, 2nd-line, and 3rd-line treatment for metastatic colorectal cancer (mCRC).

The Phase 3 trial for 3rd-line mCRC reported an estimated hazard ratio of 0.54 (p-value < 0.0001) for progression-free survival (PFS). The benefit in median PFS was about one week. Emerging basic science at the time led to the KRAS hypothesis, which postulated that tumors with KRAS mutations do not respond to anti-EGFR monoclonal antibodies. The hypothesis was first tested using results from the 3rd-line trial, where tumor KRAS status was obtained after completion of the trial by an independent lab with no knowledge of randomization or trial outcome. A statistical analysis plan to test the KRAS hypothesis was developed before the unblinding of the KRAS status. This prospective/retrospective analysis showed a large difference in the panitumumab effect by KRAS status. The p-value for testing the panitumumab-by-KRAS interaction was < 0.0001.

Following the retrospective verification of the KRAS hypothesis, Fletcher (2011) stated that the protocols for 1st and 2nd line treatment were amended to

  • Prospectively analyze the primary outcomes (PFS for 1st line and PFS/overall survival for 2nd line) by KRAS status;

  • Test patients with mutant KRAS only if results in wild-type KRAS are positive, to control the overall Type I error rate (a minimal sketch of this fixed-sequence procedure is given after this list);

  • Increase the enrollment to ensure adequate power for testing in the wild-type KRAS patients.
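
The following minimal sketch (our own illustration, not the sponsor’s actual analysis plan, with hypothetical p-values) shows the logic of such a fixed-sequence procedure: the mutant-KRAS hypothesis is tested at the full significance level only if the wild-type KRAS test is already positive, so the familywise Type I error rate remains at the nominal level.

def fixed_sequence_test(p_wild_type, p_mutant, alpha=0.05):
    """Fixed-sequence testing: test the second hypothesis only if the first is rejected."""
    reject_wild_type = p_wild_type <= alpha
    # The mutant-KRAS hypothesis is examined only after a positive wild-type result
    reject_mutant = reject_wild_type and (p_mutant <= alpha)
    return {"wild-type KRAS": reject_wild_type, "mutant KRAS": reject_mutant}

# Hypothetical p-values for illustration only
print(fixed_sequence_test(p_wild_type=0.003, p_mutant=0.21))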

The KRAS hypothesis was subsequently confirmed in the 1st line and 2nd line trials for the PFS endpoint. Patients with KRAS mutant tumors experienced no benefit (2nd line combination with chemo) or worse outcomes (1st line combination with chemo) with panitumumab.

In this example, the identification of the subgroup was based on independent scientific and biological knowledge. The verification was done in a rigorous and stepwise fashion. When there is corroborating information to support the selection of a subgroup, the probability that a positive treatment effect in the subgroup is real is substantially increased.

5 Probability of Success: Another Way to Estimate the Reproducibility Probability

Gibson (2020) reviewed the reproducibility probability discussed by researchers such as Goodman (1992), Senn (2002), and Shao and Chow (2002). The reproducibility probability is defined as the probability of replicating the statistically significant results of an initial study in a subsequent identical study conducted under the same conditions.

The earlier researchers proposed to estimate the reproducibility probability by calculating the power of the subsequent trial treating the observed treatment effect in the initial trial as if it were the true value of the parameter of interest. Under this calculation, the variability of the treatment effect estimate from the initial trial is not taken into account.

More recently, O’Hagan, Stevens, and Campbell (2005) and Chuang-Stein (2006) proposed to incorporate the uncertainty in our knowledge of the true treatment effect when estimating the likelihood that a future trial will be successful. The idea is to calculate the statistical power at various possible values of the true parameter and obtain a weighted average of these powers, using weights that reflect the likelihood of the possible parameter values. O’Hagan, Stevens, and Campbell (2005) called the resulting probability the “assurance,” while Chuang-Stein called it the probability of success (PoS). When the follow-on trial is an exact replicate of the initial trial, the PoS becomes an estimate of the reproducibility probability.
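
The following minimal sketch (our own illustration, with hypothetical estimates and standard errors) contrasts the plug-in reproducibility probability, which treats the observed effect as the true value, with the assurance/PoS, which averages power over the uncertainty about the true effect implied by the initial trial.

import numpy as np
from scipy import stats

delta_hat, se1 = 0.30, 0.12      # hypothetical estimate and standard error from the initial trial
se2 = 0.12                       # standard error of the estimate in the replicate trial
z_alpha = stats.norm.ppf(0.975)  # critical value for a two-sided 5% test in the new trial

def power(delta):
    """Power of the new trial if the true treatment effect equals delta."""
    return stats.norm.cdf(delta / se2 - z_alpha)

# Plug-in reproducibility probability: treat delta_hat as the true effect
print(f"Power at the observed estimate: {power(delta_hat):.2f}")

# Assurance / PoS: average power over a normal approximation to the
# uncertainty about the true effect from the initial trial
rng = np.random.default_rng(0)
deltas = rng.normal(delta_hat, se1, size=200_000)
print(f"Assurance (PoS): {power(deltas).mean():.2f}")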

The concept of PoS has since progressed from a single trial to a confirmatory stage, and to the technical and regulatory success of a product candidate (Chuang-Stein and Kirby 2017, chap. 9). Many examples in the statistical literature have shown that, for confirmatory trials, the PoS can be much lower than the statistical power targeted in the trial. The latter contributes to the observation that the overall Phase 3 success rate is lower than researchers have expected.

6 A Disciplined Approach

Our recommendations are similar to those offered by Gibson (2020) but worded in a slightly different way:

  1. Understand that there is no compelling reason to expect replicability, particularly if the success rate in a given area is low. Statisticians can explain this fact to non-statisticians using the diagnostic test analogy.

  2. Document how hypotheses were chosen and whether they are confirmatory or exploratory.

  3. Apply adjustments for multiplicity and selection as necessary. If needed, seek the help of a statistician well versed in the subject matter. Adjustment is needed whether one chooses to work with p-values or confidence intervals.

  4. Place results in the context of other relevant data and/or scientific theory when interpreting findings from a trial.

The last of these recommendations is in agreement with Gibson’s reference to the totality of the data, but it also brings in scientific credibility according to existing theory. The latter can be important, as evidenced by the failure of the confirmatory studies of Dimebon for the treatment of AD; there was no clear scientific rationale for the use of this antihistamine in AD (FierceBiotech Report 2010).

We would, lastly, like to emphasize the point that statisticians or others with similar training need to be present to help assess the strength of evidence when key decisions are made. In practice, it is easy for non-statisticians to be persuaded by effect estimates that look big but are due to multiplicity and/or selection.

We agree with Gibson that abandoning p-values in scientific investigations and reporting will not fundamentally solve the replication problems. It is important for statisticians to help the scientific community understand that replication is not guaranteed even in the best of times.
