Special Section on Roles of Hypothesis Testing, p-Values, and Decision-Making in Biopharmaceutical Research

Waking up to p: Comment on “The Role of p-Values in Judging the Strength of Evidence and Realistic Replication Expectations”

Pages 19-21 | Received 29 Jul 2020, Accepted 31 Jul 2020, Published online: 19 Feb 2021

I thank and commend Dr. Gibson on his thoughtful and thorough discussion regarding the role of p-values in evaluating evidence (Gibson Citation2020). Much like p-values themselves, the ASA's statement on p-values (Wasserstein and Lazar Citation2016) has been misunderstood and misused, interpreted in some circles as discouragement of, rather than education on, their use. Some journals have considered abandoning: (i) the 0.05 level of significance, (ii) frequentist methods, or (iii) hypothesis testing and the scientific method. Dr. Gibson logically lays out the important issues and provides a guide for using p-values, which will help to overcome some of these misplaced reactions. I agree with many of his points. I offer a few additional comments for consideration as part of educational efforts to improve statistical practice.

1 Irreplicability

The p-value has come under scrutiny in part because of an irreplicability pandemic, though it is not the primary culprit responsible for that pandemic. Drivers of the pandemic include failure to recognize, acknowledge, and account for multiplicity; selective analyses and reporting; ignoring the distinction between hypothesis generation and confirmation; and forming conclusions based on small studies and nonrandomized evidence. These issues affect the results and value of ANY tool that a statistician chooses to apply. The pandemic is less of an issue in clinical trials than in other scientific disciplines due to: (i) the careful attention paid to prespecification of endpoints and hypotheses, providing multiplicity context and a framework by which to control errors; (ii) tools such as blinding, standardization of measurement, and careful selection of control groups; (iii) clinical trial registration, which helps to curtail selective reporting; (iv) training and education in the clinical trial community; (v) the appreciation of randomization as the foundation for statistical inference; and (vi) the protection of the integrity of randomization through application of the intention-to-treat principle.

2 Replicability ≠ Correctness

Replicability does not imply correctness, the real goal. Replicability is only a surrogate, and perhaps a suboptimal one without the disciplined approach to research outlined above. One may imagine two observational studies that select patients in the same way and share issues of confounding and bias that cannot be accounted for by sophisticated modeling. The results of the two studies may indeed agree … and be completely wrong. In diagnostic medicine, such issues arise when the reference standard test is imperfect and the new test uses similar technology, and thus shares the reference standard's imperfections. The value of replicability is blunted in scientific investigations with less rigor.

3 Over-Reliance, Misinterpretation, and Misuse of the p-Value

There is an over-reliance on p-values by medical journal editors and in other areas of science. Misinterpretation and misuse of the p-value are common, and its limitations are well documented (Evans and Ting Citation2016). p-values are not necessary for descriptive studies. There are many instances where focusing on estimation, with less emphasis on hypothesis testing, may be prudent, enlightening, and educational. The meaning of replicability may then shift from whether hypothesis testing results agree to how well estimates agree, a concept that would have value.

4 The Silver Bullet: Improved Understanding of Uncertainty

Yet abolishing the p-value does not serve the profession or its constituents well. p-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. p-values and significance tests are among the most studied and best-understood statistical procedures. Surely there are limitations, and we should acknowledge them and implement sound methods and processes to address them. Nevertheless, the p-value is a cornerstone and an important tool for our field, one that has contributed to the success of our profession and greatly advanced science through its proper application. Much of the controversy surrounding the use of p-values can be dispelled through a better appreciation of the causes of irreplicability and an improved understanding of uncertainty. Improved education can remedy inappropriate use. Abandoning the tools will not address the underlying problem of statistical negligence. One must understand the science of data in order to advance science with data.

5 More Thoughtfulness Needed When Defining the Null Hypothesis

Researchers are often complicit in some of the crimes for which the p-value has been convicted. A null hypothesis of “no effect” is typically defined. A study is conducted, and a p-value that conditions on the null (no effect) is then reported. The p-value is then criticized for not informing us about clinical relevance. If the research question concerns clinical relevance, should not researchers define a null consistent with that goal, so that when a test is conducted, a significant p-value indeed informs regarding clinical relevance? Perhaps we should stop blaming the tool (Weinberg Citation2001; Benjamini Citation2016). Instead, we can focus attention on encouraging attentiveness to defining relevant hypotheses, while discouraging misuse of p-values that is antithetical to rigorous science.

The initial reaction to defining the null hypothesis based on clinical relevance rather than no effect may be “that is too much,” as many prioritize a positive trial over one that effectively evaluates the value of an intervention. There may be a useful middle ground where, for example, a null of no effect is tested at the 0.05 level and a null anchored at a clinically relevant effect is tested at a significance level of, for example, 0.20 (see the sketch below). The recent Food and Drug Administration (FDA) guidance on the development of vaccines to prevent COVID-19 takes a similar approach (FDA Citation2020). We spend a lot of time thinking about what the alternative hypothesis should be, yet our processes have become so automated that only infrequently do we critically think about what the null should be, despite the fact that conclusions from hypothesis testing focus on inferences regarding the null. Defining the null and alternative hypotheses is at the discretion of the study investigators. We can be more thoughtful.
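A minimal sketch of this middle ground, under a normal approximation with hypothetical numbers for the estimate, standard error, and relevance threshold (none of these come from the article), using one-sided tests for simplicity:

```python
# Sketch: test a null of no effect at a strict level and a null anchored at
# a clinically relevant threshold at a more lenient level. All numbers are
# hypothetical; a normal approximation to the effect estimate is assumed.
from scipy.stats import norm

theta_hat = 0.40   # hypothetical estimated treatment effect
se = 0.15          # hypothetical standard error
delta = 0.20       # hypothetical clinically relevant threshold

# One-sided p-value against H0: theta <= 0 (no effect).
p_no_effect = norm.sf(theta_hat / se)

# One-sided p-value against H0: theta <= delta (less than a relevant effect).
p_relevance = norm.sf((theta_hat - delta) / se)

print(f"p vs. no effect:        {p_no_effect:.4f} (test at 0.05)")
print(f"p vs. relevance margin: {p_relevance:.4f} (test at 0.20)")
print("Both criteria met:", (p_no_effect < 0.05) and (p_relevance < 0.20))
```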

6 More Thoughtfulness Needed When Defining Error Rates

Similarly, we need to be more thoughtful about what error rates are acceptable when testing hypotheses. The selection of error rates should be a conscious, calculated decision rather than a compulsory one. Consider the following: until this year, Ebola was an untreatable disease with a high mortality rate. If a trial evaluating a potential treatment for Ebola were being conducted, why would the Type I error that we are willing to accept for that trial be the same as for a trial evaluating a treatment for a nonfatal disease that has proven effective and safe treatment alternatives? In fact, for the Ebola trial, Type II error may be more important than Type I error (Evans and Ting Citation2016).

Conventions for significance thresholds, for example, p < 0.05, are sometimes necessary and helpful. Deviation from convention, or establishing a new convention, is warranted in some instances. It is noteworthy that conventions for error rates vary by discipline and by the purpose of the analyses (e.g., hypothesis generation vs. confirmation). For example, a threshold of p < 0.00000005 has become the standard for common-variant GWAS (Panagiotou, Ioannidis, and for the Genome-Wide Significance Project Citation2012), while in high-energy physics the standard threshold for “evidence of a particle” is p < 0.003 and for “discovery” is p < 0.0000003 (https://blogs.scientificamerican.com/observations/five-sigmawhats-that/). Thresholds should be explicitly defined based on study goals, carefully considering the consequences of incorrect decisions.
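As a small check (my addition, not from the article), the physics “sigma” conventions map to these p-values through the standard normal tail:

```python
# Map sigma thresholds to one-sided p-values via the standard normal tail.
from scipy.stats import norm

for sigma in (3, 5):
    one_sided = norm.sf(sigma)  # P(Z > sigma)
    print(f"{sigma} sigma: one-sided p ~ {one_sided:.2e}")
# 3 sigma ("evidence")  -> p ~ 1.3e-3 (about 0.003 two-sided)
# 5 sigma ("discovery") -> p ~ 2.9e-7 (about 0.0000003)
```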

The strict rule of developing conclusions based on a “one-size-fits-all” criterion is suboptimal. This is particularly true in trials with heavy focus on what is often a single primary endpoint. The overall value of an intervention depends on many other factors, including efficacy, safety, and quality-of-life outcomes, within the context of the same attributes of therapeutic alternatives. How can we make an informed decision about a therapy based on a single variable without considering results on these other outcomes? Methods are evolving whereby global effects on patients can be evaluated in a more meaningful way (Evans et al. Citation2015; Evans and Follmann Citation2016); a small illustration follows.
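The following is my minimal sketch of the idea behind Desirability of Outcome Ranking (DOOR), not the full published methodology: given an ordinal desirability level per patient that already combines efficacy and safety, estimate the probability that a randomly selected treatment patient has a more desirable overall outcome than a randomly selected control patient, splitting ties. The data below are hypothetical.

```python
# Sketch of a DOOR-style probability estimate; data are hypothetical.
import numpy as np

# Ordinal desirability levels (higher = more desirable), one per patient.
treatment = np.array([4, 5, 3, 5, 4, 2, 5, 4])
control   = np.array([3, 2, 4, 3, 1, 3, 2, 4])

# Compare every treatment patient with every control patient.
wins = (treatment[:, None] > control[None, :]).sum()
ties = (treatment[:, None] == control[None, :]).sum()
door_prob = (wins + 0.5 * ties) / (len(treatment) * len(control))
print(f"DOOR probability: {door_prob:.3f}  (> 0.5 favors treatment)")
```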

7 Use p-Values in Combination With Other Measures and Tools

p-values are not the be-all and end-all of statistical analysis and represent only one tool for assessing evidence. They are best used in combination with other tools (Perera, Oke, and Fanshawe Citation2020). Different measures of uncertainty complement one another, with no single measure serving all purposes. As a general rule, it is wise to provide confidence interval estimates of the effects of interest to further inform regarding the magnitude of an effect and the uncertainty with which it is estimated. Confidence intervals can be used to “rule out” potential effects with reasonable confidence. This has value in positive and negative trials that often goes underappreciated. A high p-value may indicate that we are unable to rule out, for example, negligible effects, but this does not necessarily imply that the effects are negligible, as we may be equally unable to rule out very large effects (Evans, Li, and Wei Citation2007), as the sketch below illustrates. We must acknowledge, however, that other tools can also be misinterpreted and misused, much like the p-value.
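A brief sketch of this point, with hypothetical numbers and a normal approximation (my illustration, not the authors'):

```python
# Use a confidence interval to see which effect sizes the data rule out.
from scipy.stats import norm

theta_hat, se = 0.10, 0.12      # hypothetical estimate and standard error
z = norm.ppf(0.975)             # two-sided 95% interval
lo, hi = theta_hat - z * se, theta_hat + z * se
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
# Here p > 0.05 versus a null of no effect (the CI covers 0), yet the data
# also cannot rule out effects as large as ~0.34 -- "not significant" does
# not mean "negligible".
```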

Although often not viewed as a statistical issue, providing better guidance regarding the practical importance of effect sizes will improve our understanding of the meaning of study findings (Aguinis, Vassar, and Wayant Citation2020). It does not help to report the magnitude of effect if we do not know how to interpret it in a meaningful way.

Given the importance of effect magnitude and the varying perspectives on what constitutes an important or relevant effect, we might consider reporting a standard plot of the confidence level (vertical axis) at which various effect magnitudes (horizontal axis) can be ruled out. This would provide a useful visual that enables evaluators of the evidence to assess the strength of the evidence in relation to the effect magnitude.
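One way to render such a plot (my sketch of the proposal, assuming a normal approximation and hypothetical numbers) is a confidence curve, sometimes called a p-value function:

```python
# For each candidate effect magnitude (x-axis), plot the confidence level
# at which the data rule it out (y-axis). Numbers are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

theta_hat, se = 0.40, 0.15
theta0 = np.linspace(theta_hat - 4 * se, theta_hat + 4 * se, 400)

# Two-sided p-value for each candidate effect; 1 - p is the confidence
# level at which that magnitude is excluded.
p = 2 * norm.sf(np.abs(theta_hat - theta0) / se)
plt.plot(theta0, 1 - p)
plt.xlabel("Effect magnitude ruled out")
plt.ylabel("Confidence level")
plt.title("Confidence curve (p-value function)")
plt.show()
```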

8 Alternative Measures and Approaches Have Similar Limitations

Some researchers have proposed eliminating the use of p-values, offering alternatives. Some of these alternatives are quite helpful. But market corrections to the value of some of them, which are sometimes promoted with a degree of commercialism, are necessary, as they are subject to the same, and sometimes additional, limitations as the p-value. One could replace hypothesis testing and p-values with confidence interval estimation, but this translates to trading multiple-testing issues for multiple-estimation and coverage-probability issues. Others propose using, for example, likelihood ratios or proportional odds, but these similarly need information regarding their uncertainty and defined thresholds if decisions are to be made, and they rely on assumed models. Bayesian approaches have attractions, as strong universally applied priors may increase replicability rates, though at the expense of error rates. Bayesian methods require further selection and modeling of priors and hierarchical structures, and although they are not concerned with error rates and biases, this does not imply that they avoid them; indeed, they can inflate them. The selection of design and analysis methodologies should be based on the objectives of the study and the characteristics of the context rather than on pressures to implement trending methodologies.

9 p-Values Are Continuous Measures That Are Sometimes Used to Aid Binary Decision-Making

Professorial statisticians often educate students and practitioners about the dichotomization of continuous measures, cautioning that important information can be lost. Yet p-values are interpreted, and sometimes reported, as binary statistics. Observed p-values should be reported rather than, for example, “p < 0.05” or “p > 0.05.” Although p-values of 0.06 and 0.86 each result in a failure to reject a null hypothesis when testing at the 0.05 level, they may be interpreted quite differently in light of other evidence, such as results for other outcomes. Interpretation needs context. There is room for reasonable flexibility in interpretation even when prespecified hypothesis-testing parameters have been defined. As a veteran of many FDA advisory committee meetings, I would not feel conflicted voting in favor of an intervention with a primary efficacy p-value of 0.06 in a well-designed and well-conducted trial when secondary and safety outcomes are favorable, such that the value of the intervention from a benefit:risk perspective exceeds that of treatment alternatives. Nor would I feel conflicted voting against an intervention with a p-value of 0.04 if it came from a flawed study or if the benefit:risk balance is inferior to therapeutic alternatives.

10 Conclusions

I commend Dr. Gibson for his thoughtful proposal. Through work and discussions like this, I am hopeful that we can educate and reform, bring clarity to some of the confusion, and develop principles that can put the scientific community on a good path.

References

  • Aguinis, H., Vassar, M., and Wayant, C. (2020), “On Reporting and Interpreting Statistical Significance and p Values in Medical Research,” BMJ Evidence-Based Medicine, DOI: 10.1136/bmjebm-2019-111264.
  • Benjamini, Y. (2016), “It’s Not the p-Values’ Fault—Online Discussion on ASA Statement on Statistical Significance and p-Values,” The American Statistician, available at https://ndownloader.figstatic.com/files/5368448.
  • Evans, S. R., and Follmann, D. (2016), “Using Outcomes to Analyze Patients Rather than Patients to Analyze Outcomes: A Step toward Pragmatism in Benefit:Risk Evaluation,” Statistics in Biopharmaceutical Research, 8, 386–393, DOI: 10.1080/19466315.2016.1207561.
  • Evans, S. R., Li, L., and Wei, L. J. (2007), “Data Monitoring in Clinical Trials Using Prediction,” Drug Information Journal, 41, 733–742, DOI: 10.1177/009286150704100606.
  • Evans, S. R., Rubin, D., Follmann, D., Pennello, G., Huskins, W. C., Powers, J. H., Schoenfeld, D., Chuang-Stein, C., Cosgrove, S. E., Fowler, V. G., Jr., Lautenbach, E., and Chambers, H. F. (2015), “Desirability of Outcome Ranking (DOOR) and Response Adjusted for Duration of Antibiotic Risk (RADAR),” Clinical Infectious Diseases, 61, 800–806, DOI: 10.1093/cid/civ495.
  • Evans, S. R., and Ting, N. (2016), Fundamental Concepts for New Clinical Trialists, Boca Raton, FL: CRC Press.
  • Food and Drug Administration (FDA) (2020), “Guidance for Industry: Development and Licensure of Vaccines to Prevent COVID-19,” Center for Biologics Evaluation and Research, U.S. Department of Health and Human Services, available at https://www.fda.gov/media/139638/download.
  • Gibson, E. W. (2020), “The Role of p-Values in Judging the Strength of Evidence and Realistic Replication Expectations,” Statistics in Biopharmaceutical Research, DOI: 10.1080/19466315.2020.1724560.
  • Panagiotou, O. A., Ioannidis, J. P. A., and for the Genome-Wide Significance Project (2012), “What Should the Genome-Wide Significance Threshold Be? Empirical Replication of Borderline Genetic Associations,” International Journal of Epidemiology, 41, 273–286, DOI: 10.1093/ije/dyr178.
  • Perera, R., Oke, J., and Fanshawe, T. R. (2020), “A Year in Statistics—The View From the Trenches,” BMJ Evidence-Based Medicine, 25, 81–82, DOI: 10.1136/bmjebm-2019-111303.
  • Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133, DOI: 10.1080/00031305.2016.1154108.
  • Weinberg, C. R. (2001), “It’s Time to Rehabilitate the p-Value,” Epidemiology, 12, 288–290, DOI: 10.1097/00001648-200105000-00004.