
Abandon Statistical Significance

Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert & Jennifer L. Tackett
Pages 235-245 | Received 30 Oct 2017, Accepted 06 Sep 2018, Published online: 20 Mar 2019

ABSTRACT

We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. We offer recommendations for how our proposal can be implemented in the scientific publication process as well as in statistical decision making more broadly.

1 The Status Quo and Two Alternatives

The biomedical and social sciences are facing a widespread crisis with published findings failing to replicate at an alarming rate. Often, such failures to replicate are associated with claims of huge effects from subtle, sometimes even preposterous, interventions. Further, the primary evidence adduced for these claims is one or more comparisons that are anointed “statistically significant”—typically defined as comparisons with p-values less than the conventional 0.05 threshold relative to the sharp point null hypothesis of zero effect and zero systematic error.

Indeed, the status quo is that p < 0.05 is deemed as strong evidence in favor of a scientific theory and is required not only for a result to be published but even for it to be taken seriously. Specifically, statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain (for want of a better term, we hereafter refer to these collectively as the currently subordinate factors).

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap Citation2010; Bem Citation2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise. Consequently, low replication rates are to be expected given existing scientific practices (Ioannidis Citation2005; Smaldino and McElreath Citation2016), and calls for reform, which are not new (see, e.g., Meehl Citation1978), have become insistent.
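
To make the noise-chasing concern concrete, here is a minimal simulation sketch (our illustration in Python with NumPy and SciPy, not taken from the cited papers; the sample size and number of comparisons are assumed): with ten comparisons available, pure noise yields at least one p < 0.05 roughly 40% of the time.

```python
# Minimal sketch: how often pure noise produces at least one "statistically
# significant" comparison when k comparisons are available (actual or potential).
# All numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, n_sims = 50, 10, 5_000      # per-group sample size, comparisons, simulations
any_significant = 0
for _ in range(n_sims):
    # k independent two-group comparisons with zero true effect everywhere
    p_values = [
        stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        for _ in range(k)
    ]
    any_significant += min(p_values) < 0.05

print(f"P(at least one p < 0.05 | pure noise, {k} comparisons) "
      f"= {any_significant / n_sims:.2f}")   # close to 1 - 0.95**k, about 0.40
```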

One proposal, suggested by Benjamin and 71 coauthors including distinguished scholars from a wide variety of fields, is to redefine statistical significance, “to change the default p-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005” (Benjamin et al. Citation2018). While, as they note, “changing the p-value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance,” we believe this “quick fix,” this “dam to contain the flood” in the words of a prominent member of the 72 (Resnick Citation2017), is insufficient to overcome current difficulties with replication. Instead, we believe it opportune to proceed immediately with other measures, perhaps more radical and more difficult but also more principled and more permanent.

In particular, we propose to abandon statistical significance, to drop the null hypothesis significance testing (NHST) paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, rather than allowing statistical significance as determined by p < 0.05 (or some other threshold whether based on p-values, confidence intervals, Bayes factors, or some other purely statistical measure) to serve as a lexicographic decision rule in scientific publication and statistical decision making more broadly, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with the currently subordinate factors as just one among many pieces of evidence.

To be clear, we have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

In the remainder of this article, we discuss general problems with NHST that motivate our proposal to abandon statistical significance and that remain unresolved by the Benjamin et al. (Citation2018) proposal. We then discuss problems specific to the Benjamin et al. (Citation2018) proposal. We next offer recommendations for how, in practice, the p-value can be demoted from its threshold screening role and instead be considered as just one among many pieces of evidence in the scientific publication process as well as in statistical decision making more broadly. We conclude with a brief discussion.

2 Problems General to Null Hypothesis Significance Testing

2.1 Preface

As noted, the NHST paradigm upon which the status quo and the Benjamin et al. (Citation2018) proposal rest is the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences (see, e.g., Morrison and Henkel Citation1970; Sawyer and Peter Citation1983; Gigerenzer Citation1987; McCloskey and Ziliak Citation1996; Gill Citation1999; Anderson, Burnham, and Thompson Citation2000; Gigerenzer Citation2004; Hubbard Citation2004). Despite this, it has been roundly criticized both inside and outside of statistics over the decades (see, e.g., Rozeboom Citation1960; Bakan Citation1966; Meehl Citation1978; Serlin and Lapsley Citation1993; Cohen Citation1994; McCloskey and Ziliak Citation1996; Schmidt Citation1996; Hunter Citation1997; Gill Citation1999; Gigerenzer Citation2004; Gigerenzer, Krauss, and Vitouch Citation2004; Briggs Citation2016; McShane and Gal Citation2016, Citation2017). Indeed, the breadth of literature on this topic across time and fields makes a complete review intractable. Consequently, we focus on what we view as among the most important criticisms of NHST for the biomedical and social sciences.

2.2 Implausible Null Hypothesis

In the biomedical and social sciences, effects are typically small and vary considerably across people and contexts. In addition, measurements are often highly variable and only indirectly related to underlying constructs of interest; thus, even when sample sizes are large, the possibilities of systematic bias and variation can result in the equivalent of small or unrepresentative samples. Consequently, estimates from any single study—the typical fundamental unit of analysis—are themselves generally noisy.

In addition, the null hypothesis employed in the overwhelming majority of applications is the sharp point null hypothesis of zero effect—that is, no difference among two or more treatments or groups—and zero systematic error—which encompasses both the adequacy of the statistical model used to compute the p-value (e.g., in terms of functional form and distributional assumptions) and any and all forms of systematic or nonsampling error, which vary by field but include measurement error; problems with reliability and validity; biased samples; nonrandom treatment assignment; missingness; nonresponse; failure of double-blinding; noncompliance; and confounding.

The combination of these features of the biomedical and social sciences and this sharp point null hypothesis of zero effect and zero systematic error is highly problematic. Specifically, because effects are generally small and variable, the assumption of zero effect is false. Further, even were the assumption of zero effect true for some phenomenon, the effect under consideration in any study designed to examine this phenomenon would not be zero because measurements are generally noisy and systematically so. Consequently, the sharp point null hypothesis of zero effect and zero systematic error employed in the overwhelming majority of applications is implausible (Berkson Citation1938; Edwards, Lindman, and Savage Citation1963; Bakan Citation1966; Meehl Citation1990; Tukey Citation1991; Cohen Citation1994; Gelman et al. Citation2014; McShane and Böckenholt Citation2014; Gelman Citation2015) and thus uninteresting.
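
As a concrete illustration of why the zero-systematic-error component matters, consider the following simulation sketch (ours; the 0.05 standard deviation bias and the sample sizes are assumed purely for illustration): even when the true effect is exactly zero, a small uncorrected bias makes the sharp point null false, so a sufficiently large study will reject it.

```python
# Sketch: with a small uncorrected systematic error, the sharp point null of
# zero effect AND zero systematic error is false, so large samples reject it
# even though the true effect is exactly zero. Numbers are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bias = 0.05                        # e.g., measurement or selection bias of 0.05 SD
for n in [100, 1_000, 10_000, 100_000]:
    treated = rng.normal(loc=0.0 + bias, scale=1.0, size=n)   # true effect is zero
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    print(f"n = {n:>7,} per group: p = {stats.ttest_ind(treated, control).pvalue:.3g}")
# For large n the p-value is tiny: the test "detects" the bias, not an effect.
```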

These problems are exacerbated under a lexicographic decision rule for publication as per the status quo and the Benjamin et al. (Citation2018) proposal. Specifically, because noisy estimates that attain statistical significance are upwardly biased in magnitude (potentially to a large degree) and often of the wrong sign (Gelman and Carlin Citation2014), a lexicographic decision rule results in a tarnished literature. In addition, because many smaller, less resource-intensive, noisier studies are more likely to yield (or can be made more likely to yield; Simmons, Nelson, and Simonsohn Citation2011) one or more statistically significant results than fewer larger, more resource-intensive, better studies, a lexicographic decision rule at least indirectly encourages the former over the latter. These issues are compounded when researchers engage in multiple comparisons—whether actual or potential (i.e., the “garden of forking paths”; Gelman and Loken Citation2014).
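
The exaggeration and sign issues can be seen in a small simulation in the spirit of Gelman and Carlin (Citation2014) (our sketch; the true effect of 0.2 standard errors is an assumed illustrative value):

```python
# Sketch: when the true effect is small relative to the standard error, estimates
# that happen to reach p < 0.05 exaggerate the effect (Type M error) and are
# sometimes of the wrong sign (Type S error). Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(2)
true_effect, se, n_sims = 0.2, 1.0, 100_000
estimates = rng.normal(loc=true_effect, scale=se, size=n_sims)
significant = np.abs(estimates / se) > 1.96          # two-sided p < 0.05

sig_estimates = estimates[significant]
print(f"share reaching p < 0.05:            {significant.mean():.2f}")
print(f"exaggeration ratio (Type M):        {np.mean(np.abs(sig_estimates)) / true_effect:.1f}x")
print(f"wrong-sign share among significant: {np.mean(sig_estimates < 0):.2f}")
```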

In sum, various features of the biomedical and social sciences—for example, small and variable effects, systematic error, noisy measurements, a lexicographic decision rule for publication, and research practices—make NHST and in particular the sharp point null hypothesis of zero effect and zero systematic error particularly poorly suited for these domains.

2.3 Categorization of Evidence

NHST is associated with a number of problems related to the dichotomization of evidence into the different categories “statistically significant” and “not statistically significant” (or, sometimes, trichotomization with “marginally significant” as an intermediate category) depending upon where the p-value stands relative to certain conventional thresholds. Indeed, one well-known criticism of the NHST paradigm is that the conventional 0.05 threshold—or for that matter any other one—is entirely arbitrary (Fisher Citation1926; Yule and Kendall Citation1950; Cramer Citation1955; Cochran Citation1976; Cowles and Davis Citation1982).

A related line of criticism suggests that the problem is with having a threshold in the first place: the dichotomization (or trichotomization) of evidence into different categories of statistical significance itself has “no ontological basis” (Rosnow and Rosenthal Citation1989). Specifically, Rosnow and Rosenthal (Citation1989) note that “from an ontological viewpoint, there is no sharp line between a ‘significant’ and a ‘nonsignificant’ difference; significance in statistics…varies continuously between extremes” and thus advocate “view[ing] the strength of evidence for or against the null as a fairly continuous function of the magnitude of p.”

While we agree treating the p-value continuously rather than in a thresholded manner constitutes an improvement, we go further and argue that it seldom makes sense to calibrate evidence as a function of the p-value. We hold this for at least three reasons. First, and in our view the most important, it seldom makes sense because the p-value is, in the overwhelming majority of applications, defined relative to the generally implausible and uninteresting sharp point null hypothesis of zero effect and zero systematic error. Second, because it is a poor measure of the evidence for or against a statistical hypothesis (Edwards, Lindman, and Savage Citation1963; Berger and Sellke Citation1987; Cohen Citation1994; Hubbard and Lindsay Citation2008). Third, because it tests the hypothesis that one or more model parameters equal the tested values—but only given all other model assumptions. These other assumptions—in particular, zero systematic error—seldom hold (or are at least far from given) in the biomedical and social sciences. Consequently, “a small p-value only signals that there may be a problem with at least one assumption, without saying which one. Asymmetrically, a large p-value only means that this particular test did not detect a problem—perhaps because there is none, or because the test is insensitive to the problems, or because biases and random errors largely canceled each other out” (Greenland Citation2017). We note similar considerations hold for other purely statistical measures.

2.4 Erroneous Scientific Reasoning

The NHST paradigm and the p-value thresholds intrinsic to it are not only problematic in and of themselves but also routinely result in erroneous scientific reasoning. For example, researchers typically take the rejection of the sharp point null hypothesis of zero effect and zero systematic error as positive or even definitive evidence in favor of some preferred alternative hypothesis—a logical fallacy. In addition, they often draw scientific conclusions largely if not entirely based on whether or not a p-value crosses the 0.05 threshold instead of taking a more holistic view of the evidence that includes the consideration of the currently subordinate factors. Further, they often confuse statistical significance and practical importance (see, e.g., Freeman Citation1993). Finally, they often incorrectly believe a result with a p-value below 0.05 is evidence that a relationship is causal (Holman et al. Citation2001).

In addition, because the assignment of evidence to different categories (e.g., statistically significant and not statistically significant) is a strong inducement to the conclusion that the items thusly assigned are categorically different, NHST encourages researchers to engage in dichotomous thinking, that is, to interpret evidence dichotomously rather than continuously. Specifically, researchers interpret evidence that reaches the conventional threshold for statistical significance as a demonstration of a difference, and, in contrast, they interpret evidence that fails to reach this threshold as a demonstration of no difference.

An example of erroneous reasoning resulting from dichotomous thinking is provided by Gelman and Stern (Citation2006) who show that applied researchers often fail to appreciate that “the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” Additional examples are provided by McShane and Gal (Citation2016) who show that researchers across a wide variety of fields including medicine, epidemiology, cognitive science, psychology, and economics (i) interpret p-values dichotomously rather than continuously, focusing solely on whether or not the p-value is below 0.05 rather than the magnitude of the p-value; (ii) fixate on p-values even when they are irrelevant, for example, when asked about descriptive statistics; and (iii) ignore other evidence, for example, the magnitude of treatment differences. McShane and Gal (Citation2017) show that even statisticians are susceptible to these errors.
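
A short numerical sketch of the Gelman and Stern (Citation2006) point (the specific estimates and standard errors below are illustrative, in the spirit of their discussion):

```python
# Sketch: study A is "significant," study B is not, yet the difference between
# their estimates is nowhere near significant. Numbers are illustrative.
import numpy as np
from scipy import stats

def two_sided_p(estimate, se):
    return 2 * stats.norm.sf(abs(estimate / se))

est_a, se_a = 25.0, 10.0            # study A: z = 2.5
est_b, se_b = 10.0, 10.0            # study B: z = 1.0
diff, se_diff = est_a - est_b, np.hypot(se_a, se_b)   # independent studies

print(f"study A:    p = {two_sided_p(est_a, se_a):.3f}   (below 0.05)")
print(f"study B:    p = {two_sided_p(est_b, se_b):.3f}   (well above 0.05)")
print(f"difference: p = {two_sided_p(diff, se_diff):.3f} (also well above 0.05)")
```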

2.5 Misinterpretation of the p-Value

A final criticism of the NHST paradigm pertains to common misinterpretations of the p-value. While formally defined as the probability of observing data as extreme or more extreme than that actually observed assuming the null hypothesis is true, the p-value has often been misinterpreted as, inter alia, (i) the probability that the null hypothesis is true, (ii) one minus the probability that the alternative hypothesis is true, or (iii) one minus the probability of replication. For example, Gigerenzer (Citation2004) reports an example of research conducted on psychology professors, lecturers, teaching assistants, and students. Subjects were given the result of a simple t-test of two independent means (t = 2.7, df = 18, p = 0.01) and were asked six true or false questions based on the result and designed to test common misinterpretations of the p-value. All six of the statements were false and, despite the fact that the study materials noted “several or none of the statements may be correct,” (i) none of the 45 students, (ii) only four of the 39 professors and lecturers who did not teach statistics, and (iii) only six of the 30 professors and lecturers who did teach statistics marked all as false (members of each group marked an average of 3.5, 4.0, and 4.1 statements, respectively, as false). For related results, see Oakes (Citation1986); Cohen (Citation1994); Haller and Krauss (Citation2002); Gigerenzer (Citation2018).
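
For readers who want the formal definition made operational, the following sketch (our example; it does not reproduce the Gigerenzer study materials) computes a p-value both analytically and as a tail probability under the null model, which is all the p-value is:

```python
# Sketch: the p-value is the probability, computed under the null model, of a
# test statistic as extreme or more extreme than the one observed. It is NOT
# P(null is true), not 1 - P(alternative), and not 1 - P(replication).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=1.0, size=10)    # some observed data
y = rng.normal(loc=0.0, scale=1.0, size=10)
observed = stats.ttest_ind(x, y)

# Re-create the null model by simulation: both groups drawn from one distribution.
t_null = np.array([
    stats.ttest_ind(rng.normal(size=10), rng.normal(size=10)).statistic
    for _ in range(20_000)
])
p_simulated = np.mean(np.abs(t_null) >= abs(observed.statistic))
print(f"analytic p = {observed.pvalue:.3f}, simulated tail probability = {p_simulated:.3f}")
```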

3 Problems Specific to the Benjamin et al. (Citation2018) Proposal

Beyond concerns about the NHST paradigm upon which the status quo and the Benjamin et al. (Citation2018) proposal rest, there are additional problems specific to the latter proposal. First, Benjamin et al. (Citation2018) propose the 0.005 threshold because it (i) “corresponds to Bayes factors between approximately 14 and 26” in favor of the alternative hypothesis and (ii) “would reduce the false positive rate to levels we judge to be reasonable.” However, little to no justification is provided for either of these choices of levels.

Second, Benjamin et al. (Citation2018) “restrict [their] recommendation to claims of discovery of new effects” which is problematic for at least two reasons. First, the proposed policy is rendered entirely impractical because they fail to define what constitutes a new effect; this is especially so in domains where research is believed to be incremental and cumulative. Second, the proposed policy would lead to incoherence when applied to replication—the very issue their proposal is meant to address. In particular, the order in which two independent studies of a common phenomenon are conducted ought to be irrelevant but is not under the Benjamin et al. (Citation2018) proposal. Specifically, given one study with p < 0.005 and another with p ∈ (0.005,0.05), it would matter crucially which study was conducted first (and thus was “new”) under the definition of replication employed in practice (i.e., a subsequent study is considered to successfully replicate a prior study if either both fail to attain statistical significance or both attain statistical significance and are directionally consistent): the second (replication) study would be deemed a success under the Benjamin et al. (Citation2018) proposal if the first study was the p < 0.005 study but a failure otherwise.
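
The order dependence can be made explicit with a small sketch (a hypothetical helper function of ours, not anything proposed by Benjamin et al.; it simply encodes the definition of replication success described above):

```python
# Sketch: the same two p-values yield opposite replication verdicts depending on
# which study happened to come first (and so counts as the "new" discovery).
def replication_success(p_first, p_second, alpha_new=0.005, alpha_rep=0.05):
    """Second study replicates the first if both attain significance or both fail
    to (direction assumed consistent); the 'new' study faces the 0.005 threshold,
    the replication the conventional 0.05."""
    return (p_first < alpha_new) == (p_second < alpha_rep)

print(replication_success(0.003, 0.020))   # True: both significant -> success
print(replication_success(0.020, 0.003))   # False: first is not, second is -> failure
```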

Third, the fact that uncorrected multiple comparisons—both actual and potential—are the norm in applied research strictly speaking invalidates all p-values outside those from studies with preregistered protocols and data analysis procedures. This concern is acknowledged by Benjamin et al. (Citation2018). Nonetheless, what goes unacknowledged is that even with preregistration, p-values can be invalidated if the underlying model that generated the p-value is misspecified in an important manner.

Fourth, the mathematical justification underlying the Benjamin et al. (Citation2018) proposal has come under no small amount of criticism. Specifically, the uniformly most powerful Bayesian tests (UMPBTs) that underlie the proposal were introduced and defended by Johnson (Citation2013b) in parallel with his call in Johnson (Citation2013a)—and now repeated in Benjamin et al. (Citation2018)—to use 0.005 as the new threshold. We see a number of concerns with UMPBTs that we discuss in Appendix A. Perhaps most relevant for the biomedical and social sciences, the UMPBT approach is deeply entrenched in the century-old Neyman–Pearson formalism of binary decisions and 0–1 loss functions which does not in general map, even in an approximate way, to processes of scientific learning or costs and benefits. Consequently, the logic underlying the proposal to move to a lower p-value threshold avoids firmly confronting the nature of the issue: any such rule implicitly expresses a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes (Gelman and Robert Citation2014) which in turn depend on the problem at hand and which vary tremendously across studies and domains.

More speculatively, we are not convinced the more stringent 0.005 threshold for statistical significance would be helpful. In the short term, it could reduce the flow of low quality work that is currently polluting even top journals. In the medium term, it could motivate researchers to perform higher-quality work that is more likely to crack the 0.005 barrier. On the other hand, it could lead to even more overconfidence in results that do get published as well as a concomitant greater exaggeration of the effect sizes associated with such results. It could also lead to the discounting of important findings that happen not to reach the more stringent threshold. In sum, we have no idea whether implementation of the proposed 0.005 threshold would improve or degrade the state of science as we can envision both positive and negative outcomes resulting from it. Ultimately, while this question may be interesting if difficult to answer, we view it as outside our purview because we believe that thresholds whether based on p-values or other purely statistical measures are a bad idea.

Perhaps curiously, we do not necessarily expect that Benjamin et al. (Citation2018) would disagree with our criticism that their proposal is insufficient to overcome current difficulties with replication (or perhaps even with our own proposal to abandon statistical significance). After all, they “restrict [their] recommendation to claims of discovery of new effects” and recognize that “the choice of any particular threshold is arbitrary” and “should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of Type I versus Type II errors, and other factors that vary by research topic.” Indeed, “many of [the authors] agree that there are better approaches to statistical analyses than null hypothesis significance testing.”

4 Abandoning Statistical Significance

4.1 Summation and Recommendations

What can be done? Statistics is hard, especially when effects are small and variable and measurements are noisy. There are no quick fixes. Proposals such as changing the default p-value threshold for statistical significance, employing confidence intervals with a focus on whether or not they contain zero, or employing Bayes factors along with conventional classifications for evaluating the strength of evidence suffer from the same or similar issues as the current use of p-values with the 0.05 threshold. In particular, each implicitly or explicitly categorizes evidence based on thresholds relative to the generally implausible and uninteresting sharp point null hypothesis of zero effect and zero systematic error. Further, each is a purely statistical measure that fails to take a more holistic view of the evidence that includes the consideration of the currently subordinate factors, that is, related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.
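
To see why, for example, checking whether a 95% confidence interval contains zero is the same dichotomization as checking p < 0.05, consider this sketch (normal-approximation case, arbitrary simulated estimates):

```python
# Sketch: in the normal-approximation case, "the 95% CI excludes zero" and
# "p < 0.05" are the same binary verdict; both reduce to |estimate/SE| > 1.96
# against the sharp point null of zero effect and zero systematic error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
z_crit = stats.norm.ppf(0.975)                    # about 1.96
for _ in range(5):
    estimate, se = rng.normal(scale=2.0), rng.uniform(0.5, 2.0)
    p = 2 * stats.norm.sf(abs(estimate / se))
    excludes_zero = (estimate - z_crit * se > 0) or (estimate + z_crit * se < 0)
    print(f"p = {p:.3f}   CI excludes zero: {excludes_zero}   p < 0.05: {p < 0.05}")
# The last two columns always agree.
```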

In brief, each is a form of statistical alchemy that falsely promises to transmute randomness into certainty, an “uncertainty laundering” (Gelman Citation2016) that begins with data and concludes with dichotomous declarations of truth or falsity—binary statements about there being “an effect” or “no effect”—based on some p-value or other statistical threshold being attained. A critical first step forward is to begin accepting uncertainty and embracing variation in effects (Carlin Citation2016; Gelman Citation2016) and recognizing that we can learn much (indeed, more) about the world by forsaking the false promise of certainty offered by such dichotomization.

Toward this end, we offer recommendations for how, in practice, the p-value can be demoted from its threshold screening role and instead, treated continuously, be considered along with the currently subordinate factors as just one among many pieces of evidence. First, we recommend authors use the currently subordinate factors to motivate their data collection, statistical analysis, interpretation of results, writing, and related matters; we also recommend they analyze and report all of their data and relevant results. Second, we recommend editors and reviewers explicitly evaluate papers with regard to not only purely statistical measures but also the currently subordinate factors.

As a highly interdisciplinary research team with representation from statistics, political science, psychology, and consumer behavior, we are acutely aware that the implementation of our broad recommendations will and ought to vary tremendously across—and even within—domains. Further, we are not so supercilious as to believe that we, by ourselves, are capable of providing concrete and specific guidance on the application of these recommendations across all or perhaps even in any of these domains. Indeed, we do not believe a “template” for our recommendations is possible or desirable. In fact, such a template could even be dangerous in that it might result in a rote and recipe-like application of our recommendations that would not be entirely dissimilar to, even if perhaps less harmful than, the current practice of rote and recipe-like application of NHST. To those who might argue that, without such a template, our recommendations are unrealistic or unlikely to be adopted in practice, we reiterate that statistics is hard and a formulaic approach to statistics is a principal cause of the current replication crisis. It is for these reasons we advocate this more radical and more difficult but also more principled and more permanent approach. Nonetheless, we suggest below some broad principles that show how our recommendations might be applied. We also provide a case study in Appendix B.

4.2 For Authors

We recommend authors use the currently subordinate factors to motivate their data collection, statistical analysis, interpretation of results, writing, and related matters; we also recommend they analyze and report all of their data and relevant results rather than focusing on single comparisons that attain some p-value or other statistical threshold.

One specific operationalization of the first part of our recommendation might be to include in their manuscripts a section that directly addresses how each of the currently subordinate factors motivated their various decisions regarding data collection, statistical analysis, interpretation of results, and writing in the context of the totality of the data and results. Such a section could, for example, discuss study design in the context of subject-matter knowledge and expectations of effect sizes as discussed by Gelman and Carlin (Citation2014). It could also discuss the plausibility of the mechanism by (i) formalizing the hypothesized mechanism for the effect in question and expounding on the various components of it, (ii) clarifying which components were measured and analyzed in the study, and (iii) discussing aspects of the results that support as well as those that undermine the hypothesized mechanism.

One might think that the second part of our recommendation—to analyze and report all of the data and relevant results—is such a fundamental principle of science that it need hardly be mentioned. However, this is not the case! As discussed above, the status quo in scientific publication is a lexicographic decision rule whereby p < 0.05 is virtually always required for a result to be published and, while there are some exceptions, standard practice is to focus on such results and to not report all relevant findings.

Given the current state of practice, authors may not have a sense for how they might go about this. Rather than attempt to provide broad guidance, we direct the reader to illustrations in clinical psychology (Tackett et al. Citation2014), epidemiology (Gelman and Auerbach Citation2016a,Citationb), political science (Trangucci et al. Citation2018), program evaluation (Mitchell et al. Citation2018), and social psychology and consumer behavior (McShane and Böckenholt Citation2017) as well as our case study in Appendix B.

4.3 For Editors and Reviewers

We recommend editors and reviewers explicitly evaluate papers with regard to not only purely statistical measures but also the currently subordinate factors; this should be far superior to the status quo, namely giving consideration—often scant—to the currently subordinate factors only once some p-value or other statistical threshold has been reached.

One specific operationalization of this recommendation might be to incorporate consideration of these factors into various stages of the review process. For example, editors could require reviewers to provide quantitative evaluations of each factor—including domain-specific factors determined by the editor—as well as an overall quantitative evaluation of the strength of evidence as a supplement to the current open-ended, qualitative evaluations. These could then be weighted by the editors’ publicly disclosed (or even reviewers’ own) importance rating of each factor. Additionally, editors could discuss and address the evaluation and importance of each factor in decision letters, thereby providing a more holistic view of the evidence.
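
As one purely hypothetical illustration of such a scheme (the factor list, seven-point scale, and weights below are assumptions of ours, not a prescription), an editor's disclosed weights could be combined with a reviewer's factor ratings as follows:

```python
# Hypothetical sketch of combining reviewer ratings of the currently subordinate
# factors with editor-disclosed weights. All names, weights, and ratings are
# assumed for illustration.
editor_weights = {            # publicly disclosed by the editor; sums to 1
    "related prior evidence": 0.20,
    "plausibility of mechanism": 0.20,
    "study design and data quality": 0.25,
    "real world costs and benefits": 0.15,
    "novelty of finding": 0.10,
    "statistical evidence": 0.10,
}
reviewer_ratings = {          # one reviewer, each factor rated 1 (weak) to 7 (strong)
    "related prior evidence": 5,
    "plausibility of mechanism": 6,
    "study design and data quality": 4,
    "real world costs and benefits": 3,
    "novelty of finding": 6,
    "statistical evidence": 4,
}

overall = sum(w * reviewer_ratings[f] for f, w in editor_weights.items())
print(f"weighted overall evaluation: {overall:.2f} out of 7")
# The p-value enters, if at all, only through the "statistical evidence" rating,
# as one factor among many rather than as a screening threshold.
```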

One might object here and call our position naive: do not editors and reviewers require some bright-line threshold to decide whether the data supporting a claim are far enough from pure noise to support publication? Do not statistical thresholds provide objective standards for what constitutes evidence, and does this not in turn provide a valuable brake on the subjectivity and personal biases of editors and reviewers? Against these objections, we would argue that even were such a threshold needed, it would not make sense to set it based on the p-value given that it seldom makes sense to calibrate evidence as a function of this statistic and given that the costs and benefits of publishing noisy results vary by field. Additionally, the p-value is not a purely objective standard: different model specifications and statistical tests for the same data and null hypothesis yield different p-values; to complicate matters further, many subjective decisions regarding data protocols and analysis procedures such as coding and exclusion are required in practice and these often strongly impact the p-value ultimately reported. Finally, we fail to see why such a threshold screening rule is needed: editors and reviewers already make publication decisions one at a time based on qualitative factors, and this could continue to happen if the p-value were demoted from its threshold screening role to just one among many pieces of evidence. Indeed, no single number—whether it be a p-value, Bayes factor, or some other statistical or nonstatistical measure—is capable of eliminating subjectivity and personal biases.
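
The point that the same data and the same null hypothesis can yield different p-values under different reasonable analysis choices is easy to demonstrate (a sketch with simulated data; the particular specifications below are assumed examples of defensible choices):

```python
# Sketch: one simulated dataset, one null hypothesis of "no treatment effect,"
# several defensible analysis choices, generally different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
treatment = rng.integers(0, 2, size=n)
outcome = 0.3 * treatment + rng.normal(size=n)
treated, control = outcome[treatment == 1], outcome[treatment == 0]

p_equal_var = stats.ttest_ind(treated, control).pvalue               # pooled t-test
p_welch = stats.ttest_ind(treated, control, equal_var=False).pvalue  # Welch t-test
p_rank = stats.mannwhitneyu(treated, control).pvalue                 # rank-based test
keep = np.abs(outcome - outcome.mean()) < 2 * outcome.std()          # drop "outliers"
p_trimmed = stats.ttest_ind(outcome[(treatment == 1) & keep],
                            outcome[(treatment == 0) & keep]).pvalue

for name, p in [("pooled t-test", p_equal_var), ("Welch t-test", p_welch),
                ("Mann-Whitney", p_rank), ("t-test after outlier exclusion", p_trimmed)]:
    print(f"{name:<30} p = {p:.3f}")
```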

Instead, we believe it is entirely acceptable to publish a paper featuring a result with, say, a p-value of 0.2 or a 90% confidence interval that includes zero provided it is relevant to a theoretical or applied question of interest and the interpretation is sufficiently accurate. It should also be possible to publish a result with, say, a p-value of 0.001 without this being taken to imply the truth of some favored alternative hypothesis.

The p-value is relevant to the question of how easily a result could be explained by a particular null model, but there is no reason this should be the crucial factor in publication. A result can be consistent with a null model but still be relevant to science or policy debates, and a result can reject a null model without offering anything of scientific interest or policy relevance.

In sum, editors and reviewers can and should feel free to accept papers and present readers with the relevant evidence. We would much rather see a paper that, for example, states that there is weak evidence for an interesting finding but that existing data remain consistent with null effects than for the publication process to screen out such findings or encourage authors to cheat to obtain statistical significance.

4.4 Abandoning Statistical Significance Outside Scientific Publishing

While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where nongovernmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.

While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

That said, we acknowledge that thresholds—of a nonstatistical variety—may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.
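
A tiny sketch of this sort of decision rule (the customer-level numbers and the response model are entirely assumed):

```python
# Hypothetical sketch: send the offer only to customers whose modeled expected
# profit exceeds zero; the threshold is on expected profit, not on a p-value.
import numpy as np

rng = np.random.default_rng(6)
n_customers = 5
offer_cost = 2.00                                        # cost of sending the offer
p_respond = rng.uniform(0.01, 0.20, size=n_customers)    # modeled response probability
revenue_if_respond = rng.uniform(5, 60, size=n_customers)

expected_profit = p_respond * revenue_if_respond - offer_cost
for i, profit in enumerate(expected_profit):
    decision = "send" if profit > 0 else "do not send"
    print(f"customer {i}: expected profit = {profit:+.2f} -> {decision}")
```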

Even in pure research scenarios where there is no obvious cost-benefit calculation—for example, a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.

While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.
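
One minimal sketch of what this could look like (a crude normal-normal partial-pooling calculation with made-up estimates and standard errors; in practice one would fit a proper multilevel or meta-analytic model):

```python
# Sketch: rank candidate leads by partially pooled (shrunken) effect estimates
# from a simple normal-normal model rather than by p-value thresholds.
# Estimates, standard errors, and the crude moment estimator are illustrative.
import numpy as np

estimates = np.array([12.0, 8.0, 3.0, 15.0, 1.0])    # noisy estimates for 5 leads
std_errors = np.array([6.0, 3.0, 1.0, 7.0, 0.8])

grand_mean = np.average(estimates, weights=1 / std_errors**2)
tau2 = max(np.var(estimates) - np.mean(std_errors**2), 0.0)   # between-lead variance

shrinkage = tau2 / (tau2 + std_errors**2)
shrunken = grand_mean + shrinkage * (estimates - grand_mean)

for i in np.argsort(-shrunken):
    print(f"lead {i}: raw = {estimates[i]:5.1f} (SE {std_errors[i]:4.1f}), "
          f"partially pooled = {shrunken[i]:5.1f}")
# Noisy extreme estimates are pulled strongly toward the overall mean;
# precisely estimated leads move little, so the ranking can change.
```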

We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman Citation2015, Citation2017; McShane and Böckenholt Citation2017, Citation2018).

4.5 Getting From Here to There

How do we get from here—NHST, deterministic summaries, overconfidence in results, and statistical analysis focused on reporting just some of the data—to there—statistical analysis and interpretation of results that accepts uncertainty and embraces variation and that features full reporting of results rather than focusing on whatever happens to exceed some statistical threshold?

We have offered the recommendations that we believe will serve researchers best. However, we recognize that research takes place within an institutional structure that often encourages behavior that is counter to these recommendations. Researchers respond to the expectations of funding agencies in study design and editors and reviewers in writing. Conversely, funding agencies must choose among the submissions they receive and editors can only publish papers that are submitted to them. A careful research proposal that openly grapples with uncertainty may unfortunately lose out in the funding competition to a more traditional proposal that blithely promises 80% power based on selected and overestimated effect sizes. Similarly, a paper that presents all the data without making inappropriate claims of certainty may not get published in a journal that also receives submissions in which statistically significant results are presented at face value.

These institutional problems are difficult and we do not propose solutions to them. We imagine improvement will come in fits and starts, in several parallel tracks, all of which we and others have tried to contribute to in our applied and methodological research: improved statistical methods that move beyond NHST and include multilevel modeling, machine learning, statistical graphics, and other tools for analyzing and visualizing large amounts of data; applied examples using these improved methods, thereby demonstrating that it is possible to perform successful statistical analyses without aiming for deterministic results; theoretical work on the statistical effects of selection based on statistical significance and other decision criteria; and criticism of published work with gross overestimates of effect sizes or inappropriate claims of certainty. While we recognize change will likely require institutional reform including major modifications of current practices of funding agencies and editors and reviewers, we are also optimistic that some combination of recognition of error and awareness of alternatives can also motivate change.

5 Discussion

In this article, we have proposed to abandon statistical significance and offered recommendations for how this can be implemented in the scientific publication process as well as in statistical decision making more broadly. We reiterate that we have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors.

While our proposal to abandon statistical significance may seem on the surface quite radical, at least one aspect of it—to treat p-values or other purely statistical measures continuously rather than in a thresholded manner—most certainly is not. Indeed, this was advocated by Fisher himself (Fisher Citation1956; Greenland and Poole Citation2013) as well as by other early and eminent statisticians including Pearson (Hurlbert and Lombardi Citation2009), Cox (Cox Citation1977, Citation1982), and Lehmann (Lehmann Citation1993; Senn Citation2001). It has also been advocated outside of statistics over the decades (see, e.g., Boring Citation1919; Eysenck Citation1960; Skipper, Guenther, and Nass Citation1967) and recently (see, e.g., Drummond Citation2015; Lemoine et al. Citation2016; Amrhein, Korner-Nievergelt, and Roth Citation2017; Greenland Citation2017; Amrhein and Greenland Citation2018). Finally, it is fully consistent with the recent American Statistical Association (ASA) Statement on Statistical Significance and p-values (“Principle 3: Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold;” Wasserstein and Lazar Citation2016). In sum, this aspect of our proposal is part of a long literature both inside and outside of statistics over the decades that stands in direct opposition to the threshold-based status quo and the Benjamin et al. (Citation2018) proposal.

Where our proposal might move beyond this literature is in three ways. First, we suggest that p-values or other purely statistical measures, thresholded or not, should not take priority over the currently subordinate factors (that said, others too have emphasized this including the recent ASA Statement which advises that “researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis” and cautions that “no single index should substitute for scientific reasoning;” Wasserstein and Lazar Citation2016). Second, as discussed above, while we believe treating the p-value continuously rather than in a thresholded manner constitutes an improvement, we go further and argue that it seldom makes sense to calibrate evidence as a function of the p-value or other purely statistical measures. Third, we offer recommendations for authors as well as editors and reviewers for how our proposal to abandon statistical significance can be implemented in the scientific publication process as well as in statistical decision making more broadly.

Our recommendations will not themselves resolve the replication crisis, but we believe they will have the salutary effect of pushing researchers away from the pursuit of irrelevant statistical targets and toward understanding of theory, mechanism, and measurement. We also hope they will push them to move beyond the paradigm of routine “discovery,” and binary statements about there being “an effect” or “no effect,” to one of continuous and inevitably flawed learning that is accepting of uncertainty and variation.

Acknowledgment

We thank the National Science Foundation, the Institute for Education Sciences, and the Office of Naval Research for partial support of Andrew Gelman’s work.

References

  • Amrhein, V., and Greenland, S. (2018), “Remove, Rather Than Redefine, Statistical Significance,” Nature Human Behaviour, 2, 4. DOI:10.1038/s41562-017-0224-0.
  • Amrhein, V., Korner-Nievergelt, F., and Roth, T. (2017). “The Earth is Flat (p > 0.05): Significance Thresholds and the Crisis of Unreplicable Research,” PeerJ, 5, e3544. DOI:10.7717/peerj.3544.
  • Anderson, D. R., Burnham, K. P., and Thompson, W. L. (2000), “Null Hypothesis Testing: Problems, Prevalence, and an Alternative,” Journal of Wildlife Management, 64, 912–923. DOI:10.2307/3803199.
  • Bakan, D. (1966), “The Test of Significance in Psychological Research,” Psychological Bulletin, 66(6), 423–437. DOI:10.1037/h0020412.
  • Bem, D. J. (2011), “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect,” Journal of Personality and Social Psychology, 100, 407–425. DOI:10.1037/a0021524.
  • Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., and Cesarini, D. (2018), “Redefine Statistical Significance,” Nature Human Behaviour, 2, 6–10. DOI:10.1038/s41562-017-0189-z.
  • Berger, J. O., and Sellke, T. (1987), “Testing a Point Null Hypothesis: The Irreconciliability of p Values and Evidence,” Journal of the American Statistical Association, 82, 112–122. DOI:10.2307/2289131.
  • Berkson, J. (1938), “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test,” Journal of the American Statistical Association, 33, 526–536. DOI:10.1080/01621459.1938.10502329.
  • Boring, E. G. (1919), “Mathematical vs. Scientific Significance,” Psychological Bulletin, 16, 335–338. DOI:10.1037/h0074554.
  • Briggs, W. M. (2016), Uncertainty: The Soul of Modeling, Probability and Statistics, New York: Springer.
  • Carlin, J. B. (2016), “Is Reform Possible Without a Paradigm Shift?” The American Statistician, 70, 10 (supplemental material to the ASA statement on p-values and statistical significance).
  • Carney, D. R., Cuddy, A. J., and Yap, A. J. (2010), “Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance,” Psychological Science, 21, 1363–1368. DOI:10.1177/0956797610383437.
  • Cochran, W. G. (1976), “Early Development of Techniques in Comparative Experimentation,” in On the History of Statistics and Probability, New York: Dekker.
  • Cohen, J. (1994), “The Earth is Round (p < .05),” American Psychologist, 49, 997–1003.
  • Cowles, M., and Davis, C. (1982), “On the Origins of the .05 Level of Statistical Significance,” American Psychologist, 37, 553–558.
  • Cox, D. R. (1977), “The Role of Significance Tests,” Scandinavian Journal of Statistics, 4, 49–70.
  • Cox, D. R. (1982), “Statistical Significance Tests,” British Journal of Clinical Pharmacology, 14, 325–331. DOI:10.1111/j.1365-2125.1982.tb01987.x.
  • Cramer, H. (1955), The Elements of Probability Theory, New York: Wiley.
  • Drummond, G. (2015), “Most of the Time, P Is an Unreliable Marker, So We Need No Exact Cut-Off,” British Journal of Anaesthesia, 116, 894–894. DOI:10.1093/bja/aew146.
  • Edwards, W., Lindman, H., and Savage, L. J. (1963), “Bayesian Statistical Inference for Psychological Research,” Psychological Review, 70, 193. DOI:10.1037/h0044139.
  • Eysenck, H. J. (1960), “The Concept of Statistical Significance and the Controversy About One-Tailed Tests,” Psychological Review, 67, 269. DOI:10.1037/h0048412.
  • Fisher, R. A. (1926), “The Arrangement of Field Experiments,” Journal of the Ministry of Agriculture, 33, 503–513.
  • Fisher, R. A. (1956), Statistical Methods and Scientific Inference, New York: Hafner Publishing Co.
  • Freeman, P. R. (1993), “The Role of p-Values in Analysing Trial Results,” Statistics in Medicine, 12, 1443–1452.
  • Gelman, A. (2015), “The Connection Between Varying Treatment Effects and the Crisis of Unreplicable Research: A Bayesian Perspective,” Journal of Management, 41, 632–643. DOI:10.1177/0149206314525208.
  • Gelman, A. (2016), “The Problems With p-Values Are Not Just With p-Values,” The American Statistician, 70, 10 (supplemental material to the ASA statement on p-values and statistical significance).
  • Gelman, A. (2017), “The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to do About It,” Personality and Social Psychology Bulletin, 44, 16–23. DOI:10.1177/0146167217729162.
  • Gelman, A., and Auerbach, J. (2016a), “Age-Aggregation Bias in Mortality Trends,” Proceedings of the National Academy of Sciences of the United States of America, 113, E816–E817. DOI:10.1073/pnas.1523465113.
  • Gelman, A., and Auerbach, J. (2016b), “Mortality Trends by Race/Ethnicity, Sex, Age and State,” Technical Report, Columbia University.
  • Gelman, A., and Carlin, J. (2014), “Beyond Power Calculations Assessing Type s (Sign) and Type m (magnitude) Errors,” Perspectives on Psychological Science, 9, 641–651. DOI:10.1177/1745691614551642.
  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014), Bayesian Data Analysis (3rd ed.), Boca Raton, FL: Chapman and Hall/CRC.
  • Gelman, A., and Loken, E. (2014), “The Statistical Crisis in Science,” American Scientist, 102, 460–465. DOI:10.1511/2014.111.460.
  • Gelman, A., and Robert, C. P. (2014), “Revised Evidence for Statistical Standards,” Proceedings of the National Academy of Sciences of the United States of America, 111, E1933–E1933. DOI:10.1073/pnas.1322995111.
  • Gelman, A., and Stern, H. (2006), “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant,” The American Statistician, 60, 328–331. DOI:10.1198/000313006X152649.
  • Gigerenzer, G. (1987), The Probabilistic Revolution, Vol. II: Ideas in the Sciences, Cambridge, MA: MIT Press.
  • Gigerenzer, G. (2004), “Mindless Statistics,” Journal of Socio-Economics, 33, 587–606. DOI:10.1016/j.socec.2004.09.033.
  • Gigerenzer, G. (2018), “Statistical Rituals: The Replication Delusion and How We Got There,” Advances in Methods and Practices in Psychological Science, 1, 198–218. DOI:10.1177/2515245918771329.
  • Gigerenzer, G., Krauss, S., and Vitouch, O. (2004), “The Null Ritual: What You Always Wanted to Know About Null Hypothesis Testing But Were Afraid to Ask,” in Handbook on Quantitative Methods in the Social Sciences, Thousand Oaks, CA: Sage Publications, Inc., pp. 389–406.
  • Gill, J. (1999), “The Insignificance of Null Hypothesis Significance Testing,” Political Research Quarterly, 52, 647–674. DOI:10.1177/106591299905200309.
  • Greenland, S. (2017), “Invited Commentary: The Need for Cognitive Science in Methodology,” American Journal of Epidemiology, 186, 639–646. DOI:10.1093/aje/kwx259.
  • Greenland, S., and Poole, C. (2013), “Living With Statistics in Observational Research,” Epidemiology, 24, 73–78. DOI:10.1097/EDE.0b013e3182785a49.
  • Haller, H., and Krauss, S. (2002), “Misinterpretations of Significance: a Problem Students Share With Their Teachers?,” Methods of Psychological Research, 7, 1–20, http://www.mpr-online.de.
  • Holman, C. J., Arnold-Reed, D. E., de Klerk, N., McComb, C., and English, D. R. (2001), “A Psychometric Experiment in Causal Inference to Estimate Evidential Weights Used by Epidemiologists,” Epidemiology, 12, 246–255. DOI:10.1097/00001648-200103000-00019.
  • Hubbard, R. (2004), “Alphabet Soup: Blurring the Distinctions Between p’s and α’s in Psychological Research,” Theory and Psychology, 14, 295–327. DOI:10.1177/0959354304043638.
  • Hubbard, R., and Lindsay, R. M. (2008), “Why p Values Are Not a Useful Measure of Evidence in Statistical Significance Testing,” Theory and Psychology, 18, 69–88. DOI:10.1177/0959354307086923.
  • Hunter, J. E. (1997), “Needed: A Ban on the Significance Test,” Psychological Science, 8, 3–7. DOI:10.1111/j.1467-9280.1997.tb00534.x.
  • Hurlbert, S. H., and Lombardi, C. M. (2009), “Final Collapse of the Neyman–Pearson Decision Theoretic Framework and Rise of the Neofisherian,” Annales Zoologici Fennici, 46, 311–349. DOI:10.5735/086.046.0501.
  • Ioannidis, J. P. A. (2005), “Why Most Published Research Findings Are False,” PLoS Medicine, 2, e124. DOI:10.1371/journal.pmed.0020124.
  • Johnson, V. E. (2013a), “Revised Standards for Statistical Evidence,” Proceedings of the National Academy of Sciences of the United States of America, 110, 19313–19317. DOI:10.1073/pnas.1313476110.
  • Johnson, V. E. (2013b), “Uniformly Most Powerful Bayesian Tests,” Annals of Statistics, 41, 1716–1741. DOI:10.1214/13-AOS1123.
  • Kamary, K., Mengersen, K., Robert, C., and Rousseau, J. (2014), “Testing Hypotheses as a Mixture Estimation Model,” Technical Report, https://arxiv.org/pdf/1214.4436.pdf.
  • Lehmann, E. L. (1993), Testing Statistical Hypotheses, New York: Chapman and Hall.
  • Lemoine, N. P., Hoffman, A., Felton, A. J., Baur, L., Chaves, F., Gray, J., Yu, Q., and Smith, M. D. (2016), “Underappreciated Problems of Low Replication in Ecological Field Studies,” Ecology, 97, 2554–2561. DOI:10.1002/ecy.1506.
  • McCloskey, D. N., and Ziliak, S. (1996), “The Standard Error of Regression,” Journal of Economic Literature, 34, 97–114.
  • McShane, B. B., and Böckenholt, U. (2014), “You Cannot Step Into the Same River Twice: When Power Analyses Are Optimistic,” Perspectives on Psychological Science, 9, 612–625. DOI:10.1177/1745691614548513.
  • McShane, B. B., and Böckenholt, U. (2017), “Single Paper Meta-Analysis: Benefits for Study Summary, Theory-Testing, and Replicability,” Journal of Consumer Research, 43, 1048–1063.
  • McShane, B. B., and Böckenholt, U. (2018), “Multilevel Multivariate Meta-Analysis With Application to Choice Overload,” Psychometrika, 83, 255–271. DOI:10.1007/s11336-017-9571-z.
  • McShane, B. B., and Gal, D. (2016), “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence,” Management Science, 62, 1707–1718. DOI:10.1287/mnsc.2015.2212.
  • McShane, B. B., and Gal, D. (2017), “Statistical Significance and the Dichotomization of Evidence,” Journal of the American Statistical Association, 112, 885–895. DOI:10.1080/01621459.2017.1289846.
  • Meehl, P. E. (1978), “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology,” Journal of Consulting and Clinical Psychology, 46, 806–834. DOI:10.1037/0022-006X.46.4.806.
  • Meehl, P. E. (1990), “Why Summaries of Research on Psychological Theories Are Often uninterpretable,” Psychological Reports, 66, 195–244. DOI:10.2466/pr0.1990.66.1.195.
  • Mitchell, S., Gelman, A., Ross, R., Chen, J., Bari, S., Huynh, U. K., Harris, M. W., Sachs, S. E., Stuart, E. A., Feller, A., and Makela, S. (2018), “The Millennium Villages Project: A Retrospective, Observational, Endline Evaluation,” The Lancet Global Health, 6, e500–e513. DOI:10.1016/S2214-109X(18)30065-2.
  • Morrison, D. E., and Henkel, R. E. (1970), The Significance Test Controversy, Chicago: Aldine.
  • Oakes, M. (1986), Statistical Inference: A Commentary for the Social and Behavioral Sciences, New York: Wiley.
  • Pericchi, L., Pereira, C. A., and Pérez, M.-E. (2014), “Adaptive Revised Standards for Statistical Evidence,” Proceedings of the National Academy of Sciences of the United States of America, 111, E1935–E1935. DOI:10.1073/pnas.1322191111.
  • Resnick, B. (2017), “What a Nerdy Debate About p-Values Shows About Science—And How to Fix It,” Technical Report.
  • Rosnow, R. L., and Rosenthal, R. (1989), “Statistical Procedures and the Justification of Knowledge in Psychological Science,” American Psychologist, 44, 1276–1284. DOI:10.1037/0003-066X.44.10.1276.
  • Rozeboom, W. W. (1960), “The Fallacy of the Null-Hypothesis Significance Test,” Psychological Bulletin, 57, 416–428.
  • Sawyer, A. G., and Peter, J. P. (1983), “The Significance of Statistical Significance Tests in Marketing Research,” Journal of Marketing Research, 20, 122–133. DOI:10.1177/002224378302000203.
  • Schmidt, F. L. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers,” Psychological Methods, 1, 115–129. DOI:10.1037/1082-989X.1.2.115.
  • Senn, S. S. (2001), “Two Cheers for p-Values?,” Journal of Epidemiology and Biostatistics, 6, 193–204.
  • Serlin, R. C., and Lapsley, D. K. (1993), “Rational Appraisal Psychological Research and the Good Enough Principle,” in A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, Hillsdale, NJ: Lawrence Erlbaum Associates.
  • Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22, 1359–1366. DOI:10.1177/0956797611417632.
  • Skipper, J. K., Guenther, A. L., and Nass, G. (1967), “The Sacredness of .05: A Note Concerning the Uses of Statistical Levels of Significance in Social Science,” The American Sociologist, 5, 16–18.
  • Smaldino, P. E., and McElreath, R. (2016), “The Natural Selection of Bad Science,” Technical Report, https://arxiv.org/pdf/1605.09511v1.pdf.
  • Tackett, J. L., Kushner, S. C., Herzhoff, K., Smack, A. J., and Reardon, K. W. (2014), “Viewing Relational Aggression Through Multiple Lenses: Temperament, Personality, and Personality Pathology,” Development and Psychopathology, 26, 863–877. DOI:10.1017/S0954579414000443.
  • Trangucci, R., Ali, I., Gelman, A., and Rivers, D. (2018), “Voting Patterns in 2016: Exploration Using Multilevel Regression and Poststratification (MRP) on Pre-Election Polls,” arXiv preprint arXiv:1802.00842.
  • Tukey, J. W. (1991), “The Philosophy of Multiple Comparisons,” Statistical Science, 6, 100–116. DOI:10.1214/ss/1177011945.
  • Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133. DOI:10.1080/00031305.2016.1154108.
  • Yule, G. U., and Kendall, M. G. (1950), An Introduction to the Theory of Statistics (14th ed.), London: Griffin.

Appendix A

Problems With Uniformly Most Powerful Bayesian Tests

The mathematical justification underlying the Benjamin et al. (Citation2018) proposal has come under no small amount of criticism. Specifically, the UMPBTs that underlie the proposal were introduced and defended by Johnson (Citation2013b) in parallel with his call in Johnson (Citation2013a)—and now repeated in Benjamin et al. (Citation2018)—to use 0.005 as the new threshold. We see a number of concerns with UMPBTs.

First, and perhaps most relevant for the biomedical and social sciences, the UMPBT approach is deeply entrenched in the century-old Neyman–Pearson formalism of binary decisions and 0–1 loss functions. As Pericchi, Pereira, and Pérez (Citation2014) note, “the essence of the problem of classical testing of significance lies in its goal of minimizing Type II error (false negative) for a fixed Type I error (false positive).” While this formalism allows for mathematical optimization under some restricted collection of distributions and testing problems, it is quite rudimentary from a decision-theoretic point of view, even to the point of failing to serve most of the purposes for which a sharp point null hypothesis test is conducted.

More specifically, the 0–1 loss function implicit in the NHST paradigm does not in general map, even in an approximate way, to processes of scientific learning or costs and benefits. Consequently, the logic underlying the proposal to move to a lower p-value threshold avoids firmly confronting the nature of the issue: any such rule implicitly expresses a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes (Gelman and Robert Citation2014) which in turn depend on the problem at hand and which vary tremendously across studies and domains. Instead, the UMPBT is based on a minimax prior that does not correspond to any distribution of effect sizes but rather represents a worst case scenario under a set of mathematical assumptions.
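
To see concretely why the tradeoff cannot be fixed once and for all, consider a stylized decision-theoretic illustration of ours (not part of the UMPBT framework): with loss $c_{\mathrm{I}}$ for a false positive, loss $c_{\mathrm{II}}$ for a false negative, and zero loss for correct decisions, the rule that minimizes posterior expected loss is

$$\text{reject } H_0 \;\Longleftrightarrow\; c_{\mathrm{I}}\, P(H_0 \mid y) < c_{\mathrm{II}}\, P(H_1 \mid y) \;\Longleftrightarrow\; P(H_1 \mid y) > \frac{c_{\mathrm{I}}}{c_{\mathrm{I}} + c_{\mathrm{II}}}.$$

The implied evidence threshold is a function of the cost ratio alone, and because costs and benefits differ enormously across studies and domains, no single cutoff, whether 0.05 or 0.005, can encode this tradeoff in an absolute and acontextual way.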

Second, there is no reason for non-Bayesians to adopt UMPBTs when they can instead rely on the standard Neyman–Pearson approach to uniformly most powerful (non-Bayesian) tests.

Third, making the procedure depend on a threshold (γ in the notation of Johnson (Citation2013b)) replicates the fundamental difficulty of the century-old Fisherian approach to hypothesis testing. To further seek full agreement with the classical rejection region, as advocated by Johnson (Citation2013b), simply negates the appeal of a truly Bayesian approach to this issue; moreover, such agreement is impossible to achieve for realistic statistical models.

Fourth, the construction of a UMPBT relies on the assumption of a “true” prior, which can be criticized in the vast majority of cases and which in any case moves one away from the Bayesian paradigm: with a single and “true” prior, the Bayesian model becomes an errors-in-variables model.

Fifth, the argument to maximize a probability for the Bayes factor to exceed a certain threshold also moves one away from the Bayesian paradigm because: (i) it ignores the motives for running the NHST and the subsequent steps taken in decision making or inference; (ii) it further negates any prior modeling of the alternative hypothesis aimed at separating the parameter space into regions of different (prior) likelihood; (iii) it does not condition upon the actual observations but instead integrates over the observation space and hence may fall afoul of the likelihood principle; (iv) it posits a single and fixed threshold γ for rejecting the null when there is no reason for γ not to depend on the observed data, as also argued above; (v) the maximization step eliminates the role of the prior distribution, as also argued above; (vi) in the rare one-dimensional settings where the maximization step can be conducted in closed form, the solution is a distribution with finite support; (vii) in the event the null hypothesis is rejected, the uniformly most powerful prior (or alt-prior) corresponding to the alternative cannot be used as such in subsequent inference but must instead be replaced with a regular prior over the whole parameter space—a strong violation of Bayesian coherence.

Sixth, speaking more generally, the concept of uniformly most powerful priors (and tests) does not easily extend to multivariate settings and even less to realistic cases that involve complex null hypotheses that contain nuisance parameters. The first solution proposed in Johnson (Citation2013b), to integrate out the nuisance parameters in the null hypothesis using a specific prior distribution, falls short of solving the issue of “objective Bayesian tests.” The second solution, namely to replace the unknown nuisance parameters with standard estimates, stands even farther from a Bayesian perspective.

Indeed, the Bayes factor itself is a consequence of the rudimentary Neyman–Pearson formalism, which as such caters to the issue of statistical significance. A discussion of the difficulties with this from a Bayesian perspective is provided in Kamary et al. (Citation2014), with a proposal of setting the hypothesis problem as one of mixture estimation.

Seventh, Johnson (Citation2013b) contains very little support for the asymptotic relevance of the approach, beyond the limiting normal distribution of the uniformly most powerful log Bayes factor and the convergence of the support to the “true” value of the parameter.

In closing, we note that many of our criticisms of the Johnson (Citation2013b) approach relate to the fact that it falls short of being truly Bayesian. However, we do not mean to say that hypothesis testing must be done in a Bayesian manner. Rather, we emphasize this because, to the extent that the Johnson (Citation2013b) approach loses its Bayesian connection, it also loses a Bayesian justification for the 0.005 rule. Consequently, 0.005 becomes just another arbitrary threshold, justified by some implicit tradeoff between false positives and negatives which we think does not make sense in any absolute and acontextual way.

Appendix B

Case Study

In the context of a hypothetical case study on the effects of sodium on blood pressure, we discuss how authors as well as editors and reviewers might follow our recommendation to demote the p-value from its threshold screening role and instead treat it continuously along with the currently subordinate factors—related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain—as just one among many pieces of evidence.

We recommend authors use the currently subordinate factors to motivate their data collection, statistical analysis, interpretation of results, writing, and related matters. In this example, the authors might consider related prior evidence that indicates the importance of blood pressure as a marker for healthy arteries, suggests the role of sodium in hemodynamics, and so forth. This evidence might also reveal a plausible mechanism, namely that, to excrete excess sodium, the body must increase blood pressure.

In terms of study design and data quality, the authors might consider various possibilities for data collection. How should they recruit subjects? Should they randomize them to a low-sodium versus high-sodium diet? Or should they track them longitudinally, say via routine annual checkups over the course of years? Or are such data already available from some prior study? When and how often should sodium and blood pressure be measured? And how? The authors might measure sodium through a dietary recall questionnaire (noisy), through asking participants to maintain a food diary (somewhat less noisy), or through collection of urine to measure urinary sodium excretion (precise but restricted to a limited time point). Likewise, for blood pressure, they might rely on measurements conducted by someone convenient, such as friends or family members of the subjects, who likely lack formal clinical training (noisy), or by paid clinicians instructed on the proper protocol for blood pressure measurement (precise but expensive).
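
Why measurement quality matters for the eventual estimates can be made concrete with a small simulation. The sketch below uses entirely hypothetical numbers and variable names of our own choosing; it shows the familiar attenuation of a regression slope as the sodium measurement becomes noisier:

```python
# Illustrative simulation (hypothetical numbers): noisier sodium measurement
# attenuates the estimated sodium-blood pressure association.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_slope = 2.0                                       # hypothetical mmHg increase per unit sodium

sodium = rng.normal(0, 1, n)                           # "true" sodium intake (standardized)
bp = 120 + true_slope * sodium + rng.normal(0, 10, n)  # blood pressure

for label, noise_sd in [("urinary excretion (precise)", 0.1),
                        ("food diary (somewhat noisy)", 0.5),
                        ("dietary recall (noisy)", 1.0)]:
    measured = sodium + rng.normal(0, noise_sd, n)     # error-laden measurement
    slope = np.polyfit(measured, bp, 1)[0]             # simple least-squares slope
    print(f"{label:30s} estimated slope = {slope:.2f} (true = {true_slope})")
```

With the noisiest measurement, the estimated association is roughly half the true one in this simulation, which is precisely the kind of design consideration that should inform both the analysis and the interpretation of its results.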

Suppose the authors hypothesize a positive association between sodium consumption and high blood pressure. For the moment, let us assume that—while eschewing the NHST paradigm and the p-value thresholds intrinsic to it—the authors nonetheless perform a statistical analysis that results in a p-value. Further, let us assume they obtain a p-value of 0.001. How should this impact their interpretation of results, writing, and statistical decision making more broadly? Certainly, they have gained support for their hypothesis. However, can they conclude sodium is associated with—or even causes—high blood pressure as they would under the NHST paradigm?

Well, it would depend on the context and limitations of the study design and data quality. For example, supposing the study took place in Japan, perhaps the association exists in the Japanese subject population studied but not in European populations, whether because of genetic differences between the two populations or because of dietary differences (e.g., dietary sodium levels are much higher among the Japanese, so the association might not hold at levels typical among Europeans).

In terms of a causal interpretation, this would depend on related prior evidence, plausibility of mechanism, and study design and data quality. If prior studies show consistent and strong associations between sodium consumption and blood pressure, if evidence from physiological studies and animal models is consistent with an effect of sodium consumption on blood pressure, or if sodium levels were randomized, this increases the support for a causal role of sodium in increasing blood pressure.

Given, say, that the causal interpretation holds and holds broadly, the authors could then consider clinical significance, that is, real world costs and benefits. This depends not at all on a p-value but on the estimates of the magnitude of the effect—not only on blood pressure but also on downstream outcomes such as cardiovascular disease—as well as the uncertainty in them. It also depends on the costs of potential interventions such as lower sodium diets and drugs. They could also discuss novelty of finding in light of all of the above.
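
As an illustration of this kind of reasoning, the following back-of-the-envelope sketch uses entirely hypothetical numbers (the posterior for the blood pressure effect, the baseline risk, and the mapping from blood pressure to cardiovascular risk are all assumptions of ours) to show how estimated magnitudes and their uncertainty, rather than a p-value, drive an assessment of real world costs and benefits:

```python
# Entirely hypothetical numbers: clinical significance hinges on the effect
# magnitude and its uncertainty plus downstream consequences, not on a p-value.
import numpy as np

rng = np.random.default_rng(1)
draws = 100_000

# Assumed posterior for the diet's effect on systolic blood pressure (mmHg reduction)
bp_reduction = rng.normal(loc=2.0, scale=1.0, size=draws)

# Assumed mapping: each 1 mmHg reduction lowers 10-year cardiovascular risk by 2%
baseline_risk = 0.10
events_averted = 10_000 * baseline_risk * 0.02 * bp_reduction  # per 10,000 people

lo, mid, hi = np.percentile(events_averted, [2.5, 50, 97.5])
print(f"events averted per 10,000 over 10 years: {mid:.0f} "
      f"(95% interval {lo:.0f} to {hi:.0f})")
```

Whether the resulting range of benefits justifies a given intervention then depends on its costs, which again is a judgment no p-value can make on its own.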

Now, let us assume they had instead obtained a p-value of 0.2. Can they conclude sodium is not associated with high blood pressure as they would under the NHST paradigm? Again, this would depend on all the factors discussed above. For example, perhaps the association does not exist in the Japanese population but does in European ones and so on.

There are two key points here. First, results need not first have a p-value or some other purely statistical measure that attains some threshold before consideration is given to the currently subordinate factors. Instead, and as illustrated above, statistical measures should be considered along with the currently subordinate factors as just one among many pieces of evidence and should not take priority, thereby yielding a more holistic view of the evidence. Second, statistical measures should be treated continuously in this more holistic view of the evidence. Specifically, a lower p-value constitutes continuously stronger evidence—and this holds regardless of the level of the p-value. Further, this continuously stronger evidence can be weighed along with the strengths and weaknesses of the currently subordinate factors in assessing the level of support for a hypothesis.
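
The continuity point can be seen in a couple of lines of code. In the sketch below (an illustration of ours, not an analysis from the hypothetical study), each p-value is converted to the two-sided z-statistic it implies; nothing special happens in the neighborhood of 0.05:

```python
# The p-value is a continuous measure: neighboring p-values correspond to
# nearly identical test statistics, so 0.049 versus 0.051 is not a meaningful divide.
from scipy.stats import norm

for p in [0.20, 0.06, 0.051, 0.049, 0.01, 0.001]:
    z = norm.isf(p / 2)  # two-sided z-statistic implied by this p-value
    print(f"p = {p:<6} ->  |z| = {z:.2f}")
```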

Of course, we believe not only that the authors’ statistical analysis should not be restricted to the NHST paradigm and the p-value thresholds intrinsic to it but also that it need not—and often should not—even result in a p-value (i.e., because it seldom makes sense to calibrate evidence as a function of the p-value). As noted, we recommend authors report all of their data and relevant results rather than focusing on single comparisons that attain some p-value or other statistical threshold. In this context, this might involve modeling the association between sodium and blood pressure as a function of additional health and dietary variables, demographic variables, and geography using a multilevel model. Such a model would not yield a single p-value that invites dichotomous declarations of truth or falsity—binary statements about there being “an effect” or “no effect.” Instead, it would yield many estimates that vary based on, for example, health and dietary variables, demographic variables, and geography, as well as the uncertainty in these estimates. Indeed, by accepting uncertainty and embracing variation in effects, the authors would uncover and present a much richer and more nuanced story about the association between sodium and blood pressure.
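
To fix ideas, here is a minimal sketch of the kind of multilevel analysis described above. It uses simulated data and the MixedLM routine in the statsmodels Python library; all variable names, group structure, and numbers are illustrative assumptions of ours rather than features of any real study:

```python
# Minimal multilevel sketch: varying intercepts and varying sodium slopes by region.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_regions, n_per = 8, 250
region = np.repeat(np.arange(n_regions), n_per)
region_slope = rng.normal(2.0, 0.8, n_regions)   # sodium effect varies by region
sodium = rng.normal(0, 1, n_regions * n_per)
age = rng.uniform(30, 70, n_regions * n_per)
bp = (110 + 0.3 * age + region_slope[region] * sodium
      + rng.normal(0, 8, n_regions * n_per))

df = pd.DataFrame({"bp": bp, "sodium": sodium, "age": age, "region": region})

# Random intercepts and random sodium slopes across regions
model = smf.mixedlm("bp ~ sodium + age", data=df,
                    groups=df["region"], re_formula="~sodium")
fit = model.fit()
print(fit.summary())   # a set of estimates and uncertainties, not a single p-value
```

The fitted model reports region-level intercepts and sodium slopes together with their uncertainties, which is precisely the kind of output that encourages discussing how the association varies rather than declaring it simply present or absent.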

Turning to editors and reviewers, we recommend they explicitly evaluate papers with regard to not only purely statistical measures but also the currently subordinate factors. How might this work? We envision it would be rather similar to the above but in reverse. Specifically, editors and reviewers evaluating the authors’ paper on sodium and blood pressure would systematically assess, and possibly even indicate the weight they assign to, each of the following: How does the paper fit in with and build upon related prior evidence? Is the mechanism plausible? Are the study design and data quality sufficient to justify the conclusions? What are the implications in terms of real world costs and benefits? How novel are the findings? And, of course, how appropriate are the statistical analyses employed and how strong is the statistical support, whether in the form of a p-value or some other measure, resulting from these analyses?

In this more holistic view of the evidence, statistical measures are just one among many pieces of evidence considered by editors and reviewers and do not take priority. Of course, this does not mean that they cannot or will not strongly impact or alter evaluation decisions. For example, in the context of the authors’ paper on sodium and blood pressure, strong statistical support, whether in the form of a low p-value or otherwise, for a finding that sodium consumption is associated with low blood pressure—the direction opposite of that indicated by prior evidence—might, in the context of a high quality study design featuring large samples and precise measurements, be deemed more novel and worthy of publication than if the statistical support had been weaker or if the finding had been in the same direction as that indicated by prior evidence.

In sum, authors as well as reviewers and editors need not use statistical significance as a lexicographic decision rule. Results need not first have a p-value or some other purely statistical measure that attains some threshold before consideration is given to the currently subordinate factors. Instead, treated continuously, statistical measures should be considered along with the currently subordinate factors as just one among many pieces of evidence and should not take priority, thereby yielding a more holistic view of the evidence.