
Abandon Statistical Significance

Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert & Jennifer L. Tackett
Pages 235-245 | Received 30 Oct 2017, Accepted 06 Sep 2018, Published online: 20 Mar 2019

ABSTRACT

We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. We offer recommendations for how our proposal can be implemented in the scientific publication process as well as in statistical decision making more broadly.

1 The Status Quo and Two Alternatives

The biomedical and social sciences are facing a widespread crisis with published findings failing to replicate at an alarming rate. Often, such failures to replicate are associated with claims of huge effects from subtle, sometimes even preposterous, interventions. Further, the primary evidence adduced for these claims is one or more comparisons that are anointed “statistically significant”—typically defined as comparisons with p-values less than the conventional 0.05 threshold relative to the sharp point null hypothesis of zero effect and zero systematic error.

Indeed, the status quo is that p < 0.05 is deemed as strong evidence in favor of a scientific theory and is required not only for a result to be published but even for it to be taken seriously. Specifically, statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain (for want of a better term, we hereafter refer to these collectively as the currently subordinate factors).

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap Citation2010; Bem Citation2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise. Consequently, low replication rates are to be expected given existing scientific practices (Ioannidis Citation2005; Smaldino and McElreath Citation2016), and calls for reform, which are not new (see, e.g., Meehl Citation1978), have become insistent.
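
To make the noise-chasing concern concrete, here is a minimal simulation sketch (our illustration in Python with NumPy and SciPy, not taken from the cited papers; the sample size and number of comparisons are assumed): with ten comparisons available, pure noise yields at least one p < 0.05 roughly 40% of the time.

```python
# Minimal sketch: how often pure noise produces at least one "statistically
# significant" comparison when k comparisons are available (actual or potential).
# All numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, n_sims = 50, 10, 5_000      # per-group sample size, comparisons, simulations
any_significant = 0
for _ in range(n_sims):
    # k independent two-group comparisons with zero true effect everywhere
    p_values = [
        stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        for _ in range(k)
    ]
    any_significant += min(p_values) < 0.05

print(f"P(at least one p < 0.05 | pure noise, {k} comparisons) "
      f"= {any_significant / n_sims:.2f}")   # close to 1 - 0.95**k, about 0.40
```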

One proposal, suggested by Benjamin and 71 coauthors including distinguished scholars from a wide variety of fields, is to redefine statistical significance, “to change the default p-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005” (Benjamin et al. Citation2018). While, as they note, “changing the p-value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance,” we believe this “quick fix,” this “dam to contain the flood” in the words of a prominent member of the 72 (Resnick Citation2017), is insufficient to overcome current difficulties with replication. Instead, we believe it opportune to proceed immediately with other measures, perhaps more radical and more difficult but also more principled and more permanent.

In particular, we propose to abandon statistical significance, to drop the null hypothesis significance testing (NHST) paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, rather than allowing statistical significance as determined by p < 0.05 (or some other threshold whether based on p-values, confidence intervals, Bayes factors, or some other purely statistical measure) to serve as a lexicographic decision rule in scientific publication and statistical decision making more broadly, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with the currently subordinate factors as just one among many pieces of evidence.

To be clear, we have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

In the remainder of this article, we discuss general problems with NHST that motivate our proposal to abandon statistical significance and that remain unresolved by the Benjamin et al. (Citation2018) proposal. We then discuss problems specific to the Benjamin et al. (Citation2018) proposal. We next offer recommendations for how, in practice, the p-value can be demoted from its threshold screening role and instead be considered as just one among many pieces of evidence in the scientific publication process as well as in statistical decision making more broadly. We conclude with a brief discussion.

2 Problems General to Null Hypothesis Significance Testing

2.1 Preface

As noted, the NHST paradigm upon which the status quo and the Benjamin et al. (Citation2018) proposal rest is the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences (see, e.g., Morrison and Henkel Citation1970; Sawyer and Peter Citation1983; Gigerenzer Citation1987; McCloskey and Ziliak Citation1996; Gill Citation1999; Anderson, Burnham, and Thompson Citation2000; Gigerenzer Citation2004; Hubbard Citation2004). Despite this, it has been roundly criticized both inside and outside of statistics over the decades (see, e.g., Rozeboom Citation1960; Bakan Citation1966; Meehl Citation1978; Serlin and Lapsley Citation1993; Cohen Citation1994; McCloskey and Ziliak Citation1996; Schmidt Citation1996; Hunter Citation1997; Gill Citation1999; Gigerenzer Citation2004; Gigerenzer, Krauss, and Vitouch Citation2004; Briggs Citation2016; McShane and Gal Citation2016, Citation2017). Indeed, the breadth of literature on this topic across time and fields makes a complete review intractable. Consequently, we focus on what we view as among the most important criticisms of NHST for the biomedical and social sciences.

2.2 Implausible Null Hypothesis

In the biomedical and social sciences, effects are typically small and vary considerably across people and contexts. In addition, measurements are often highly variable and only indirectly related to underlying constructs of interest; thus, even when sample sizes are large, the possibilities of systematic bias and variation can result in the equivalent of small or unrepresentative samples. Consequently, estimates from any single study—the typical fundamental unit of analysis—are themselves generally noisy.

In addition, the null hypothesis employed in the overwhelming majority of applications is the sharp point null hypothesis of zero effect—that is, no difference among two or more treatments or groups—and zero systematic error—which encompasses both the adequacy of the statistical model used to compute the p-value (e.g., in terms of functional form and distributional assumptions) and any and all forms of systematic or nonsampling error, which vary by field but include measurement error; problems with reliability and validity; biased samples; nonrandom treatment assignment; missingness; nonresponse; failure of double-blinding; noncompliance; and confounding.

The combination of these features of the biomedical and social sciences and this sharp point null hypothesis of zero effect and zero systematic error is highly problematic. Specifically, because effects are generally small and variable, the assumption of zero effect is false. Further, even were the assumption of zero effect true for some phenomenon, the effect under consideration in any study designed to examine this phenomenon would not be zero because measurements are generally noisy and systematically so. Consequently, the sharp point null hypothesis of zero effect and zero systematic error employed in the overwhelming majority of applications is implausible (Berkson Citation1938; Edwards, Lindman, and Savage Citation1963; Bakan Citation1966; Meehl Citation1990; Tukey Citation1991; Cohen Citation1994; Gelman et al. Citation2014; McShane and Böckenholt Citation2014; Gelman Citation2015) and thus uninteresting.
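
As a concrete illustration of why the zero-systematic-error component matters, consider the following simulation sketch (ours; the 0.05 standard deviation bias and the sample sizes are assumed purely for illustration): even when the true effect is exactly zero, a small uncorrected bias makes the sharp point null false, so a sufficiently large study will reject it.

```python
# Sketch: with a small uncorrected systematic error, the sharp point null of
# zero effect AND zero systematic error is false, so large samples reject it
# even though the true effect is exactly zero. Numbers are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bias = 0.05                        # e.g., measurement or selection bias of 0.05 SD
for n in [100, 1_000, 10_000, 100_000]:
    treated = rng.normal(loc=0.0 + bias, scale=1.0, size=n)   # true effect is zero
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    print(f"n = {n:>7,} per group: p = {stats.ttest_ind(treated, control).pvalue:.3g}")
# For large n the p-value is tiny: the test "detects" the bias, not an effect.
```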

These problems are exacerbated under a lexicographic decision rule for publication as per the status quo and the Benjamin et al. (Citation2018) proposal. Specifically, because noisy estimates that attain statistical significance are upwardly biased in magnitude (potentially to a large degree) and often of the wrong sign (Gelman and Carlin Citation2014), a lexicographic decision rule results in a tarnished literature. In addition, because many smaller, less resource-intensive, noisier studies are more likely to yield (or can be made more likely to yield; Simmons, Nelson, and Simonsohn Citation2011) one or more statistically significant results than fewer larger, more resource-intensive, better studies, a lexicographic decision rule at least indirectly encourages the former over the latter. These issues are compounded when researchers engage in multiple comparisons—whether actual or potential (i.e., the “garden of forking paths”; Gelman and Loken Citation2014).
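
The exaggeration and sign issues can be seen in a small simulation in the spirit of Gelman and Carlin (Citation2014) (our sketch; the true effect of 0.2 standard errors is an assumed illustrative value):

```python
# Sketch: when the true effect is small relative to the standard error, estimates
# that happen to reach p < 0.05 exaggerate the effect (Type M error) and are
# sometimes of the wrong sign (Type S error). Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(2)
true_effect, se, n_sims = 0.2, 1.0, 100_000
estimates = rng.normal(loc=true_effect, scale=se, size=n_sims)
significant = np.abs(estimates / se) > 1.96          # two-sided p < 0.05

sig_estimates = estimates[significant]
print(f"share reaching p < 0.05:            {significant.mean():.2f}")
print(f"exaggeration ratio (Type M):        {np.mean(np.abs(sig_estimates)) / true_effect:.1f}x")
print(f"wrong-sign share among significant: {np.mean(sig_estimates < 0):.2f}")
```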

In sum, various features of the biomedical and social sciences—for example, small and variable effects, systematic error, noisy measurements, a lexicographic decision rule for publication, and research practices—make NHST and in particular the sharp point null hypothesis of zero effect and zero systematic error particularly poorly suited for these domains.

2.3 Categorization of Evidence

NHST is associated with a number of problems related to the dichotomization of evidence into the different categories “statistically significant” and “not statistically significant” (or, sometimes, trichotomization with “marginally significant” as an intermediate category) depending upon where the p-value stands relative to certain conventional thresholds. Indeed, one well-known criticism of the NHST paradigm is that the conventional 0.05 threshold—or for that matter any other one—is entirely arbitrary (Fisher Citation1926; Yule and Kendall Citation1950; Cramer Citation1955; Cochran Citation1976; Cowles and Davis Citation1982).

A related line of criticism suggests that the problem is with having a threshold in the first place: the dichotomization (or trichotomization) of evidence into different categories of statistical significance itself has “no ontological basis” (Rosnow and Rosenthal Citation1989). Specifically, Rosnow and Rosenthal (Citation1989) note that “from an ontological viewpoint, there is no sharp line between a ‘significant’ and a ‘nonsignificant’ difference; significance in statistics…varies continuously between extremes” and thus advocate “view[ing] the strength of evidence for or against the null as a fairly continuous function of the magnitude of p.”

While we agree treating the p-value continuously rather than in a thresholded manner constitutes an improvement, we go further and argue that it seldom makes sense to calibrate evidence as a function of the p-value. We hold this for at least three reasons. First, and in our view the most important, it seldom makes sense because the p-value is, in the overwhelming majority of applications, defined relative to the generally implausible and uninteresting sharp point null hypothesis of zero effect and zero systematic error. Second, because it is a poor measure of the evidence for or against a statistical hypothesis (Edwards, Lindman, and Savage Citation1963; Berger and Sellke Citation1987; Cohen Citation1994; Hubbard and Lindsay Citation2008). Third, because it tests the hypothesis that one or more model parameters equal the tested values—but only given all other model assumptions. These other assumptions—in particular, zero systematic error—seldom hold (or are at least far from given) in the biomedical and social sciences. Consequently, “a small p-value only signals that there may be a problem with at least one assumption, without saying which one. Asymmetrically, a large p-value only means that this particular test did not detect a problem—perhaps because there is none, or because the test is insensitive to the problems, or because biases and random errors largely canceled each other out” (Greenland Citation2017). We note similar considerations hold for other purely statistical measures.

2.4 Erroneous Scientific Reasoning

The NHST paradigm and the p-value thresholds intrinsic to it are not only problematic in and of themselves but also routinely result in erroneous scientific reasoning. For example, researchers typically take the rejection of the sharp point null hypothesis of zero effect and zero systematic error as positive or even definitive evidence in favor of some preferred alternative hypothesis—a logical fallacy. In addition, they often draw scientific conclusions largely if not entirely based on whether or not a p-value crosses the 0.05 threshold instead of taking a more holistic view of the evidence that includes the consideration of the currently subordinate factors. Further, they often confuse statistical significance and practical importance (see, e.g., Freeman Citation1993). Finally, they often incorrectly believe a result with a p-value below 0.05 is evidence that a relationship is causal (Holman et al. Citation2001).

In addition, because the assignment of evidence to different categories (e.g., statistically significant and not statistically significant) is a strong inducement to the conclusion that the items thusly assigned are categorically different, NHST encourages researchers to engage in dichotomous thinking, that is, to interpret evidence dichotomously rather than continuously. Specifically, researchers interpret evidence that reaches the conventional threshold for statistical significance as a demonstration of a difference, and, in contrast, they interpret evidence that fails to reach this threshold as a demonstration of no difference.

An example of erroneous reasoning resulting from dichotomous thinking is provided by Gelman and Stern (Citation2006) who show that applied researchers often fail to appreciate that “the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” Additional examples are provided by McShane and Gal (Citation2016) who show that researchers across a wide variety of fields including medicine, epidemiology, cognitive science, psychology, and economics (i) interpret p-values dichotomously rather than continuously, focusing solely on whether or not the p-value is below 0.05 rather than the magnitude of the p-value; (ii) fixate on p-values even when they are irrelevant, for example, when asked about descriptive statistics; and (iii) ignore other evidence, for example, the magnitude of treatment differences. McShane and Gal (Citation2017) show that even statisticians are susceptible to these errors.
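
A short numerical sketch of the Gelman and Stern (Citation2006) point (the specific estimates and standard errors below are illustrative, in the spirit of their discussion):

```python
# Sketch: study A is "significant," study B is not, yet the difference between
# their estimates is nowhere near significant. Numbers are illustrative.
import numpy as np
from scipy import stats

def two_sided_p(estimate, se):
    return 2 * stats.norm.sf(abs(estimate / se))

est_a, se_a = 25.0, 10.0            # study A: z = 2.5
est_b, se_b = 10.0, 10.0            # study B: z = 1.0
diff, se_diff = est_a - est_b, np.hypot(se_a, se_b)   # independent studies

print(f"study A:    p = {two_sided_p(est_a, se_a):.3f}   (below 0.05)")
print(f"study B:    p = {two_sided_p(est_b, se_b):.3f}   (well above 0.05)")
print(f"difference: p = {two_sided_p(diff, se_diff):.3f} (also well above 0.05)")
```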

2.5 Misinterpretation of the p-Value

A final criticism of the NHST paradigm pertains to common misinterpretations of the p-value. While formally defined as the probability of observing data as extreme or more extreme than that actually observed assuming the null hypothesis is true, the p-value has often been misinterpreted as, inter alia, (i) the probability that the null hypothesis is true, (ii) one minus the probability that the alternative hypothesis is true, or (iii) one minus the probability of replication. For example, Gigerenzer (Citation2004) reports an example of research conducted on psychology professors, lecturers, teaching assistants, and students. Subjects were given the result of a simple t-test of two independent means (t = 2.7, df = 18, p = 0.01) and were asked six true or false questions based on the result and designed to test common misinterpretations of the p-value. All six of the statements were false and, despite the fact that the study materials noted “several or none of the statements may be correct,” (i) none of the 45 students, (ii) only four of the 39 professors and lecturers who did not teach statistics, and (iii) only six of the 30 professors and lecturers who did teach statistics marked all as false (members of each group marked an average of 3.5, 4.0, and 4.1 statements, respectively, as false). For related results, see Oakes (Citation1986); Cohen (Citation1994); Haller and Krauss (Citation2002); Gigerenzer (Citation2018).
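
For readers who want the formal definition made operational, the following sketch (our example; it does not reproduce the Gigerenzer study materials) computes a p-value both analytically and as a tail probability under the null model, which is all the p-value is:

```python
# Sketch: the p-value is the probability, computed under the null model, of a
# test statistic as extreme or more extreme than the one observed. It is NOT
# P(null is true), not 1 - P(alternative), and not 1 - P(replication).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=1.0, size=10)    # some observed data
y = rng.normal(loc=0.0, scale=1.0, size=10)
observed = stats.ttest_ind(x, y)

# Re-create the null model by simulation: both groups drawn from one distribution.
t_null = np.array([
    stats.ttest_ind(rng.normal(size=10), rng.normal(size=10)).statistic
    for _ in range(20_000)
])
p_simulated = np.mean(np.abs(t_null) >= abs(observed.statistic))
print(f"analytic p = {observed.pvalue:.3f}, simulated tail probability = {p_simulated:.3f}")
```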

3 Problems Specific to the Benjamin et al. (Citation2018) Proposal

Beyond concerns about the NHST paradigm upon which the status quo and the Benjamin et al. (Citation2018) proposal rest, there are additional problems specific to the latter proposal. First, Benjamin et al. (Citation2018) propose the 0.005 threshold because it (i) “corresponds to Bayes factors between approximately 14 and 26” in favor of the alternative hypothesis and (ii) “would reduce the false positive rate to levels we judge to be reasonable.” However, little to no justification is provided for either of these choices of levels.

Second, Benjamin et al. (Citation2018) “restrict [their] recommendation to claims of discovery of new effects” which is problematic for at least two reasons. First, the proposed policy is rendered entirely impractical because they fail to define what constitutes a new effect; this is especially so in domains where research is believed to be incremental and cumulative. Second, the proposed policy would lead to incoherence when applied to replication—the very issue their proposal is meant to address. In particular, the order in which two independent studies of a common phenomenon are conducted ought to be irrelevant but is not under the Benjamin et al. (Citation2018) proposal. Specifically, given one study with p < 0.005 and another with p ∈ (0.005,0.05), it would matter crucially which study was conducted first (and thus was “new”) under the definition of replication employed in practice (i.e., a subsequent study is considered to successfully replicate a prior study if either both fail to attain statistical significance or both attain statistical significance and are directionally consistent): the second (replication) study would be deemed a success under the Benjamin et al. (Citation2018) proposal if the first study was the p < 0.005 study but a failure otherwise.
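
The order dependence can be made explicit with a small sketch (a hypothetical helper function of ours, not anything proposed by Benjamin et al.; it simply encodes the definition of replication success described above):

```python
# Sketch: the same two p-values yield opposite replication verdicts depending on
# which study happened to come first (and so counts as the "new" discovery).
def replication_success(p_first, p_second, alpha_new=0.005, alpha_rep=0.05):
    """Second study replicates the first if both attain significance or both fail
    to (direction assumed consistent); the 'new' study faces the 0.005 threshold,
    the replication the conventional 0.05."""
    return (p_first < alpha_new) == (p_second < alpha_rep)

print(replication_success(0.003, 0.020))   # True: both significant -> success
print(replication_success(0.020, 0.003))   # False: first is not, second is -> failure
```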

Third, the fact that uncorrected multiple comparisons—both actual and potential—are the norm in applied research strictly speaking invalidates all p-values outside those from studies with preregistered protocols and data analysis procedures. This concern is acknowledged by Benjamin et al. (Citation2018). Nonetheless, what goes unacknowledged is that even with preregistration, p-values can be invalidated if the underlying model that generated the p-value is misspecified in an important manner.

Fourth, the mathematical justification underlying the Benjamin et al. (Citation2018) proposal has come under no small amount of criticism. Specifically, the uniformly most powerful Bayesian tests (UMPBTs) that underlie the proposal were introduced and defended by Johnson (Citation2013b) in parallel with his call in Johnson (Citation2013a)—and now repeated in Benjamin et al. (Citation2018)—to use 0.005 as the new threshold. We see a number of concerns with UMPBTs that we discuss in Appendix A. Perhaps most relevant for the biomedical and social sciences, the UMPBT approach is deeply entrenched in the century-old Neyman–Pearson formalism of binary decisions and 0–1 loss functions which does not in general map, even in an approximate way, to processes of scientific learning or costs and benefits. Consequently, the logic underlying the proposal to move to a lower p-value threshold avoids firmly confronting the nature of the issue: any such rule implicitly expresses a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes (Gelman and Robert Citation2014) which in turn depend on the problem at hand and which vary tremendously across studies and domains.

More speculatively, we are not convinced the more stringent 0.005 threshold for statistical significance would be helpful. In the short term, it could reduce the flow of low quality work that is currently polluting even top journals. In the medium term, it could motivate researchers to perform higher-quality work that is more likely to crack the 0.005 barrier. On the other hand, it could lead to even more overconfidence in results that do get published as well as a concomitant greater exaggeration of the effect sizes associated with such results. It could also lead to the discounting of important findings that happen not to reach the more stringent threshold. In sum, we have no idea whether implementation of the proposed 0.005 threshold would improve or degrade the state of science as we can envision both positive and negative outcomes resulting from it. Ultimately, while this question may be interesting if difficult to answer, we view it as outside our purview because we believe that thresholds whether based on p-values or other purely statistical measures are a bad idea.

Perhaps curiously, we do not necessarily expect that Benjamin et al. (Citation2018) would disagree with our criticism that their proposal is insufficient to overcome current difficulties with replication (or perhaps even with our own proposal to abandon statistical significance). After all, they “restrict [their] recommendation to claims of discovery of new effects” and recognize that “the choice of any particular threshold is arbitrary” and “should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of Type I versus Type II errors, and other factors that vary by research topic.” Indeed, “many of [the authors] agree that there are better approaches to statistical analyses than null hypothesis significance testing.”

4 Abandoning Statistical Significance

4.1 Summation and Recommendations

What can be done? Statistics is hard, especially when effects are small and variable and measurements are noisy. There are no quick fixes. Proposals such as changing the default p-value threshold for statistical significance, employing confidence intervals with a focus on whether or not they contain zero, or employing Bayes factors along with conventional classifications for evaluating the strength of evidence suffer from the same or similar issues as the current use of p-values with the 0.05 threshold. In particular, each implicitly or explicitly categorizes evidence based on thresholds relative to the generally implausible and uninteresting sharp point null hypothesis of zero effect and zero systematic error. Further, each is a purely statistical measure that fails to take a more holistic view of the evidence that includes the consideration of the currently subordinate factors, that is, related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.
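
To see why, for example, checking whether a 95% confidence interval contains zero is the same dichotomization as checking p < 0.05, consider this sketch (normal-approximation case, arbitrary simulated estimates):

```python
# Sketch: in the normal-approximation case, "the 95% CI excludes zero" and
# "p < 0.05" are the same binary verdict; both reduce to |estimate/SE| > 1.96
# against the sharp point null of zero effect and zero systematic error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
z_crit = stats.norm.ppf(0.975)                    # about 1.96
for _ in range(5):
    estimate, se = rng.normal(scale=2.0), rng.uniform(0.5, 2.0)
    p = 2 * stats.norm.sf(abs(estimate / se))
    excludes_zero = (estimate - z_crit * se > 0) or (estimate + z_crit * se < 0)
    print(f"p = {p:.3f}   CI excludes zero: {excludes_zero}   p < 0.05: {p < 0.05}")
# The last two columns always agree.
```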

In brief, each is a form of statistical alchemy that falsely promises to transmute randomness into certainty, an “uncertainty laundering” (Gelman Citation2016) that begins with data and concludes with dichotomous declarations of truth or falsity—binary statements about there being “an effect” or “no effect”—based on some p-value or other statistical threshold being attained. A critical first step forward is to begin accepting uncertainty and embracing variation in effects (Carlin Citation2016; Gelman Citation2016) and recognizing that we can learn much (indeed, more) about the world by forsaking the false promise of certainty offered by such dichotomization.

Toward this end, we offer recommendations for how, in practice, the p-value can be demoted from its threshold screening role and instead, treated continuously, be considered along with the currently subordinate factors as just one among many pieces of evidence. First, we recommend authors use the currently subordinate factors to motivate their data collection, statistical analysis, interpretation of results, writing, and related matters; we also recommend they analyze and report all of their data and relevant results. Second, we recommend editors and reviewers explicitly evaluate papers with regard to not only purely statistical measures but also the currently subordinate factors.

As a highly interdisciplinary research team with representation from statistics, political science, psychology, and consumer behavior, we are acutely aware that the implementation of our broad recommendations will and ought to vary tremendously across—and even within—domains. Further, we are not so supercilious as to believe that we, by ourselves, are capable of providing concrete and specific guidance on the application of these recommendations across all or perhaps even in any of these domains. Indeed, we do not believe a “template” for our recommendations is possible or desirable. In fact, such a template could even be dangerous in that it might result in a rote and recipe-like application of our recommendations that would not be entirely dissimilar to, even if perhaps less harmful than, the current practice of rote and recipe-like application of NHST. To those who might argue that, without such a template, our recommendations are unrealistic or unlikely to be adopted in practice, we reiterate that statistics is hard and a formulaic approach to statistics is a principal cause of the current replication crisis. It is for these reasons we advocate this more radical and more difficult but also more principled and more permanent approach. Nonetheless, we suggest below some broad principles that show how our recommendations might be applied. We also provide a case study in Appendix B.

4.2 For Authors

We recommend authors use the currently subordinate factors to motivate their data collection, statistical analysis, interpretation of results, writing, and related matters; we also recommend they analyze and report all of their data and relevant results rather than focusing on single comparisons that attain some p-value or other statistical threshold.

One specific operationalization of the first part of our recommendation might be to include in their manuscripts a section that directly addresses how each of the currently subordinate factors motivated their various decisions regarding data collection, statistical analysis, interpretation of results, and writing in the context of the totality of the data and results. Such a section could, for example, discuss study design in the context of subject-matter knowledge and expectations of effect sizes as discussed by Gelman and Carlin (Citation2014). It could also discuss the plausibility of the mechanism by (i) formalizing the hypothesized mechanism for the effect in question and expounding on the various components of it, (ii) clarifying which components were measured and analyzed in the study, and (iii) discussing aspects of the results that support as well as those that undermine the hypothesized mechanism.

One might think that the second part of our recommendation—to analyze and report all of the data and relevant results—is such a fundamental principle of science that it need hardly be mentioned. However, this is not the case! As discussed above, the status quo in scientific publication is a lexicographic decision rule whereby p < 0.05 is virtually always required for a result to be published and, while there are some exceptions, standard practice is to focus on such results and to not report all relevant findings.

Given the current state of practice, authors may not have a sense for how they might go about this. Rather than attempt to provide broad guidance, we direct the reader to illustrations in clinical psychology (Tackett et al. Citation2014), epidemiology (Gelman and Auerbach Citation2016a,Citationb), political science (Trangucci et al. Citation2018), program evaluation (Mitchell et al. Citation2018), and social psychology and consumer behavior (McShane and Böckenholt Citation2017) as well as our case study in Appendix B.

4.3 For Editors and Reviewers

We recommend editors and reviewers explicitly evaluate papers with regard to not only purely statistical measures but also the currently subordinate factors; this should be far superior to the status quo, namely giving consideration—often scant—to the currently subordinate factors only once some p-value or other statistical threshold has been reached.

One specific operationalization of this recommendation might be to incorporate consideration of these factors into various stages of the review process. For example, editors could require reviewers to provide quantitative evaluations of each factor—including domain-specific factors determined by the editor—as well as an overall quantitative evaluation of the strength of evidence as a supplement to the current open-ended, qualitative evaluations. These could then be weighted by the editors’ publicly disclosed (or even reviewers’ own) importance rating of each factor. Additionally, editors could discuss and address the evaluation and importance of each factor in decision letters, thereby providing a more holistic view of the evidence.
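
As one purely hypothetical illustration of such a scheme (the factor list, seven-point scale, and weights below are assumptions of ours, not a prescription), an editor's disclosed weights could be combined with a reviewer's factor ratings as follows:

```python
# Hypothetical sketch of combining reviewer ratings of the currently subordinate
# factors with editor-disclosed weights. All names, weights, and ratings are
# assumed for illustration.
editor_weights = {            # publicly disclosed by the editor; sums to 1
    "related prior evidence": 0.20,
    "plausibility of mechanism": 0.20,
    "study design and data quality": 0.25,
    "real world costs and benefits": 0.15,
    "novelty of finding": 0.10,
    "statistical evidence": 0.10,
}
reviewer_ratings = {          # one reviewer, each factor rated 1 (weak) to 7 (strong)
    "related prior evidence": 5,
    "plausibility of mechanism": 6,
    "study design and data quality": 4,
    "real world costs and benefits": 3,
    "novelty of finding": 6,
    "statistical evidence": 4,
}

overall = sum(w * reviewer_ratings[f] for f, w in editor_weights.items())
print(f"weighted overall evaluation: {overall:.2f} out of 7")
# The p-value enters, if at all, only through the "statistical evidence" rating,
# as one factor among many rather than as a screening threshold.
```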

One might object here and call our position naive: do not editors and reviewers require some bright-line threshold to decide whether the data supporting a claim are far enough from pure noise to support publication? Do not statistical thresholds provide objective standards for what constitutes evidence, and does this not in turn provide a valuable brake on the subjectivity and personal biases of editors and reviewers? Against these objections, we would argue that even were such a threshold needed, it would not make sense to set it based on the p-value given that it seldom makes sense to calibrate evidence as a function of this statistic and given that the costs and benefits of publishing noisy results vary by field. Additionally, the p-value is not a purely objective standard: different model specifications and statistical tests for the same data and null hypothesis yield different p-values; to complicate matters further, many subjective decisions regarding data protocols and analysis procedures such as coding and exclusion are required in practice and these often strongly impact the p-value ultimately reported. Finally, we fail to see why such a threshold screening rule is needed: editors and reviewers already make publication decisions one at a time based on qualitative factors, and this could continue to happen if the p-value were demoted from its threshold screening role to just one among many pieces of evidence. Indeed, no single number—whether it be a p-value, Bayes factor, or some other statistical or nonstatistical measure—is capable of eliminating subjectivity and personal biases.
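
The point that the same data and the same null hypothesis can yield different p-values under different reasonable analysis choices is easy to demonstrate (a sketch with simulated data; the particular specifications below are assumed examples of defensible choices):

```python
# Sketch: one simulated dataset, one null hypothesis of "no treatment effect,"
# several defensible analysis choices, generally different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
treatment = rng.integers(0, 2, size=n)
outcome = 0.3 * treatment + rng.normal(size=n)
treated, control = outcome[treatment == 1], outcome[treatment == 0]

p_equal_var = stats.ttest_ind(treated, control).pvalue               # pooled t-test
p_welch = stats.ttest_ind(treated, control, equal_var=False).pvalue  # Welch t-test
p_rank = stats.mannwhitneyu(treated, control).pvalue                 # rank-based test
keep = np.abs(outcome - outcome.mean()) < 2 * outcome.std()          # drop "outliers"
p_trimmed = stats.ttest_ind(outcome[(treatment == 1) & keep],
                            outcome[(treatment == 0) & keep]).pvalue

for name, p in [("pooled t-test", p_equal_var), ("Welch t-test", p_welch),
                ("Mann-Whitney", p_rank), ("t-test after outlier exclusion", p_trimmed)]:
    print(f"{name:<30} p = {p:.3f}")
```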

Instead, we believe it is entirely acceptable to publish a paper featuring a result with, say, a p-value of 0.2 or a 90% confidence interval that includes zero provided it is relevant to a theoretical or applied question of interest and the interpretation is sufficiently accurate. It should also be possible to publish a result with, say, a p-value of 0.001 without this being taken to imply the truth of some favored alternative hypothesis.

The p-value is relevant to the question of how easily a result could be explained by a particular null model, but there is no reason this should be the crucial factor in publication. A result can be consistent with a null model but still be relevant to science or policy debates, and a result can reject a null model without offering anything of scientific interest or policy relevance.

In sum, editors and reviewers can and should feel free to accept papers and present readers with the relevant evidence. We would much rather see a paper that, for example, states that there is weak evidence for an interesting finding but that existing data remain consistent with null effects than for the publication process to screen out such findings or encourage authors to cheat to obtain statistical significance.

4.4 Abandoning Statistical Significance Outside Scientific Publishing

While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where nongovernmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.

While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

That said, we acknowledge that thresholds—of a nonstatistical variety—may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.
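
A tiny sketch of this sort of decision rule (the customer-level numbers and the response model are entirely assumed):

```python
# Hypothetical sketch: send the offer only to customers whose modeled expected
# profit exceeds zero; the threshold is on expected profit, not on a p-value.
import numpy as np

rng = np.random.default_rng(6)
n_customers = 5
offer_cost = 2.00                                        # cost of sending the offer
p_respond = rng.uniform(0.01, 0.20, size=n_customers)    # modeled response probability
revenue_if_respond = rng.uniform(5, 60, size=n_customers)

expected_profit = p_respond * revenue_if_respond - offer_cost
for i, profit in enumerate(expected_profit):
    decision = "send" if profit > 0 else "do not send"
    print(f"customer {i}: expected profit = {profit:+.2f} -> {decision}")
```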

Even in pure research scenarios where there is no obvious cost-benefit calculation—for example, a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.

While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.
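
One minimal sketch of what this could look like (a crude normal-normal partial-pooling calculation with made-up estimates and standard errors; in practice one would fit a proper multilevel or meta-analytic model):

```python
# Sketch: rank candidate leads by partially pooled (shrunken) effect estimates
# from a simple normal-normal model rather than by p-value thresholds.
# Estimates, standard errors, and the crude moment estimator are illustrative.
import numpy as np

estimates = np.array([12.0, 8.0, 3.0, 15.0, 1.0])    # noisy estimates for 5 leads
std_errors = np.array([6.0, 3.0, 1.0, 7.0, 0.8])

grand_mean = np.average(estimates, weights=1 / std_errors**2)
tau2 = max(np.var(estimates) - np.mean(std_errors**2), 0.0)   # between-lead variance

shrinkage = tau2 / (tau2 + std_errors**2)
shrunken = grand_mean + shrinkage * (estimates - grand_mean)

for i in np.argsort(-shrunken):
    print(f"lead {i}: raw = {estimates[i]:5.1f} (SE {std_errors[i]:4.1f}), "
          f"partially pooled = {shrunken[i]:5.1f}")
# Noisy extreme estimates are pulled strongly toward the overall mean;
# precisely estimated leads move little, so the ranking can change.
```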

We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman Citation2015, Citation2017; McShane and Böckenholt Citation2017, Citation2018).

4.5 Getting From Here to There

How do we get from here—NHST, deterministic summaries, overconfidence in results, and statistical analysis focused on reporting just some of the data—to there—statistical analysis and interpretation of results that accepts uncertainty and embraces variation and that features full reporting of results rather than focusing on whatever happens to exceed some statistical threshold?

We have offered the recommendations that we believe will serve researchers best. However, we recognize that research takes place within an institutional structure that often encourages behavior that is counter to these recommendations. Researchers respond to the expectations of funding agencies in study design and editors and reviewers in writing. Conversely, funding agencies must choose among the submissions they receive and editors can only publish papers that are submitted to them. A careful research proposal that openly grapples with uncertainty may unfortunately lose out in the funding competition to a more traditional proposal that blithely promises 80% power based on selected and overestimated effect sizes. Similarly, a paper that presents all the data without making inappropriate claims of certainty may not get published in a journal that also receives submissions in which statistically significant results are presented at face value.

These institutional problems are difficult and we do not propose solutions to them. We imagine improvement will come in fits and starts, in several parallel tracks, all of which we and others have tried to contribute to in our applied and methodological research: improved statistical methods that move beyond NHST and include multilevel modeling, machine learning, statistical graphics, and other tools for analyzing and visualizing large amounts of data; applied examples using these improved methods, thereby demonstrating that it is possible to perform successful statistical analyses without aiming for deterministic results; theoretical work on the statistical effects of selection based on statistical significance and other decision criteria; and criticism of published work with gross overestimates of effect sizes or inappropriate claims of certainty. While we recognize change will likely require institutional reform including major modifications of current practices of funding agencies and editors and reviewers, we are also optimistic that some combination of recognition of error and awareness of alternatives can also motivate change.

5 Discussion

In this article, we have proposed to abandon statistical significance and offered recommendations for how this can be implemented in the scientific publication process as well as in statistical decision making more broadly. We reiterate that we have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors.

While our proposal to abandon statistical significance may seem on the surface quite radical, at least one aspect of it—to treat p-values or other purely statistical measures continuously rather than in a thresholded manner—most certainly is not. Indeed, this was advocated by Fisher himself (Fisher Citation1956; Greenland and Poole Citation2013) as well as by other early and eminent statisticians including Pearson (Hurlbert and Lombardi Citation2009), Cox (Cox Citation1977, Citation1982), and Lehmann (Lehmann Citation1993; Senn Citation2001). It has also been advocated outside of statistics over the decades (see, e.g., Boring Citation1919; Eysenck Citation1960; Skipper, Guenther, and Nass Citation1967) and recently (see, e.g., Drummond Citation2015; Lemoine et al. Citation2016; Amrhein, Korner-Nievergelt, and Roth Citation2017; Greenland Citation2017; Amrhein and Greenland Citation2018). Finally, it is fully consistent with the recent American Statistical Association (ASA) Statement on Statistical Significance and p-values (“Principle 3: Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold;” Wasserstein and Lazar Citation2016). In sum, this aspect of our proposal is part of a long literature both inside and outside of statistics over the decades that stands in direct opposition to the threshold-based status quo and the Benjamin et al. (Citation2018) proposal.

Where our proposal might move beyond this literature is in three ways. First, we suggest that p-values or other purely statistical measures, thresholded or not, should not take priority over the currently subordinate factors (that said, others too have emphasized this including the recent ASA Statement which advises that “researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis” and cautions that “no single index should substitute for scientific reasoning;” Wasserstein and Lazar Citation2016). Second, as discussed above, while we believe treating the p-value continuously rather than in a thresholded manner constitutes an improvement, we go further and argue that it seldom makes sense to calibrate evidence as a function of the p-value or other purely statistical measures. Third, we offer recommendations for authors as well as editors and reviewers for how our proposal to abandon statistical significance can be implemented in the scientific publication process as well as in statistical decision making more broadly.

Our recommendations will not themselves resolve the replication crisis, but we believe they will have the salutary effect of pushing researchers away from the pursuit of irrelevant statistical targets and toward understanding of theory, mechanism, and measurement. We also hope they will push them to move beyond the paradigm of routine “discovery,” and binary statements about there being “an effect” or “no effect,” to one of continuous and inevitably flawed learning that is accepting of uncertainty and variation.

Acknowledgment

We thank the National Science Foundation, the Institute for Education Sciences, and the Office of Naval Research for partial support of Andrew Gelman’s work.

References

  • Amrhein, V., and Greenland, S. (2018), “Remove, Rather Than Redefine, Statistical Significance,” Nature Human Behaviour, 2, 4. DOI:10.1038/s41562-017-0224-0.
  • Amrhein, V., Korner-Nievergelt, F., and Roth, T. (2017). “The Earth is Flat (p > 0.05): Significance Thresholds and the Crisis of Unreplicable Research,” PeerJ, 5, e3544. DOI:10.7717/peerj.3544.
  • Anderson, D. R., Burnham, K. P., and Thompson, W. L. (2000), “Null Hypothesis Testing: Problems, Prevalence, and an Alternative,” Journal of Wildlife Management, 64, 912–923. DOI:10.2307/3803199.
  • Bakan, D. (1966), “The Test of Significance in Psychological Research,” Psychological Bulletin, 66(6), 423–437. DOI:10.1037/h0020412.
  • Bem, D. J. (2011), “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect,” Journal of Personality and Social Psychology, 100, 407–425. DOI:10.1037/a0021524.
  • Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., and Cesarini, D. (2018), “Redefine Statistical Significance,” Nature Human Behaviour, 2, 6–10. DOI:10.1038/s41562-017-0189-z.
  • Berger, J. O., and Sellke, T. (1987), “Testing a Point Null Hypothesis: The Irreconciliability of p Values and Evidence,” Journal of the American Statistical Association, 82, 112–122. DOI:10.2307/2289131.
  • Berkson, J. (1938), “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test,” Journal of the American Statistical Association, 33, 526–536. DOI:10.1080/01621459.1938.10502329.
  • Boring, E. G. (1919), “Mathematical vs. Scientific Significance,” Psychological Bulletin, 16, 335–338. DOI:10.1037/h0074554.
  • Briggs, W. M. (2016), Uncertainty: The Soul of Modeling, Probability and Statistics, New York: Springer.
  • Carlin, J. B. (2016), “Is Reform Possible Without a Paradigm Shift?” The American Statistician, 70, 10 (supplemental material to the ASA statement on p-values and statistical significance).
  • Carney, D. R., Cuddy, A. J., and Yap, A. J. (2010), “Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance,” Psychological Science, 21, 1363–1368. DOI:10.1177/0956797610383437.
  • Cochran, W. G. (1976), “Early Development of Techniques in Comparative Experimentation,” in On the History of Statistics and Probability, New York: Dekker.
  • Cohen, J. (1994), “The Earth is Round (p < .05),” American Psychologist, 49, 997–1003.
  • Cowles, M., and Davis, C. (1982), “On the Origins of the .05 Level of Statistical Significance,” American Psychologist, 37, 553–558.
  • Cox, D. R. (1977), “The Role of Significance Tests,” Scandinavian Journal of Statistics, 4, 49–70.
  • Cox, D. R. (1982), “Statistical Significance Tests,” British Journal of Clinical Pharmacology, 14, 325–331. DOI:10.1111/j.1365-2125.1982.tb01987.x.
  • Cramer, H. (1955), The Elements of Probability Theory, New York: Wiley.
  • Drummond, G. (2015), “Most of the Time, P Is an Unreliable Marker, So We Need No Exact Cut-Off,” British Journal of Anaesthesia, 116, 894–894. DOI:10.1093/bja/aew146.
  • Edwards, W., Lindman, H., and Savage, L. J. (1963), “Bayesian Statistical Inference for Psychological Research,” Psychological Review, 70, 193. DOI:10.1037/h0044139.
  • Eysenck, H. J. (1960), “The Concept of Statistical Significance and the Controversy About One-Tailed Tests,” Psychological Review, 67, 269. DOI:10.1037/h0048412.
  • Fisher, R. A. (1926), “The Arrangement of Field Experiments,” Journal of the Ministry of Agriculture, 33, 503–513.
  • Fisher, R. A. (1956), Statistical Methods and Scientific Inference, New York: Hafner Publishing Co.
  • Freeman, P. R. (1993), “The Role of p-Values in Analysing Trial Results,” Statistics in Medicine, 12, 1443–1452.
  • Gelman, A. (2015), “The Connection Between Varying Treatment Effects and the Crisis of Unreplicable Research: A Bayesian Perspective,” Journal of Management, 41, 632–643. DOI:10.1177/0149206314525208.
  • Gelman, A. (2016), “The Problems With p-Values Are Not Just With p-Values,” The American Statistician, 70, 10 (supplemental material to the ASA statement on p-values and statistical significance).
  • Gelman, A. (2017), “The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to do About It,” Personality and Social Psychology Bulletin, 44, 16–23. DOI:10.1177/0146167217729162.
  • Gelman, A., and Auerbach, J. (2016a), “Age-Aggregation Bias in Mortality Trends,” Proceedings of the National Academy of Sciences of the United States of America, 113, E816–E817. DOI:10.1073/pnas.1523465113.
  • Gelman, A., and Auerbach, J. (2016b), “Mortality Trends by Race/Ethnicity, Sex, Age and State,” Technical Report, Columbia University.
  • Gelman, A., and Carlin, J. (2014), “Beyond Power Calculations Assessing Type s (Sign) and Type m (magnitude) Errors,” Perspectives on Psychological Science, 9, 641–651. DOI:10.1177/1745691614551642.
  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014), Bayesian Data Analysis (3rd ed.), Boca Raton, FL: Chapman and Hall/CRC.
  • Gelman, A., and Loken, E. (2014), “The Statistical Crisis in Science,” American Scientist, 102, 460–465. DOI:10.1511/2014.111.460.
  • Gelman, A., and Robert, C. P. (2014), “Revised Evidence for Statistical Standards,” Proceedings of the National Academy of Sciences of the United States of America, 111, E1933–E1933. DOI:10.1073/pnas.1322995111.
  • Gelman, A., and Stern, H. (2006), “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant,” The American Statistician, 60, 328–331. DOI:10.1198/000313006X152649.
  • Gigerenzer, G. (1987), The Probabilistic Revolution, Vol. II: Ideas in the Sciences, Cambridge, MA: MIT Press.
  • Gigerenzer, G. (2004), “Mindless Statistics,” Journal of Socio-Economics, 33, 587–606. DOI:10.1016/j.socec.2004.09.033.
  • Gigerenzer, G. (2018), “Statistical Rituals: The Replication Delusion and How We Got There,” Advances in Methods and Practices in Psychological Science, 1, 198–218. DOI:10.1177/2515245918771329.
  • Gigerenzer, G., Krauss, S., and Vitouch, O. (2004), “The Null Ritual: What You Always Wanted to Know About Null Hypothesis Testing But Were Afraid to Ask,” in Handbook on Quantitative Methods in the Social Sciences, Thousand Oaks, CA: Sage Publications, Inc., pp. 389–406.
  • Gill, J. (1999), “The Insignificance of Null Hypothesis Significance Testing,” Political Research Quarterly, 52, 647–674. DOI:10.1177/106591299905200309.
  • Greenland, S. (2017), “Invited Commentary: The Need for Cognitive Science in Methodology,” American Journal of Epidemiology, 186, 639–646. DOI:10.1093/aje/kwx259.
  • Greenland, S., and Poole, C. (2013), “Living With Statistics in Observational Research,” Epidemiology, 24, 73–78. DOI:10.1097/EDE.0b013e3182785a49.
  • Haller, H., and Krauss, S. (2002), “Misinterpretations of Significance: a Problem Students Share With Their Teachers?,” Methods of Psychological Research, 7, 1–20, http://www.mpr-online.de.
  • Holman, C. J., Arnold-Reed, D. E., de Klerk, N., McComb, C., and English, D. R. (2001), “A Psychometric Experiment in Causal Inference to Estimate Evidential Weights Used by Epidemiologists,” Epidemiology, 12, 246–255. DOI:10.1097/00001648-200103000-00019.
  • Hubbard, R. (2004), “Alphabet Soup: Blurring the Distinctions Between p’s and α’s in Psychological Research,” Theory and Psychology, 14, 295–327. DOI:10.1177/0959354304043638.
  • Hubbard, R., and Lindsay, R. M. (2008), “Why p Values Are Not a Useful Measure of Evidence in Statistical Significance Testing,” Theory and Psychology, 18, 69–88. DOI:10.1177/0959354307086923.
  • Hunter, J. E. (1997), “Needed: A Ban on the Significance Test,” Psychological Science, 8, 3–7. DOI:10.1111/j.1467-9280.1997.tb00534.x.
  • Hurlbert, S. H., and Lombardi, C. M. (2009), “Final Collapse of the Neyman–Pearson Decision Theoretic Framework and Rise of the Neofisherian,” Annales Zoologici Fennici, 46, 311–349. DOI:10.5735/086.046.0501.
  • Ioannidis, J. P. A. (2005), “Why Most Published Research Findings Are False,” PLoS Medicine, 2, e124. DOI:10.1371/journal.pmed.0020124.
  • Johnson, V. E. (2013a), “Revised Standards for Statistical Evidence,” Proceedings of the National Academy of Sciences of the United States of America, 110, 19313–19317. DOI:10.1073/pnas.1313476110.
  • Johnson, V. E. (2013b), “Uniformly Most Powerful Bayesian Tests,” Annals of Statistics, 41, 1716–1741. DOI:10.1214/13-AOS1123.
  • Kamary, K., Mengersen, K., Robert, C., and Rousseau, J. (2014), “Testing Hypotheses as a Mixture Estimation Model,” Technical Report, https://arxiv.org/pdf/1214.4436.pdf.
  • Lehmann, E. L. (1993), Testing Statistical Hypotheses, New York: Chapman and Hall.
  • Lemoine, N. P., Hoffman, A., Felton, A. J., Baur, L., Chaves, F., Gray, J., Yu, Q., and Smith, M. D. (2016), “Underappreciated Problems of Low Replication in Ecological Field Studies,” Ecology, 97, 2554–2561. DOI:10.1002/ecy.1506.
  • McCloskey, D. N., and Ziliak, S. (1996), “The Standard Error of Regression,” Journal of Economic Literature, 34, 97–114.
  • McShane, B. B., and Böckenholt, U. (2014), “You Cannot Step Into the Same River Twice: When Power Analyses Are Optimistic,” Perspectives on Psychological Science, 9, 612–625. DOI:10.1177/1745691614548513.
  • McShane, B. B., and Böckenholt, U. (2017), “Single Paper Meta-Analysis: Benefits for Study Summary, Theory-Testing, and Replicability,” Journal of Consumer Research, 43, 1048–1063.
  • McShane, B. B., and Böckenholt, U. (2018), “Multilevel Multivariate Meta-Analysis With Application to Choice Overload,” Psychometrika, 83, 255–271. DOI:10.1007/s11336-017-9571-z.
  • McShane, B. B., and Gal, D. (2016), “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence,” Management Science, 62, 1707–1718. DOI:10.1287/mnsc.2015.2212.
  • McShane, B. B., and Gal, D. (2017), “Statistical Significance and the Dichotomization of Evidence,” Journal of the American Statistical Association, 112, 885–895. DOI:10.1080/01621459.2017.1289846.
  • Meehl, P. E. (1978), “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology,” Journal of Consulting and Clinical Psychology, 46, 806–834. DOI:10.1037/0022-006X.46.4.806.
  • Meehl, P. E. (1990), “Why Summaries of Research on Psychological Theories Are Often uninterpretable,” Psychological Reports, 66, 195–244. DOI:10.2466/pr0.1990.66.1.195.
  • Mitchell, S., Gelman, A., Ross, R., Chen, J., Bari, S., Huynh, U. K., Harris, M. W., Sachs, S. E., Stuart, E. A., Feller, A., and Makela, S. (2018), “The Millennium Villages Project: A Retrospective, Observational, Endline Evaluation,” The Lancet Global Health, 6, e500–e513. DOI:10.1016/S2214-109X(18)30065-2.
  • Morrison, D. E., and Henkel, R. E. (1970), The Significance Test Controversy, Chicago: Aldine.
  • Oakes, M. (1986), Statistical Inference: A Commentary for the Social and Behavioral Sciences, New York: Wiley.
  • Pericchi, L., Pereira, C. A., and Pérez, M.-E. (2014), “Adaptive Revised Standards for Statistical Evidence,” Proceedings of the National Academy of Sciences of the United States of America, 111, E1935–E1935. DOI:10.1073/pnas.1322191111.
  • Resnick, B. (2017), “What a Nerdy Debate About p-Values Shows About Science—And How to Fix It,” Technical Report.
  • Rosnow, R. L., and Rosenthal, R. (1989), “Statistical Procedures and the Justification of Knowledge in Psychological Science,” American Psychologist, 44, 1276–1284. DOI:10.1037/0003-066X.44.10.1276.
  • Rozeboom, W. W. (1960), “The Fallacy of the Null-Hypothesis Significance Test,” Psychological Bulletin, 57, 416–428.
  • Sawyer, A. G., and Peter, J. P. (1983), “The Significance of Statistical Significance Tests in Marketing Research,” Journal of Marketing Research, 20, 122–133. DOI:10.1177/002224378302000203.
  • Schmidt, F. L. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers,” Psychological Methods, 1, 115–129. DOI:10.1037/1082-989X.1.2.115.
  • Senn, S. S. (2001), “Two Cheers for p-Values?,” Journal of Epidemiology and Biostatistics, 6, 193–204.
  • Serlin, R. C., and Lapsley, D. K. (1993), “Rational Appraisal Psychological Research and the Good Enough Principle,” in A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, Hillsdale, NJ: Lawrence Erlbaum Associates.
  • Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22, 1359–1366. DOI:10.1177/0956797611417632.
  • Skipper, J. K., Guenther, A. L., and Nass, G. (1967), “The Sacredness of .05: A Note Concerning the Uses of Statistical Levels of Significance in Social Science,” The American Sociologist, 5, 16–18.
  • Smaldino, P. E., and McElreath, R. (2016), “The Natural Selection of Bad Science,” Technical Report, https://arxiv.org/pdf/1605.09511v1.pdf.
  • Tackett, J. L., Kushner, S. C., Herzhoff, K., Smack, A. J., and Reardon, K. W. (2014), “Viewing Relational Aggression Through Multiple Lenses: Temperament, Personality, and Personality Pathology,” Development and Psychopathology, 26, 863–877. DOI:10.1017/S0954579414000443.
  • Trangucci, R., Ali, I., Gelman, A., and Rivers, D. (2018), “Voting Patterns in 2016: Exploration Using Multilevel Regression and Poststratification (MRP) on Pre-Election Polls,” arXiv preprint arXiv:1802.00842.
  • Tukey, J. W. (1991), “The Philosophy of Multiple Comparisons,” Statistical Science, 6, 100–116. DOI:10.1214/ss/1177011945.
  • Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133. DOI:10.1080/00031305.2016.1154108.
  • Yule, G. U., and Kendall, M. G. (1950), An Introduction to the Theory of Statistics (14th ed.), London: Griffin.

Appendix A

Problems With Uniformly Most Powerful Bayesian Tests

The mathematical justification underlying the Benjamin et al. (Citation2018) proposal has come under no small amount of criticism. Specifically, the UMPBTs that underlie the proposal were introduced and defended by Johnson (Citation2013b) in parallel with his call in Johnson (Citation2013a)—and now repeated in Benjamin et al. (Citation2018)—to use 0.005 as the new threshold. We see a number of concerns with UMPBTs.

First, and perhaps most relevant for the biomedical and social sciences, the UMPBT approach is deeply entrenched in the century-old Neyman–Pearson formalism of binary decisions and 0–1 loss functions. As Pericchi, Pereira, and Pérez (Citation2014) note, “the essence of the problem of classical testing of significance lies in its goal of minimizing Type II error (false negative) for a fixed Type I error (false positive).” While this formalism allows for mathematical optimization under some restricted collection of distributions and testing problems, it is quite rudimentary from a decision-theoretic point of view, even to the point of failing to serve most of the purposes for which a sharp point null hypothesis test is conducted.

More specifically, the 0–1 loss function implicit in the NHST paradigm does not in general map, even in an approximate way, to processes of scientific learning or costs and benefits. Consequently, the logic underlying the proposal to move to a lower p-value threshold avoids firmly confronting the nature of the issue: any such rule implicitly expresses a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes (Gelman and Robert Citation2014) which in turn depend on the problem at hand and which vary tremendously across studies and domains. Instead, the UMPBT is based on a minimax prior that does not correspond to any distribution of effect sizes but rather represents a worst case scenario under a set of mathematical assumptions.
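
To see concretely why the tradeoff cannot be fixed once and for all, consider a stylized decision-theoretic illustration of ours (not part of the UMPBT framework): with loss $c_{\mathrm{I}}$ for a false positive, loss $c_{\mathrm{II}}$ for a false negative, and zero loss for correct decisions, the rule that minimizes posterior expected loss is

$$\text{reject } H_0 \;\Longleftrightarrow\; c_{\mathrm{I}}\, P(H_0 \mid y) < c_{\mathrm{II}}\, P(H_1 \mid y) \;\Longleftrightarrow\; P(H_1 \mid y) > \frac{c_{\mathrm{I}}}{c_{\mathrm{I}} + c_{\mathrm{II}}}.$$

The implied evidence threshold is a function of the cost ratio alone, and because costs and benefits differ enormously across studies and domains, no single cutoff, whether 0.05 or 0.005, can encode this tradeoff in an absolute and acontextual way.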

Second, there is no reason for non-Bayesians to adopt UMPBTs when they can instead rely on the standard Neyman–Pearson approach to uniformly most powerful (non-Bayesian) tests.

Third, making the procedure depend on a threshold (γ in the notation of Johnson (Citation2013b)) replicates the fundamental difficulty of the century-old Fisherian approach to hypothesis testing. To further seek full agreement with the classical rejection region, as advocated by Johnson (Citation2013b), simply negates the appeal of a truly Bayesian approach to this issue; moreover, such agreement is impossible to achieve for realistic statistical models.

Fourth, the construction of a UMPBT relies on the assumption of a “true” prior, which can be criticized in the vast majority of cases and which in any case moves one away from the Bayesian paradigm: with a single and “true” prior, the Bayesian model becomes an errors-in-variables model.

Fifth, the argument to maximize a probability for the Bayes factor to exceed a certain threshold also moves one away from the Bayesian paradigm because: (i) it ignores the motives for running the NHST and the subsequent steps taken in decision making or inference; (ii) it further negates any prior modeling of the alternative hypothesis aimed at separating the parameter space into regions of different (prior) likelihood; (iii) it does not condition upon the actual observations but instead integrates over the observation space and hence may fall afoul of the likelihood principle; (iv) it posits a single and fixed threshold γ for rejecting the null when there is no reason for γ not to depend on the observed data, as also argued above; (v) the maximization step eliminates the role of the prior distribution, as also argued above; (vi) in the rare one-dimensional settings where the maximization step can be conducted in closed form, the solution is a distribution with finite support; (vii) in the event the null hypothesis is rejected, the uniformly most powerful prior (or alt-prior) corresponding to the alternative cannot be used as such in subsequent inference but must instead be replaced with a regular prior over the whole parameter space—a strong violation of Bayesian coherence.

Sixth, speaking more generally, the concept of uniformly most powerful priors (and tests) does not easily extend to multivariate settings and even less to realistic cases that involve complex null hypotheses that contain nuisance parameters. The first solution proposed in Johnson (Citation2013b), to integrate out the nuisance parameters in the null hypothesis using a specific prior distribution, falls short of solving the issue of “objective Bayesian tests.” The second solution, namely to replace the unknown nuisance parameters with standard estimates, stands even farther from a Bayesian perspective.

Indeed, the Bayes factor itself is a consequence of the rudimentary Neyman–Pearson formalism, which as such caters to the issue of statistical significance. A discussion of the difficulties with this from a Bayesian perspective is provided in Kamary et al. (Citation2014), with a proposal of setting the hypothesis problem as one of mixture estimation.

Seventh, Johnson (Citation2013b) contains very little support for the asymptotic relevance of the approach, beyond the limiting normal distribution of the uniformly most powerful log Bayes factor and the convergence of the support to the “true” value of the parameter.

In closing, we note that many of our criticisms of the Johnson (Citation2013b) approach relate to the fact that it falls short of being truly Bayesian. However, we do not mean to say that hypothesis testing must be done in a Bayesian manner. Rather, we emphasize this because, to the extent that the Johnson (Citation2013b) approach loses its Bayesian connection, it also loses a Bayesian justification for the 0.005 rule. Consequently, 0.005 becomes just another arbitrary threshold, justified by some implicit tradeoff between false positives and negatives which we think does not make sense in any absolute and acontextual way.

Appendix B

Case Study

In the context of a hypothetical case study on the effects of sodium on blood pressure, we discuss how authors as well as editors and reviewers might follow our recommendation to demote the p-value from its threshold screening role and instead treat it continuously along with the currently subordinate factors—related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain—as just one among many pieces of evidence.

We recommend authors use the currently subordinate factors to motivate their data collection, statistical analysis, interpretation of results, writing, and related matters. In this example, the authors might consider related prior evidence that indicates the importance of blood pressure as a marker for healthy arteries, suggests the role of sodium in hemodynamics, and so forth. This evidence might also reveal a plausible mechanism, namely that, to excrete excess sodium, the body must increase blood pressure.

In terms of study design and data quality, the authors might consider various possibilities for data collection. How should they recruit subjects? Should they randomize them to a low-sodium versus high-sodium diet? Or should they track them longitudinally, say via routine annual checkups over the course of years? Or are such data already available from some prior study? When and how often should sodium and blood pressure be measured? And how? The authors might measure sodium through a dietary recall questionnaire (noisy), through asking participants to maintain a food diary (somewhat less noisy), or through collection of urine to measure urinary sodium excretion (precise but restricted to a limited time point). Likewise, for blood pressure, they might rely on measurements conducted by someone convenient, such as friends or family members of the subjects, who likely lack formal clinical training (noisy), or by paid clinicians instructed on the proper protocol for blood pressure measurement (precise but expensive).
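
Why measurement quality matters for the eventual estimates can be made concrete with a small simulation. The sketch below uses entirely hypothetical numbers and variable names of our own choosing; it shows the familiar attenuation of a regression slope as the sodium measurement becomes noisier:

```python
# Illustrative simulation (hypothetical numbers): noisier sodium measurement
# attenuates the estimated sodium-blood pressure association.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_slope = 2.0                                       # hypothetical mmHg increase per unit sodium

sodium = rng.normal(0, 1, n)                           # "true" sodium intake (standardized)
bp = 120 + true_slope * sodium + rng.normal(0, 10, n)  # blood pressure

for label, noise_sd in [("urinary excretion (precise)", 0.1),
                        ("food diary (somewhat noisy)", 0.5),
                        ("dietary recall (noisy)", 1.0)]:
    measured = sodium + rng.normal(0, noise_sd, n)     # error-laden measurement
    slope = np.polyfit(measured, bp, 1)[0]             # simple least-squares slope
    print(f"{label:30s} estimated slope = {slope:.2f} (true = {true_slope})")
```

With the noisiest measurement, the estimated association is roughly half the true one in this simulation, which is precisely the kind of design consideration that should inform both the analysis and the interpretation of its results.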

Suppose the authors hypothesize a positive association between sodium consumption and high blood pressure. For the moment, let us assume that—while eschewing the NHST paradigm and the p-value thresholds intrinsic to it—the authors nonetheless perform a statistical analysis that results in a p-value. Further, let us assume they obtain a p-value of 0.001. How should this impact their interpretation of results, writing, and statistical decision making more broadly? Certainly, they have gained support for their hypothesis. However, can they conclude sodium is associated with—or even causes—high blood pressure as they would under the NHST paradigm?

Well, it would depend on the context and limitations of the study design and data quality. For example, supposing the study took place in Japan, perhaps the association exists in the Japanese subject population studied but not in European populations, whether because of genetic differences between the two populations or because of dietary differences (e.g., dietary sodium levels are much higher among the Japanese, so the association might not hold at levels typical among Europeans).

In terms of a causal interpretation, this would depend on related prior evidence, plausibility of mechanism, and study design and data quality. If prior studies show consistent and strong associations between sodium consumption and blood pressure, if evidence from physiological studies and animal models is consistent with an effect of sodium consumption on blood pressure, or if sodium levels were randomized, this increases the support for a causal role of sodium in increasing blood pressure.

Given, say, that the causal interpretation holds and holds broadly, the authors could then consider clinical significance, that is, real world costs and benefits. This depends not at all on a p-value but on the estimates of the magnitude of the effect—not only on blood pressure but also on downstream outcomes such as cardiovascular disease—as well as the uncertainty in them. It also depends on the costs of potential interventions such as lower sodium diets and drugs. They could also discuss novelty of finding in light of all of the above.
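
As an illustration of this kind of reasoning, the following back-of-the-envelope sketch uses entirely hypothetical numbers (the posterior for the blood pressure effect, the baseline risk, and the mapping from blood pressure to cardiovascular risk are all assumptions of ours) to show how estimated magnitudes and their uncertainty, rather than a p-value, drive an assessment of real world costs and benefits:

```python
# Entirely hypothetical numbers: clinical significance hinges on the effect
# magnitude and its uncertainty plus downstream consequences, not on a p-value.
import numpy as np

rng = np.random.default_rng(1)
draws = 100_000

# Assumed posterior for the diet's effect on systolic blood pressure (mmHg reduction)
bp_reduction = rng.normal(loc=2.0, scale=1.0, size=draws)

# Assumed mapping: each 1 mmHg reduction lowers 10-year cardiovascular risk by 2%
baseline_risk = 0.10
events_averted = 10_000 * baseline_risk * 0.02 * bp_reduction  # per 10,000 people

lo, mid, hi = np.percentile(events_averted, [2.5, 50, 97.5])
print(f"events averted per 10,000 over 10 years: {mid:.0f} "
      f"(95% interval {lo:.0f} to {hi:.0f})")
```

Whether the resulting range of benefits justifies a given intervention then depends on its costs, which again is a judgment no p-value can make on its own.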

Now, let us assume they had instead obtained a p-value of 0.2. Can they conclude sodium is not associated with high blood pressure as they would under the NHST paradigm? Again, this would depend on all the factors discussed above. For example, perhaps the association does not exist in the Japanese population but does in European ones and so on.

There are two key points here. First, results need not first have a p-value or some other purely statistical measure that attains some threshold before consideration is given to the currently subordinate factors. Instead, and as illustrated above, statistical measures should be considered along with the currently subordinate factors as just one among many pieces of evidence and should not take priority, thereby yielding a more holistic view of the evidence. Second, statistical measures should be treated continuously in this more holistic view of the evidence. Specifically, a lower p-value constitutes continuously stronger evidence—and this holds regardless of the level of the p-value. Further, this continuously stronger evidence can be weighed along with the strengths and weaknesses of the currently subordinate factors in assessing the level of support for a hypothesis.
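
The continuity point can be seen in a couple of lines of code. In the sketch below (an illustration of ours, not an analysis from the hypothetical study), each p-value is converted to the two-sided z-statistic it implies; nothing special happens in the neighborhood of 0.05:

```python
# The p-value is a continuous measure: neighboring p-values correspond to
# nearly identical test statistics, so 0.049 versus 0.051 is not a meaningful divide.
from scipy.stats import norm

for p in [0.20, 0.06, 0.051, 0.049, 0.01, 0.001]:
    z = norm.isf(p / 2)  # two-sided z-statistic implied by this p-value
    print(f"p = {p:<6} ->  |z| = {z:.2f}")
```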

Of course, we believe not only that the authors’ statistical analysis should not be restricted to the NHST paradigm and the p-value thresholds intrinsic to it but also that it need not—and often should not—even result in a p-value (i.e., because it seldom makes sense to calibrate evidence as a function of the p-value). As noted, we recommend authors report all of their data and relevant results rather than focusing on single comparisons that attain some p-value or other statistical threshold. In this context, this might involve modeling the association between sodium and blood pressure as a function of additional health and dietary variables, demographic variables, and geography using a multilevel model. Such a model would not yield a single p-value that invites dichotomous declarations of truth or falsity—binary statements about there being “an effect” or “no effect.” Instead, it would yield many estimates that vary based on, for example, health and dietary variables, demographic variables, and geography, as well as the uncertainty in these estimates. Indeed, by accepting uncertainty and embracing variation in effects, the authors would uncover and present a much richer and more nuanced story about the association between sodium and blood pressure.
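
To fix ideas, here is a minimal sketch of the kind of multilevel analysis described above. It uses simulated data and the MixedLM routine in the statsmodels Python library; all variable names, group structure, and numbers are illustrative assumptions of ours rather than features of any real study:

```python
# Minimal multilevel sketch: varying intercepts and varying sodium slopes by region.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_regions, n_per = 8, 250
region = np.repeat(np.arange(n_regions), n_per)
region_slope = rng.normal(2.0, 0.8, n_regions)   # sodium effect varies by region
sodium = rng.normal(0, 1, n_regions * n_per)
age = rng.uniform(30, 70, n_regions * n_per)
bp = (110 + 0.3 * age + region_slope[region] * sodium
      + rng.normal(0, 8, n_regions * n_per))

df = pd.DataFrame({"bp": bp, "sodium": sodium, "age": age, "region": region})

# Random intercepts and random sodium slopes across regions
model = smf.mixedlm("bp ~ sodium + age", data=df,
                    groups=df["region"], re_formula="~sodium")
fit = model.fit()
print(fit.summary())   # a set of estimates and uncertainties, not a single p-value
```

The fitted model reports region-level intercepts and sodium slopes together with their uncertainties, which is precisely the kind of output that encourages discussing how the association varies rather than declaring it simply present or absent.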

Turning to editors and reviewers, we recommend they explicitly evaluate papers with regard to not only purely statistical measures but also the currently subordinate factors. How might this work? We envision it would be rather similar to the above but in reverse. Specifically, editors and reviewers evaluating the authors’ paper on sodium and blood pressure would systematically assess, and possibly even indicate the weight they assign to, each of the following: How does the paper fit in with and build upon related prior evidence? Is the mechanism plausible? Are the study design and data quality sufficient to justify the conclusions? What are the implications in terms of real world costs and benefits? How novel are the findings? And, of course, how appropriate are the statistical analyses employed and how strong is the statistical support, whether in the form of a p-value or some other measure, resulting from these analyses?

In this more holistic view of the evidence, statistical measures are just one among many pieces of evidence considered by editors and reviewers and do not take priority. Of course, this does not mean that they cannot or will not strongly impact or alter evaluation decisions. For example, in the context of the authors’ paper on sodium and blood pressure, strong statistical support, whether in the form of a low p-value or otherwise, for a finding that sodium consumption is associated with low blood pressure—the direction opposite of that indicated by prior evidence—might, in the context of a high quality study design featuring large samples and precise measurements, be deemed more novel and worthy of publication than if the statistical support had been weaker or if the finding had been in the same direction as that indicated by prior evidence.

In sum, authors as well as reviewers and editors need not use statistical significance as a lexicographic decision rule. Results need not first have a p-value or some other purely statistical measure that attains some threshold before consideration is given to the currently subordinate factors. Instead, treated continuously, statistical measures should be considered along with the currently subordinate factors as just one among many pieces of evidence and should not take priority, thereby yielding a more holistic view of the evidence.