
Putting the P-Value in its Place

Pages 122-128 | Received 24 Jan 2018, Accepted 09 Apr 2018, Published online: 20 Mar 2019

ABSTRACT

As the debate over best statistical practices continues in academic journals, conferences, and the blogosphere, working researchers (e.g., psychologists) need to figure out how much time and effort to invest in attending to experts' arguments, how to design their next project, and how to craft a sustainable long-term strategy for data analysis and inference. The present special issue of The American Statistician promises help. In this article, we offer a modest proposal for a continued and informed use of the conventional p-value without the pitfalls of statistical rituals. Other statistical indices should complement reporting, and extra-statistical (e.g., theoretical) judgments ought to be made with care and clarity.

Dr. Livingstone, I presume!

∼ Stanley to Livingstone, after rejecting the hypothesis that Livingstone was not Livingstone

Walk barefoot on the beach; it gets the toxins out! ∼ A well-meaning neighbor

The most striking pair of facts about significance testing is its immense popularity among researchers, and the intensity of the critical opposition. Since modern sampling techniques and frequentist statistics became de rigueur in many fields of scientific study over half a century ago, critics have questioned the wisdom of significance testing in general and the reliance on p-values in particular (Gigerenzer Citation2004; Stigler Citation1999). Significance tests and p-values can be found in many sciences, and thus the context and the nature of the debate vary. In this article, we focus on research typical in experimental psychological science. We discuss statistical inference in the canonical case of the treatment-plus-control experiment designed to test a single, novel, and causal hypothesis with a limited (and often small) sample.

In the prototypical experimental scenario, a researcher, after having done formal and informal (i.e., intuitive) theoretical work, comes to think that a particular stimulus or treatment might have an effect on participants' experience and behavior. To give a simple example, assume that the researcher thinks that taking a break from creative production increases the eventual total output. In the control condition, respondents generate as many diverse answers to a prompt such as “If all schools were abolished, what would you do to become educated?” as they can without taking a break. In the experimental condition, respondents take a short break during which they engage in a cognitively consuming task before resuming their creative efforts (Ostrowski Citation2017; see also Gilhooly, Georgiou, and Devery Citation2012). The statistical question is whether there is a significant difference in creative production between these two conditions.

Many experiments in psychological science take this general form. The properties of the data (e.g., their potential skew, the scale of measurement) help the researcher select an appropriate significance test (e.g., by choosing between parametric and nonparametric tests), compute a test statistic, and ascertain its p-value, that is, the probability of the observed test statistic—or any value more extreme than the observed one—assuming that the null hypothesis of no difference is true. Following convention, the researcher rejects the null hypothesis as being a poor fit with the data if p < 0.05. The cautious researcher infers that there is some (perhaps limited or preliminary) support for the ordinal and imprecise hypothesis that the treatment has an effect; in our example, taking a break from generating diverse responses increases the eventual total number of responses. In our experience, many researchers choose their words carefully when communicating what they have learned from a significant result. Few insist that they have proven their hypothesis, and many note that the data provide some evidence against the null hypothesis.
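
To make the prototypical analysis concrete, the following sketch (ours, not code from any of the cited studies) runs the two kinds of test mentioned above on hypothetical idea counts; the Poisson means, the sample sizes, and the choice between Welch's t-test and the Mann-Whitney U test are assumptions made purely for illustration.

    # A minimal sketch of the canonical two-condition comparison (hypothetical data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=1)
    control = rng.poisson(lam=12, size=30)    # ideas generated without a break
    treatment = rng.poisson(lam=15, size=30)  # ideas generated with a short, consuming break

    # Parametric option: Welch's t-test (does not assume equal variances).
    t_stat, p_t = stats.ttest_ind(treatment, control, equal_var=False)

    # Nonparametric option if the counts look skewed: Mann-Whitney U test.
    u_stat, p_u = stats.mannwhitneyu(treatment, control, alternative="two-sided")

    print(f"Welch t = {t_stat:.2f}, p = {p_t:.4f}")
    print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_u:.4f}")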

Yet, much of the criticism leveled at significance testing, and specifically its most common variant of null hypothesis significance testing, NHST, focuses on the limits of many researchers' knowledge of its logic, its proper use, and the meaning of its concepts (Goodman Citation2008; Greenland et al. Citation2016). This sort of criticism does not challenge the validity of NHST directly, but it invites the audience to consider alternatives. We prefer a clear distinction between the properties of a method and its (flawed) reception by its practitioners. Yet, while focusing on the former, we hope to educate practicing researchers on the latter.

In the first section of the remainder of this short article, we discuss what we consider the primary contested issue regarding the p-values produced by significance testing, which is their ability to support reverse inferences about the truth status of the tested hypothesis given the observed data. In the second section, we ask whether significance testing should be replaced tout court with Bayesian methods. In the third section, we return to significance testing and discuss its implications for different types of decision error and how these errors are viewed. In the fourth and final section, we explore the psychological and sociological context of the current statistical debate and its implications for the researcher in the lab.

1.1. The Predictive Power of p

Chief among the concerns about researchers' ignorance is that they mistake the p-value for the probability of the hypothesis given the data, p(H|D). An informal—but serviceable—interpretation of the p-value is that it is the probability of the data assuming that the (null) hypothesis is true, p(D|H) (Wasserstein and Lazar Citation2016). An answer to the question of whether the p-value is subtly different from p(D|H) lies beyond this article (Greenland et al. Citation2016). Many statisticians accept the idea of a close correspondence, while others insist that the differences are deep. Yet, most statisticians agree that to assume without checking that p(D|H) = p(H|D) is to commit a grave reasoning error, namely the fallacy of reverse inference (Krueger Citation2017). To claim that the null hypothesis is false with a probability of 0.95 because an experiment yielded p = 0.05 is to make an irresponsible statement.

The probability of the hypothesis given the data, p(H|D), is given by Bayes' theorem, which states that p(H|D) = [p(H) × p(D|H)] / [p(H) × p(D|H) + p(∼H) × p(D|∼H)]. The theorem shows that the p-value (or p(D|H)) predicts p(H|D) but that the prior probability of the hypothesis, p(H), also matters, as does the probability of the data under any hypothesis, p(D), which is shown in decomposed form in the denominator. The equality p(H|D) = p(D|H) may occur, but only in the special case of p(H) = p(D) (Dawes Citation1988). Otherwise, the relationship between p(H|D) and p(D|H) is imperfect. Using simulation experiments, we mapped the range of possibilities for this relationship (see Krueger and Heck Citation2017, for details). We can summarize the two main findings thus: First, there is a wide range of positive correlations between the two conditional probabilities; under no circumstance did we find a negative or zero correlation. Second, these correlations are higher to the extent that the simulated conditions reflect the typical context of empirical work. Specifically, it is typically not the case that p(H) and p(D|H) are conditionally independent over experiments. If an experiment is very risky, that is, if the probability of the null hypothesis, p(H), is very high, then it is also very unlikely to find a low p-value. The second-order probability of finding a low value for p(D|H) is low when p(H) is high. As a result, positive correlations between p(D|H) and p(H|D) can be high, even in excess of 0.7. In other words, under realistic conditions of the typical experimental research environment, p-values serve as useful heuristic cues for inferences about the posterior probability of the tested hypothesis, p(H|D).
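
The following simulation sketch illustrates, but does not reproduce, the analyses in Krueger and Heck (Citation2017): it draws many hypothetical two-group experiments, computes the p-value and the posterior probability of the null against a point alternative, and correlates the two. The prior p(H) = 0.5, the alternative effect size of 0.5, and the per-group sample size of 40 are assumptions chosen only for illustration.

    # Illustrative simulation of the association between p-values and p(H|D).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=2)
    n, delta, prior_h = 40, 0.5, 0.5           # per-group n, effect under ~H, prior of the null
    p_values, posteriors = [], []

    for _ in range(5000):
        null_true = rng.random() < prior_h
        effect = 0.0 if null_true else delta
        x = rng.normal(0.0, 1.0, n)
        y = rng.normal(effect, 1.0, n)
        _, p = stats.ttest_ind(y, x)
        d = (y.mean() - x.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        se = np.sqrt(2.0 / n)                  # approximate standard error of d
        like_null = stats.norm.pdf(d, 0.0, se)     # likelihood of the data under H
        like_alt = stats.norm.pdf(d, delta, se)    # likelihood under the point alternative
        posteriors.append(prior_h * like_null / (prior_h * like_null + (1 - prior_h) * like_alt))
        p_values.append(p)

    r = np.corrcoef(p_values, posteriors)[0, 1]
    print(f"Correlation between p and p(H|D): {r:.2f}")   # positive, often substantial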

Our analysis supports the following conclusion: Whereas it is good to warn against equating the p-value with the posterior of the tested hypothesis, it is unwise to suggest that no inference can be made. The absence of a one-to-one correspondence is not the absence of any correspondence. Indeed, the application of Bayes' theorem to the logic of NHST reveals that statistical inference is an inference under uncertainty and not a strict computation or logical implication. Hence, we regard the p-value as a useful heuristic cue for estimating the tested hypothesis's posterior probability. In Krueger and Heck (Citation2017), we describe estimation techniques that help researchers identify expected values of p(H|D) and their associated ranges. We encourage researchers to use p-values for first-pass, heuristic inferences.

1.2. The Bayesian Alternative

The foregoing argument is pragmatic rather than purist. It does not make a clean separation between Bayesian and frequentist approaches. Instead, we treat the observed p-value as input for the estimation of the posterior probability of the tested hypothesis. Absent additional or contravening assumptions, a researcher might favor (if not ‘‘accept’’) a hypothesis if it is more likely than turning up heads in a coin flip (p(H|D) > 0.5). Similar pragmatic integrations of Bayesian and frequentist operations have been proposed elsewhere (Cohen Citation1994; Krueger Citation2001; Nickerson Citation2000; Trafimow Citation2003). Much of the current debate, however, is characterized by committed advocacy, which asks researchers to endorse a particular set of principles. The desire for a full conversion of researchers to a particular school of thought is understandable, given that the advocates of these schools take pains to work strictly within their own set of assumptions, and given that mixing approaches can beget confusion and contradiction (Gigerenzer Citation2004; Gigerenzer and Marewski Citation2015). Yet, in this particular case—estimating p(H|D) from p(D|H) using Bayes' theorem—we see little epistemological danger. Instead, we see an opportunity for researchers to better understand the properties of the p-value and its implications for the hypothesis of interest. In the spirit of this special issue on statistical inference, we advise researchers to familiarize themselves with the basic assumptions underlying contemporary schools of statistical thought and to treat available methods as tools in a box, to be used judiciously.

Consider the central assumption about what is perceived to vary and what is perceived to be a fixed condition. Frequentists treat hypotheses as fixed parameters; it may only be the null hypothesis, but it could also be a nonnull or substantive hypothesis, or it could be several hypotheses. Hence, analysis produces probabilities of data—which may vary due to sampling—and which are conditioned on the parameters (i.e., hypotheses). Bayesians, in contrast, treat the data as the conditions given by observation, while treating hypotheses as random variables, or distributed. Hence, Bayesians are interested in posterior probability distributions. If they select two discrete hypotheses, H and ∼H, from these distributions, they can compute odds ratios p(H|D)/p(∼H|D) or ratios of posterior over the prior odds (Ortega and Navarrete Citation2017). If they consider an entire distribution of hypotheses (i.e., where the hypothesis is a random variable), an infinite number of odds ratios awaits contemplation.

We suspect that many researchers find the Bayesian theory (theory = a way of seeing) of hypotheses as forming a distribution of possibilities counterintuitive—and not only because they have been trained in frequentist statistics. It seems more natural to ask if a particular idea or hypothesis is true or false than it is to consider a set of observations (the data) and represent knowledge and belief as a distribution over an infinite set of hypotheses. The latter does not allow a person to express the strength of belief in any particular idea but only to express how much stronger or weaker the belief is relative to some other particular idea. In contrast, the former (hypothesis-conditioned) way of thinking is aligned with any epistemological framework that begins with core theoretical assumptions, generates testable hypotheses, and ends with experimental data and inferences.

We are inclined—as we foreshadowed in Section 1—to recommend significance testing in small-sample experimentation and to supplement it with Bayesian inferences to estimate p(H|D). The p-value, or p(D|H), is a useful heuristic to infer p(H|D). The ratio p(D|H)/p(D|∼H) may be a stronger heuristic, but it becomes relevant only if researchers articulate a specific alternative to the null hypothesis. The ratio p(D|H)/p(D|∼H) is a likelihood ratio, LR, when it is computed with the values of the density function. Whether one uses probabilities (areas under the density curve) or likelihoods (values of the density function) makes little difference. Our simulations corroborate what a look at a unimodal distribution reveals, namely that both probabilities and likelihoods become smaller as one moves into the tail. For the normal distribution, for example, the correlation between the log-transformed likelihoods and log-transformed probabilities is nearly perfect (Krueger and Heck Citation2017).
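
A small numerical check of the last claim, under the assumption of a standard normal test statistic, is sketched below; it simply computes one-sided tail probabilities and densities over a range of values and correlates their logarithms.

    # Tail probabilities and densities shrink together for a unimodal distribution.
    import numpy as np
    from scipy import stats

    z = np.linspace(0.0, 4.0, 200)              # test statistics from the center into the tail
    log_tail_p = np.log(stats.norm.sf(z))       # log of the one-sided p-value
    log_density = np.log(stats.norm.pdf(z))     # log of the likelihood at z

    r = np.corrcoef(log_tail_p, log_density)[0, 1]
    print(f"Correlation of log p with log likelihood: {r:.4f}")   # very close to 1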

Now consider the relationship between the p-value (or p(D|H)) and the LR. Suppose the researcher has specified a discrete alternative to the null hypothesis such that H: δ = 0 (i.e., there is no effect between control and treatment on some outcome measure) and ∼H: δ = 0.5 (there is a standardized effect of 0.5 between control and treatment). When more than one experiment has been performed and some variation in the empirical effect size, d, has been observed, the hydraulic relationship between p(D|H) and p(D|∼H) is evident. The p-value now perfectly predicts the LR. If, however, as in our example from experimental psychology, the researcher cannot assign a specific effect size to the alternative hypothesis, and if the alternative (i.e., δ ≠ 0) is variable, then p(D|∼H) varies, while the observed data and p(D|H) remain constant. In this scenario, the correlation between p and the LR is not defined. Finally, if there are several experiments, perhaps gleaned from meta-analysis or replication attempts, both p(D|H) and p(D|∼H) are variable, and so is the LR. Here, a positive association between p and LR remains (Krueger and Heck Citation2018). In other words, the p-value continues to play a useful heuristic role in predicting the LR and p(H|D).
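
As a concrete, if hypothetical, illustration of the first scenario, the sketch below computes the p-value and the likelihood ratio for a few observed standardized effect sizes when the alternative is fixed at δ = 0.5; the per-group sample size of 40 and the normal approximation for d are assumptions.

    # p-value and likelihood ratio for H: delta = 0 versus ~H: delta = 0.5.
    import numpy as np
    from scipy import stats

    n = 40                                      # per-group sample size (assumed)
    se = np.sqrt(2.0 / n)                       # approximate standard error of d

    for d in (0.10, 0.25, 0.40, 0.55):          # hypothetical observed effect sizes
        p = 2 * stats.norm.sf(abs(d) / se)                              # two-sided p-value under H
        lr = stats.norm.pdf(d, 0.0, se) / stats.norm.pdf(d, 0.5, se)    # p(D|H) / p(D|~H)
        print(f"d = {d:.2f}:  p = {p:.4f},  LR = {lr:.3f}")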

Our analysis suggests the following conclusion: researchers may wish to supplement p-values with LRs if they can present a rationale for particular alternative hypotheses. In such cases, which remain rare in experimental psychology, they can estimate the posterior probabilities of both hypotheses with greater precision. Either way, it is unwise to discard the p-value entirely as it is generally unwise to ignore the building blocks of composite scores such as ratios or discrepancies (Krueger, Heck, and Asendorpf Citation2017).

1.3. Errare Humanum Est, Sed Quem Errorem?

One version of the concern that p-values seduce the unwary to infer the posterior probability of the tested hypothesis manifests in warnings about inflated Type I errors (Colquhoun Citation2015; Ioannidis Citation2005; but see Stroebe Citation2016). Type I errors are false positives, FP, in decision-theoretic terms (Swets, Dawes, and Monahan Citation2000). In the context of NHST, FPs refer to mistaken beliefs in something that does not exist. FP aversion directly follows from the worry about the reverse inference fallacy discussed in the first section. If researchers rush to judge that p(H|D) < 0.5 whenever p(D|H) < 0.05, they may log many results as discoveries even though p(H|D) remains greater than 0.5. This is most likely to happen when the work is risky, that is, when p(H) is very high. FP aversion has spawned various recommendations for error control, most of which amount to a call for larger samples. Yet, the primary result of increased sample size is a reduction of Type II errors (i.e., failures to reject false null hypotheses). Although larger samples can raise FP rates, they tend to lower FP ratios (when Hits [rejections of false null hypotheses] rise faster than FPs; Krueger and Heck Citation2017).
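
A rough simulation, ours rather than the one reported in Krueger and Heck (Citation2017), can illustrate the point about sample size. With α held at 0.05, a prior of 0.5 for the null, and an effect of 0.5 under the alternative (all assumptions for the sketch), false positives hover near the α rate while hits grow with n, so the ratio of false positives among significant results falls.

    # False-positive counts versus false-positive ratios as the sample grows.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=3)
    alpha, delta, prior_h, reps = 0.05, 0.5, 0.5, 2000

    for n in (20, 80, 320):                     # per-group sample sizes
        fp = hits = 0
        for _ in range(reps):
            null_true = rng.random() < prior_h
            effect = 0.0 if null_true else delta
            x = rng.normal(0.0, 1.0, n)
            y = rng.normal(effect, 1.0, n)
            _, p = stats.ttest_ind(y, x)
            if p < alpha:
                if null_true:
                    fp += 1
                else:
                    hits += 1
        print(f"n = {n:3d}:  FPs = {fp},  hits = {hits},  FP ratio = {fp / (fp + hits):.3f}")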

An alternative recommendation is to lower the significance threshold. Lowering the threshold to p < 0.005, for example (Benjamin et al. Citation2017; or to p < 0.001, Johnson et al. Citation2017), will reduce the number of FP errors, but at the cost of increasing the number of Type II errors to an unknown degree (Fiedler, Kutzner, and Krueger Citation2012). As significance testing becomes more conservative, FP ratios (the proportion of significant findings leading to false inferences) may even increase (Krueger and Heck Citation2017). We recommend, in the spirit of Neyman and Pearson (Citation1933a, Citationb), that researchers carefully think about the utility they assign to the two types of error before ritualistically endorsing a conventional scheme such as α = 0.05 and β = 0.2 (Cohen Citation1977; Citation1992; Erdfelder, Faul, and Buchner Citation1996).
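
A companion sketch, using the same assumptions as the one above but holding the sample size fixed, shows the trade-off produced by shrinking the threshold: false positives become rare while Type II errors accumulate at a rate that depends on the unknown effect size and prior.

    # Effect of lowering alpha at a fixed sample size (n = 40 per group, assumed).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=4)
    n, delta, prior_h, reps = 40, 0.5, 0.5, 2000

    for alpha in (0.05, 0.005, 0.001):
        fp = misses = 0
        for _ in range(reps):
            null_true = rng.random() < prior_h
            effect = 0.0 if null_true else delta
            x = rng.normal(0.0, 1.0, n)
            y = rng.normal(effect, 1.0, n)
            _, p = stats.ttest_ind(y, x)
            if p < alpha and null_true:
                fp += 1
            if p >= alpha and not null_true:
                misses += 1
        print(f"alpha = {alpha}:  false positives = {fp},  Type II errors = {misses}")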

To put these concerns in perspective, consider Meehl's (Citation1978) “strong use” of significance testing as a thought experiment (see also Antonakis Citation2017). Meehl, being a Popperian falsificationist at the time, asked us to imagine that a specific nonnull hypothesis (which we have referred to as ∼H) is subjected to the ordeal of the significance test. The meanings of the two errors are now reversed. A Type I error (FP) has become the rejection of a true substantive hypothesis, inviting a belief in a false null. In contrast, a Type II error (Miss) is now the failure to reject a false substantive hypothesis, inviting a false belief that the null is false. Consider the implications of these two suggested tactics of error control, increasing power and reducing significance thresholds.

The main consequence of increasing sample size—and thereby power—is the reduction of Type II errors. In the Meehl-Popper strong-inference scenario, a true null hypothesis is more likely to be detected. This prospect might subtly motivate the researcher to limit data collection. From a Popperian perspective, it makes good sense to demand high power. Alternatively, a lowering of the significance threshold to, say, 0.005, will make it harder for researchers to demonstrate invariances (i.e., the absence of an effect). When theory predicts a substantive effect, the conservative researcher may wish to relax the criterion of significance, whereas the liberal researcher would tighten it. Again, our recommendation is for researchers to consult relevant theory before deciding whether they can (or should) put a nonnull hypothesis to the test and how strict this test should be (see also Lakens et al. Citation2018). If theory or compelling convention does not require unbalanced a priori error rates, we think it prudent to use the same probabilities for both types of error (e.g., α = β = 0.1; see also Bredenkamp Citation1980; as well as Neyman and Pearson themselves Citation1933a, Citationb).
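
For readers who wish to see what a balanced scheme implies in practice, the following back-of-the-envelope calculation (ours) uses the standard normal approximation to find the per-group sample size for a two-sided, two-sample comparison with α = β = 0.1; the assumed effect size of δ = 0.5 is, again, only an illustration.

    # Per-group n for a two-sample test with balanced error rates (normal approximation).
    from scipy import stats

    alpha, beta, delta = 0.10, 0.10, 0.5
    z_alpha = stats.norm.ppf(1 - alpha / 2)     # two-sided significance criterion
    z_beta = stats.norm.ppf(1 - beta)           # criterion implied by the desired power
    n_per_group = 2 * ((z_alpha + z_beta) / delta) ** 2
    print(f"Roughly {n_per_group:.0f} participants per group")   # about 69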

The discourse of error presupposes decisions regarding significance and decisions need criteria. The conventional criterion for declaring significance is p < 0.05, and much has been made of the presumed rigidity of its application (Greenland Citation2017; Wasserstein and Lazar Citation2016). A radical response is to surrender all pretension of decision-making and to limit inference to the task of judgment by, for example, estimating a probable range of values for p(H|D) given the obtained p-value and auxiliary assumptions (Krueger and Heck Citation2017).

As experimentalists, we are reluctant to relinquish dichotomous decision-making entirely. The first reason for this traditionalism is that for many questions humans ask of nature, the null (or any particular tested hypothesis) may in fact be true.Footnote1 Nature presents some true-false dichotomies, and it would be unwise to use a decision-making scheme that ignores this. A profusion of pseudo-, para-, and alternative sciences thrives on false claims. They assert the presence of not-nothing when there is nothing (Krueger, Vogrincic-Haselbacher, and Evans, Citationin preparation). The second epigraph of this article may serve as an illustration.

It is easy to see the truth of this claim (this claim itself has a good probability of being true) when considering categorical data. We rightly expect categorical questions to receive true-or-false answers. In contrast, much of the statistical debate (including this short article) plays out in the world of continuous variables. Here, the claim that the tested hypothesis cannot be literally true is an artifact of allowing each specific point on the scale to be no more than one among an infinity of points. Such a point has no probability, only a likelihood—and as a number, this likelihood is meaningful only when compared with (e.g., divided by) another number.

Returning to our example of the experiment on creativity, the researcher is tasked with assigning respondents randomly to conditions. Comparing the experimental with the control condition, we may wonder if any difference, however small, might eventually yield significance without being an FP. If, however, we compare two control conditions, both created at random and each without a treatment, we would know that p(H) = 1 and that any p < 0.05 could only be an FP. We now infer that if the null hypothesis can be true when we ensure that it is, it can also be true when we do not.

The second reason for allowing a judicious use of dichotomous inference is psychological. Categorization is built into perception (Bruner, Goodnow, and Austin Citation1956). To perceive is to categorize. We cannot see Argos without also seeing a dog, and we cannot see Eumaeos without seeing a man, a Greek, and a swineherd. Categorization affords induction. Recognized as a dog, Argos tells us much about his conspecifics (e.g., love of his master); if we ignored the inductive power of categorical perception, we would be condemned to encounter each dog as a novel creature. Ditto for Eumaeos. Following Bruner and other pioneers, Tajfel (Citation1959; Citation1969) proposed accentuation theory to formalize the interplay of categorization and perception of stimuli falling on a graded scale. Accentuation theory predicts that if there is a categorical line drawn somewhere on the continuum, stimuli falling to the left (lower) of this line will be seen as having been shifted to the left, while stimuli falling to the right (higher) will be seen as having been shifted to the right. The theory also predicts that this effect is strongest near the line itself, which results in a perceptual narrowing of each category (Krueger and Clement Citation1994).

Consider the implications of accentuation theory for the perception of p-values. Values < 0.05 will be regarded as small, while values > 0.05 will be regarded as large, with little discrimination among values falling on the same side of the divide. Some may take this perceptual sharpening as evidence of the dangers of dichotomous statistical inference (Gelman and Stern Citation2006); others will view it as an unavoidable byproduct of categorization. We side with the latter group, thinking that binary statistical inference ultimately stands in the service of action. Neyman and Pearson (Citation1933a, Citationb) explicitly subordinated statistical decision-making to action, and Fisher (Citation1956) did so implicitly—at best. When a fertilizer has been tested, the agronomist must decide whether to use it. We ask researchers to be mindful of accentuation theory and to consider if and how their work is connected to decisions about action. In this exposition of our findings, we can only mention, but not explore, the differences between inference and decision, and the complexities arising along the road from statistical inference to scientific inference to practical inference. Suffice it to note our agreement with Fisher (Citation1959, p. 100) who wrote that “an important difference is that decisions are final, while the state of opinion derived from a test of significance is provisional, and capable, not only of confirmation, but of revision.”

1.4. Statistics as a Social Process

The future of statistical practice will not be decided by logical proof or empirical test. Try to imagine what a critical experiment might look like! Would the data be analyzed with significance tests, model-fitting techniques, or subjective Bayesian interpretation? How would you go about managing the two types of error in this experiment? This is a difficult question: if our methods are designed to test the predictions derived from our theories, how might the methods themselves and their parent theories be tested? We submit that the acceptance of a set of methods is largely a matter of social process, rhetoric, and Zeitgeist. There surely are innovations, reforms, and refinements of methods that can be strictly justified on logico-mathematical grounds, but such advances are usually made within the context of a paradigm or school of thought. A recent example is the introduction of the p-rep index, which was meant to replace the conventional p-value (Killeen Citation2005). p-rep, unlike p, was said to reveal the probability of successful replication. It soon turned out that p-rep was a mere log-linear transformation of p, and so the index went as quickly as it had come (Wagenmakers Citation2007).Footnote2 Other innovations have found traction, however, such as the now widespread use of multi-level regression models (Austin Citation2017).

When schools of thought whose most central thoughts are incommensurable (Feyerabend Citation1976; Fleck Citation1935; Kuhn Citation1962) vie for the affections of bench scientists, rhetoric and social process become relevant. The result is a cross-purpose debate. In the current climate, Bayesians emphasize the coherence of their methods compared with the ease with which significance tests produce incoherent patterns (e.g., inferences violating the axiom of transitivity). Significance testers, in contrast, emphasize the vulnerability of Bayesian methods to belief bias (Revlin et al. Citation1980). False priors make bad estimates, unless the data are really big—in which case neither Bayesian nor frequentist methods are needed. Frequentists may turn some of the criticisms leveled at significance testing back on the Bayesians. The dangers of HARKing (Hypothesizing After the Results are Known; Kerr Citation1998), for example, apply to both schools, as do the dangers of limiting sample size (Krueger and Heck Citation2017). Other attempts to reform research practice do not affect the analytical methods themselves. Preregistration regimens are intended to protect researchers from one another and themselves (e.g., their implicit biases and p-hacking self-deceptions), but are mute on the logic of inductive inference.

When there are no incontestable criteria to settle methodological debates, social processes unfold that might lead to the eventual dominance of one school of thought. These processes are ‘soft’ in that they cannot leverage irrefutable proof or unquestionable authority. The limits of authority may be found at the level of the journal editor who by way of Diktat declares certain methods out of bounds. Reporting of p-rep was declared obligatory by the journal Psychological Science; its demise was less formal. The editors of Basic and Applied Social Psychology banned p-values from their pages (Trafimow and Marks Citation2015). Professional societies convene task forces to work out recommendations for statistical reporting, but, as part of the social process, the staffing of such task forces must not be partisan (Wasserstein and Lazar Citation2016; Wilkinson and APA Task Force Citation1999). As a result, the recommendations tend to be encouragements rather than pre- or proscriptions.

More forceful demands surface when groups of statisticians and researchers band together to issue calls for changes in practice. Recently, Benjamin et al. (Citation2017) proposed a lowering of the conventional significance threshold to 0.005, conceding that this was a patch, and that elegant solutions were being prepared, but that there was no consensus yet. The ‘‘et al.’’ of this group were 71 individuals, many of whom are quite distinguished. This is a large number. It dramatizes the claimed consensus that something needs to be done. The social psychology of this tactic is to disarm resistance by leveraging the heuristic of ‘‘social proof’’ (Cialdini Citation1984). If all these statisticians agree—the bench scientist is nudged to infer—then I should shrink my alpha. In a réplique, Lakens recruited 87 co-authors to argue against any privileged alpha level (Lakens et al. Citation2018). Reasonable as their arguments are, the use of social proof wears thin when it begins to smack of mimicry and when the ratio of the number of written words to the number of authors becomes distressingly low (here: 22). There ought to be a minimum criterion value for that!

We conclude that the social forces that shape research practice are part and parcel of any evolving science. Instead of wishing them away, we'd do well to understand them, and hope they will move the practice of science forward.

1.5. Conclusion

In 2001, one of us (Krueger Citation2001) predicted that NHST would outlast many of its challenges. This prediction was partly based on the method's intrinsic value, partly on the beside-the-pointness of some of the critiques, and partly on the fact that by that time, NHST had already shown itself to be resilient. These three reasons are not independent of one another, and there is no reason for complacency. Like others before us (Abelson Citation1995; Wasserstein and Lazar Citation2016; Wilkinson and the APA Task Force on Statistical Inference Citation1999), we wish to impress on research scientists that the p-value is a mere heuristic. It has predictive value, but it guarantees that some false inferences will be made. The p-value cannot do all of the inductive work; no single method can. We join those who recommend researchers use a toolbox of statistical techniques, employ good judgment, and keep an eye on developments in statistical and data science.

Notes

1 The argument that any point hypothesis is false (e.g., Cohen Citation1990; Gelman and Carlin Citation2017) is a mathematical artifact of the assumption that there is an infinite number of hypotheses (Krueger Citation2001). It would be more accurate to say that the probability that such a hypothesis is true is indeterminate; that is, such a hypothesis would be neither true nor false irrespective of the evidence.

2 New indices with little or no new informational input occasionally appear in the literature. Bayarri, Benjamin, Berger, and Sellke (Citation2016), for example, suggested that p-values be replaced by upper-bound Bayes factors of the form 1/(−e p log p). Here, all the variation is in the p-value.

3 A trolleyologist is someone doing empirical research on philosophical thought experiments such as Philippa Foot's trolley-footbridge dilemma (Foot Citation1978/1967).

References

  • Abelson, R. P. (1995), Statistics as Principled Argument, Hillsdale, NJ: Erlbaum.
  • Antonakis, J. (2017), “On Doing Better Science: From Thrill of Discovery to Policy Implications,” The Leadership Quarterly, 28, 5–21.
  • Austin, P. C. (2017), “A Tutorial on Multi-Level Survival Analysis: Methods, Models and Applications,” International Statistical Review, 85, 185–203.
  • Bayarri, M. J., Benjamin, D. J., Berger, J. O., and Sellke, T. M. (2016), “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses,” Journal of Mathematical Psychology, 72, 90–103.
  • Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017), “Redefine Statistical Significance”, available at osf.io/preprints/psyarxiv/mky9j
  • Bredenkamp, J. (1980), Theorie und Planung psychologischer Experimente [Theory and Design of Psychological Experiments], Darmstadt, Germany: Steinkopff.
  • Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1956), A Study of Thinking, New York: Wiley.
  • Cialdini, R. (1984), Influence: The Psychology of Persuasion, Boston, MA: Allyn & Bacon.
  • Cohen, J. (1977), Statistical Power Analysis for the Behavioral Sciences, New York: Academic Press.
  • ——— (1990), “Things I Have Learned (So Far),” American Psychologist, 45, 1304–1312.
  • ——— (1992), “A Power Primer,” Psychological Bulletin, 112, 155–159.
  • ——— (1994), “The Earth is Round (p < .05),” American Psychologist, 49, 997–1003.
  • Colquhoun, D. (2015), “An Investigation of the False Discovery Rate and the Misinterpretation of p-Values,” Royal Society Open Science, 1, 140216.
  • Dawes, R. M. (1988), Rational Choice in an Uncertain World, San Diego, CA: Harcourt, Brace, Jovanovich.
  • Dawes, R. M., and Smith, T. L. (1985), “Attitude and Opinion Measurement,” in Handbook of Social Psychology (3rd ed.), eds. G. Lindzey & E. Aronson, New York: Random House, pp. 509–566.
  • Erdfelder, E., Faul, F., and Buchner, A. (1996), “GPOWER: A General Power Analysis Program,” Behavior Research Methods, Instruments & Computers, 28, 1–11.
  • Feyerabend, P. K. (1976), Against Method, New York: Humanities Press.
  • Fisher, R. A. (1956), Statistical Methods and Scientific Inference, Oxford, UK: Hafner Publishing Co.
  • Fiedler, K., Kutzner, F., and Krueger, J. I. (2012), “The Long Way from Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate,” Perspectives on Psychological Science, 7, 661–669.
  • Fleck, L. (1935), Entstehung Und Entwicklung Einer Wissenschaftlichen Tatsache. Einführung In Die Lehre Vom Denkstil Und Denkkollektiv, eds. L. Schäfer and T. Schnelle, Frankfurt, Germany: Benno Schwabe & Co.
  • Foot, P. (1978/1967), Virtues and Vices, Oxford, UK: Blackwell. Originally published in the Oxford Review, 5, 1967.
  • Gelman, A., and Carlin, J. (2017), “Some Natural Solutions to the p-Value Communication Problem—and Why They Won't Work,” Journal of the American Statistical Association, 112, 899–901.
  • Gelman, A., and Stern, H. (2006), “The Difference Between ‘Significant’ and ‘Not Significant’ is Not Itself Statistically Significant,” The American Statistician, 60, 328–331.
  • Gigerenzer, G. (2004), “Mindless Statistics,” The Journal of Socio-Economics, 33, 587–606.
  • Gigerenzer, G., and Marewski, J. (2015), “Surrogate Science: The Idol of A Universal Method for Scientific Inference,” Journal of Management, 41, 421–440.
  • Gilhooly, K. J., Georgiou, G., and Devery, U. (2012), “Incubation and Creativity: Do Something Different,” Thinking & Reasoning, 19, 137–149.
  • Goodman, S. (2008), “A Dirty Dozen: Twelve p-Value Misconceptions,” Seminars in Hematology, 45, 135–140.
  • Greenland, S. (2017), “The Need for Cognitive Science in Methodology,” American Journal of Epidemiology, 186, 639–645.
  • Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016), “Statistical Tests, p-values, Confidence Intervals, and Power: A Guide to Misinterpretations,” European Journal of Epidemiology, 31, 337–350.
  • Ioannidis, J. P. A. (2005), “Why Most Published Research Findings are False,” PLoS Medicine, 2, e124.
  • Johnson, V. E., Payne, R. D., Wang, T., Asher, A., and Mandal, S. (2017), “On the Reproducibility of Psychological Science,” Journal of the American Statistical Association, 112, 1–10.
  • Kerr, N. L. (1998), “HARKing: Hypothesizing After the Results are Known,” Personality and Social Psychology Review, 2, 196–217.
  • Killeen, P. R. (2005), “An Alternative to Null-Hypothesis Significance Tests,” Psychological Science, 16, 345–353.
  • Krueger, J. I. (2001), “Null Hypothesis Significance Testing: On the Survival of a Flawed Method,” American Psychologist, 56, 16–26.
  • ——— (2017), “Reverse Inference,” in Psychological Science Under Scrutiny: Recent Challenges And Proposed Solutions, eds. S. O. Lilienfeld & I. D. Waldman, New York: Wiley, pp. 110–124.
  • Krueger, J., and Clement, R. W. (1994), “Memory-Based Judgments About Multiple Categories: A Revision and Extension of Tajfel's Accentuation Theory,” Journal of Personality and Social Psychology, 67, 35–47.
  • Krueger, J. I., and Heck, P. R. (2017), “The Heuristic Value of p in Inductive Statistical Inference,” Frontiers in Psychology, available at https://www.frontiersin.org/articles/10.3389/fpsyg.2017.00908/full.
  • ——— (2018), “Testing Significance Testing,”
  • Krueger, J. I., Heck, P. R., and Asendorpf, J. B. (2017), “Self-enhancement: Conceptualization and Assessment,” Collabra: Psychology, 3, 28.
  • Krueger, J. I., Vogrincic-Haselbacher, C., and Evans, A. M. (in preparation), “We Need a Credible Theory of Gullibility,” to appear in Homo Credulus: The Social Psychology of Gullibility [The 20th Sydney Symposium on Social Psychology], eds. J. P. Forgas and R. F. Baumeister, New York: Taylor & Francis.
  • Kuhn, T. S. (1962), The Structure of Scientific Revolutions, Chicago, IL: University of Chicago Press.
  • Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018), “Justify Your Alpha,” Nature Human Behaviour, 2, 168–171.
  • Meehl, P. E. (1978), “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, And The Slow Progress of Soft Psychology,” Journal of Consulting and Clinical Psychology, 46, 806–834.
  • Mischel, W. (1968), Personality and Assessment, Mahwah, NJ: Erlbaum.
  • Neyman, J., and Pearson, E. S. (1933a), “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 231, 694–706.
  • ——— (1933b), “The Testing of Statistical Hypotheses in Relation to Probabilities a Priori,” Mathematical Proceedings of the Cambridge Philosophical Society, 29, 492–510.
  • Nickerson, R. S. (2000), “Null hypothesis Significance Testing: A Review of an Old and Continuing Controversy,” Psychological Methods, 5, 241–301.
  • Ortega, A., and Navarrete, G. (2017), “Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST) in Psychology,” in Bayesian Inference, ed. J. P. Tejedor, Open access, doi:10.5772/intechopen.70230.
  • Ostrowski, B. (2017), “Vujà Dé: The Effects of Incubation On Creativity,” unpublished honors thesis in psychology, Providence, RI: Brown University.
  • Revlin, R., Leirer, V., Yopp, H., and Yopp, R. (1980), “The Belief-Bias Effect in Formal Reasoning: The Influence of Knowledge On Logic,” Memory & Cognition, 8, 584–592.
  • Stigler, S. M. (1999), Statistics on the Table: The History Of Statistical Concepts and Methods, Cambridge, MA: Harvard University Press.
  • Stroebe, W. (2016), “Are most Published Social Psychological Findings False?” Journal of Experimental Social Psychology, 66, 134–144.
  • Swets, J. A., Dawes, R. M., and Monahan, J. (2000), “Psychological Science can Improve Diagnostic Decisions,” Psychological Science in the Public Interest, 1, 1–26.
  • Tajfel, H. (1959), “Quantitative Judgement in Social Perception,” British Journal of Psychology, 50, 16–29.
  • ——— (1969), “Cognitive Aspects of Prejudice,” Journal of Social Issues, 25, 79–97.
  • Trafimow, D. (2003), “Hypothesis Testing and Theory Evaluation at the Boundaries: Surprising Insights from Bayes's Theorem,” Psychological Review, 110, 526–535.
  • Trafimow, D., and Marks, M. (2015), “Editorial,” Basic and Applied Social Psychology, 37, 1–2.
  • Wagenmakers, E.-J. (2007), “A Practical Solution to the Pervasive Problems of p-values,” Psychonomic Bulletin & Review, 14, 779–804.
  • Wasserstein, R. L., and Lazar, N. A. (2016) “The ASA's Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70, 129–133; 21 online expert commentaries available at http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108?scroll=top.
  • Wilkinson, L., and the Task Force on Statistical Inference (1999), “Statistical Methods in Psychology Journals: Guidelines and Explanations,” American Psychologist, 54, 594–604.

Appendix

To guard against the impression that our appeal to good judgment is mere handwaving and buck-passing, we offer two examples from recent experience and vivid memory. The first example features a trolleyologistFootnote3 explaining the decision to test the significance of 44 successes out of 50 trials against a chance expectation of 25. “I would have done a significance test if I had 50 out of 50 successes,” the researcher continued, “because I know the editor would have demanded it.” But why perform a significance test to reject the chance hypothesis when the results are so clear that a naked-eye test reveals the effect? To say that a test would have been performed if the result had been 50 out of 50 is to endorse and perpetuate an empty ritual. We advise the use of good judgment to oppose such rituals, and look forward to seeing more articles in which striking results are reported using only descriptive statistics.
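
For the curious reader, an exact binomial test (our check, assuming 50 trials and a chance probability of 0.5) only confirms what the naked eye already sees.

    # Exact binomial test of 44 successes in 50 trials against chance.
    from scipy import stats

    result = stats.binomtest(44, n=50, p=0.5, alternative="greater")
    print(f"One-sided exact binomial p = {result.pvalue:.2e}")    # on the order of 1e-8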

The second example involves a correlation between a trait measure of moral orientation and a specific moralistic action. This correlation turned out to be +0.08. With over 900 degrees of freedom, this correlation was “highly significant.” When asked whether a correlation of +0.02, if significant in a much larger sample, would be deemed satisfactory as well, the researcher said “yes.” What might be done to de-absurdify this approach to data analysis? For this case, we recommend that a plausible substantive hypothesis be selected for strong significance testing in Meehl's sense or for the estimation of a likelihood ratio. To wit, a correlation of +0.3 may be installed to represent ∼H, that is, the baseline expectation of how well general traits predict specific behaviors (Dawes and Smith Citation1985; Mischel Citation1968).
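
A sketch of what such a strong test might look like, under simple assumptions (a sample of roughly 900, Fisher's z-transformation, and a point alternative of ρ = 0.3), is given below; it is meant as an illustration of the recommendation, not as a reanalysis of the study in question.

    # Testing an observed r = .08 against a substantive alternative of rho = .3.
    import numpy as np
    from scipy import stats

    n, r_obs, rho_null, rho_alt = 900, 0.08, 0.0, 0.3
    se = 1.0 / np.sqrt(n - 3)                       # standard error of Fisher's z
    z_obs, z_null, z_alt = np.arctanh([r_obs, rho_null, rho_alt])

    p_null = 2 * stats.norm.sf(abs(z_obs - z_null) / se)            # conventional test of rho = 0
    lr = stats.norm.pdf(z_obs, z_null, se) / stats.norm.pdf(z_obs, z_alt, se)

    print(f"p-value against rho = 0: {p_null:.4f}")                 # significant by convention
    print(f"LR, rho = 0 versus rho = 0.3: {lr:.1e}")                # strongly favors the null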