
Reconnecting p-Value and Posterior Probability Under One- and Two-Sided Tests

Pages 265-275 | Received 24 Feb 2019, Accepted 11 Jan 2020, Published online: 25 Feb 2020

Abstract

As a convention, the p-value is often computed in frequentist hypothesis testing and compared with the nominal significance level of 0.05 to determine whether or not to reject the null hypothesis. The smaller the p-value, the more significant the statistical test is deemed. Under noninformative prior distributions, we establish the equivalence relationship between the p-value and the Bayesian posterior probability of the null hypothesis for one-sided tests and, more importantly, the equivalence between the p-value and a transformation of posterior probabilities of the hypotheses for two-sided tests. For two-sided hypothesis tests with a point null, we recast the problem as a combination of two one-sided hypotheses along opposite directions and establish the notion of a "two-sided posterior probability," which reconnects with the (two-sided) p-value. In contrast to the common belief, such an equivalence relationship gives the p-value an explicit interpretation of how strongly the data support the null. Extensive simulation studies are conducted to demonstrate the equivalence relationship between the p-value and the Bayesian posterior probability. Contrary to broad criticisms of the use of the p-value in evidence-based studies, we justify its utility and reclaim its importance from the Bayesian perspective.

1 Introduction

Hypothesis testing is ubiquitous in modern statistical applications and permeates many different fields, such as biology, medicine, psychology, economics, and engineering. As a critical component of the hypothesis testing procedure (Lehmann and Romano Citation2005), the p-value is defined as the probability of observing data as extreme as, or more extreme than, the observed data, given that the null hypothesis is true. Conventionally, the statistical significance level or Type I error rate is set at 5%, so that a p-value below 5% is considered statistically significant, leading to rejection of the null hypothesis.

Although the p-value is the most commonly used summary measure of the evidence in the data regarding the null hypothesis, it has been the center of controversies and debates for decades. To clarify ambiguities surrounding the p-value, the American Statistical Association (Wasserstein and Lazar Citation2016) issued statements on p-values and, in particular, the second point states that "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone." It is often argued that the p-value only provides information on how incompatible the data are with the null hypothesis, but gives no information on how likely the data would be under the alternative hypothesis.

Extensive investigations have been conducted on the properties of the p-value and its inadequacy as a summary statistic. Rosenthal and Rubin (Citation1983) studied how p-value can be adjusted to allow for greater power when an order of importance exists in the hypothesis tests. Royall (Citation1986) investigated the effect of sample size on p-value. Schervish (Citation1996) described computation of the p-value for one-sided point null hypotheses, and also discussed the intermediate interval hypothesis. Hung et al. (Citation1997) studied the behavior of p-value under the alternative hypothesis, which depends on both the true value of the tested parameter and sample size. Rubin (Citation1998) proposed an alternative randomization-based p-value for double-blind clinical trials with noncompliance. Sackrowitz and Samuel-Cahn (Citation1999) promoted more widespread use of the expected p-value in practice. Donahue (Citation1999) suggested that the distribution of the p-value under the alternative hypothesis would provide more information for rejection of implausible alternative hypotheses. As there is a widespread notion that medical research is interpreted mainly based on p-value, Ioannidis (Citation2005) claimed that most of the published findings in medicine are false. Hubbard and Lindsay (Citation2008) showed that p-value tends to exaggerate the evidence against the null hypothesis. Simmons, Nelson, and Simonsohn (Citation2011) demonstrated that p-value is susceptible to manipulation to reach the significance level of 0.05 and cautioned against its use. Nuzzo (Citation2014) gave an editorial on why p-value alone cannot serve as adequate statistical evidence for inference.

Criticisms and debates on p-value and null hypothesis significance testing have become more contentious in recent years. Focusing on discussions surrounding p-values, a special issue of The American Statistician (2019) contains many proposals to adjust, abandon or provide alternatives to p-value (e.g., Benjamin and Berger Citation2019; Betensky Citation2019; Billheimer Citation2019; Manski Citation2019; Matthews Citation2019, among others). Several academic journals, for example, Basic and Applied Social Psychology and Political Analysis, have made formal claims to avoid the use of p-value in their publications (Trafimow and Marks Citation2015; Gill Citation2018). Fidler et al. (Citation2004) and Ranstam (Citation2012) recommended use of the confidence interval as an alternative to p-value, and Cumming (Citation2014) called for abandoning p-value in favor of reporting the confidence interval. Colquhoun (Citation2014) investigated the issue of misinterpretation of p-value as a culprit for the high false discovery rate. Concato and Hartigan (Citation2016) suggested that p-value should not be the primary focus of statistical evidence or the sole basis for evaluation of scientific results. McShane et al. (Citation2019) recommended that the role of p-value as a threshold for screening scientific findings should be demoted, and that p-value should not take priority over other statistical measures. In the aspect of reproducibility concerns of scientific research, Johnson (Citation2013) traced one major cause of nonreproducibility as the routine use of the null hypothesis testing procedure. Leek et al. (Citation2017) proposed abandonment of p-value thresholding and transparent reporting of false positive risk as remedies to the replicability issue in science. Benjamin et al. (2018) recommended shifting the significance threshold from 0.05 to 0.005, while Trafimow et al. (Citation2018) argued that such a shift is futile and unacceptable.

Bayesian approaches are often advocated as an alternative solution to the various aforementioned issues related to the p-value. Goodman (Citation1999) strongly supported use of the Bayes factor in contrast to p-value as a measure of evidence for medical evidence-based research. Rubin (Citation1984) introduced the predictive p-value as the tail-area probability of the posterior predictive distribution. In the applications to psychology, Wagenmakers (Citation2007) revealed the issues associated with p-value and recommended use of the Bayesian information criterion instead. Briggs (Citation2017) proposed that p-value should be proscribed and be substituted with the Bayesian posterior probability, while Savalei and Dunn (Citation2015) expressed skepticism on the utility of abandoning p-value and resorting to alternative hypothesis testing paradigms, such as Bayesian approaches, in solving the reproducibility issue.

On the other hand, extensive research has been conducted in an attempt to reconcile or account for the differences between frequentist and Bayesian hypothesis testing procedures (Pratt Citation1965; Berger Citation2003; Bayarri and Berger Citation2004). For hypothesis testing, Berger and Sellke (Citation1987), Berger and Delampady (Citation1987), and Casella and Berger (Citation1987) investigated the relationships between p-value and the Bayesian measure of evidence against the null hypothesis. In particular, they provided an in-depth study of one-sided hypothesis testing and point null cases, and also discussed the posterior probability of the null hypothesis with respect to various prior distributions including the mixture prior distribution with a point mass at the null and the other more broad distribution over the alternative (Lindley Citation1957). Sellke, Bayarri, and Berger (Citation2001) proposed to calibrate p-value for testing precise null hypotheses.

Although p-value is often regarded as an inadequate representation of statistical evidence, it has not stalled the scientific advancement in the past years. Jager and Leek (Citation2014) surveyed publications in high-profile medical journals and estimated the rate of false discoveries in the medical literature using reported p-values as the data, which led to a conclusion that the medical literature remains a reliable record of scientific progress. Murtaugh (Citation2014) defended the use of p-value on the basis that it is closely related to the confidence interval and Akaike’s information criterion.

By definition, the p-value is not the probability that the null hypothesis is true. However, contrary to the conventional notion, the p-value does have a simple and clear Bayesian interpretation in many common cases. Under noninformative priors, the p-value is asymptotically equivalent to the Bayesian posterior probability of the null hypothesis for one-sided tests, and is equivalent to a transformation of the posterior probabilities of the hypotheses for two-sided tests. For hypothesis tests with binary outcomes, we derive the asymptotic equivalence based on the theoretical results in Dudley and Haughton (Citation2002), and conduct simulation studies to corroborate the connection. For normal outcomes with known variance, we derive the analytical equivalence between the posterior probability and the p-value; for cases where the variance is unknown, we rely on simulations to show the empirical equivalence when the prior distribution is noninformative. Furthermore, we extend the equivalence results to two-sided hypothesis testing problems, where most of the controversies and discrepancies exist. In particular, we formulate a two-sided test as a combination of two one-sided tests along opposite directions, and introduce the notion of a "two-sided posterior probability" which matches the p-value from a two-sided hypothesis test. It is worth emphasizing that our approach for two-sided hypothesis tests is novel and distinct from existing approaches, in which a probability point mass is typically placed on the null hypothesis. We assume a continuous prior distribution and establish an equivalence relationship between the p-value and a transformation of the posterior probabilities of the two opposite alternative hypotheses.

The rest of the article is organized as follows. In Section 2, we present a motivating example that shows the similarity in operating characteristics of a frequentist hypothesis test and its Bayesian counterpart using the posterior probability. Section 3 shows that p-value and the posterior probability have an equivalence relationship for binary data, and Section 4 draws a similar conclusion for normal data. Finally, Section 5 concludes with some remarks.

2 Motivating Example

Consider a two-arm clinical trial comparing the response rates of an experimental treatment and the standard of care, denoted by $p_E$ and $p_S$, respectively. In a one-sided hypothesis test, we formulate
$$H_0: p_E \le p_S \quad \text{versus} \quad H_1: p_E > p_S. \tag{1}$$

Under the frequentist approach, we construct the Z-test statistic
$$Z = \frac{\hat{p}_E - \hat{p}_S}{[\{\hat{p}_E(1-\hat{p}_E) + \hat{p}_S(1-\hat{p}_S)\}/n]^{1/2}}, \tag{2}$$
where $n$ is the sample size per arm, $\hat{p}_E = y_E/n$ and $\hat{p}_S = y_S/n$ are the sample proportions, and $y_E$ and $y_S$ are the numbers of responders in the respective arms. We reject the null hypothesis if $Z \ge z_\alpha$, where $z_\alpha$ is the $100(1-\alpha)$th percentile of the standard normal distribution.

Under the Bayesian framework, we assume beta prior distributions for $p_E$ and $p_S$, that is, $p_E \sim \mathrm{Beta}(a_E, b_E)$ and $p_S \sim \mathrm{Beta}(a_S, b_S)$. The binomial likelihood function for group $g$ is
$$P(y_g \mid p_g) = \binom{n}{y_g} p_g^{y_g} (1-p_g)^{n-y_g}, \quad g = E, S.$$

The posterior distribution of $p_g$ is given by
$$p_g \mid y_g \sim \mathrm{Beta}(a_g + y_g,\; b_g + n - y_g), \tag{3}$$
for which the density function is denoted by $f(p_g \mid y_g)$. Let $\eta$ be a prespecified probability cutoff. We declare treatment superiority if the posterior probability of $p_E$ greater than $p_S$ exceeds the threshold $\eta$; that is, $\Pr(H_1 \mid y_E, y_S) = \Pr(p_E > p_S \mid y_E, y_S) \ge \eta = 1 - \alpha$, where
$$\Pr(p_E > p_S \mid y_E, y_S) = \int_0^1 \int_{p_S}^1 f(p_E \mid y_E)\, f(p_S \mid y_S)\, dp_E\, dp_S.$$
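The double integral is straightforward to evaluate numerically. Below is a minimal sketch in Python (the function name `pr_superiority`, the example counts, and the use of scipy's quadrature are our choices, not the authors' code): conditioning on $p_S$ reduces the double integral to a one-dimensional integral of the posterior density of $p_S$ against the upper tail of the posterior of $p_E$.

```python
from scipy import stats
from scipy.integrate import quad

def pr_superiority(y_e, y_s, n, a=0.5, b=0.5):
    """Pr(p_E > p_S | y_E, y_S) under independent Beta(a, b) priors,
    computed as the integral of f(p_S | y_S) * Pr(p_E > p_S | y_E)."""
    post_e = stats.beta(a + y_e, b + n - y_e)   # posterior (3) for arm E
    post_s = stats.beta(a + y_s, b + n - y_s)   # posterior (3) for arm S
    val, _ = quad(lambda p: post_s.pdf(p) * post_e.sf(p), 0, 1)
    return val

# Hypothetical counts: declare superiority if the result is at least eta.
print(pr_superiority(30, 20, 100))   # roughly 0.95
```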

For one-sided tests with binary data, the asymptotic equivalence between the posterior probability of the null ($\mathrm{PoP}_1$) and the p-value can be derived from the theoretical results in Dudley and Haughton (Citation2002). Controlling the posterior probability $\Pr(H_0 \mid y_E, y_S) \le \alpha$ keeps the frequentist Type I error rate below $\alpha$. Thus, we can set $\eta = 1 - \alpha$ to maintain the frequentist Type I error rate at $\alpha$.

The Type I and Type II error rates under the frequentist design are, respectively,
$$\Pr(\text{Reject } H_0 \mid H_0) = \sum_{y_E=0}^{n} \sum_{y_S=0}^{n} P(y_E \mid p_E = p_S)\, P(y_S \mid p_S)\, I(Z \ge z_\alpha),$$
$$\Pr(\text{Accept } H_0 \mid H_1) = \sum_{y_E=0}^{n} \sum_{y_S=0}^{n} P(y_E \mid p_E = p_S + \delta)\, P(y_S \mid p_S)\, I(Z < z_\alpha),$$
where $\delta$ is the desired treatment difference and $I(\cdot)$ is the indicator function. The corresponding error rates under the Bayesian design are obtained by replacing $Z \ge z_\alpha$ with $\Pr(p_E > p_S \mid y_E, y_S) \ge 1 - \alpha$ and $Z < z_\alpha$ with $\Pr(p_E > p_S \mid y_E, y_S) < 1 - \alpha$ inside the indicator functions.
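As an illustration of these formulas, the following sketch (our own illustrative code, not the authors') enumerates all $(y_E, y_S)$ pairs to compute the exact rejection rate of both tests, giving the Type I error rate when $\delta = 0$ and the power when $\delta > 0$. The Bayesian posterior probability is approximated by Monte Carlo, and ties or zero standard errors are treated as non-rejections.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def pop1(ye, ys, n, a=0.5, b=0.5, draws=4000):
    """Monte Carlo estimate of Pr(p_E > p_S | data) under Beta(a, b) priors."""
    pe = rng.beta(a + ye, b + n - ye, draws)
    ps = rng.beta(a + ys, b + n - ys, draws)
    return np.mean(pe > ps)

def rejection_rate(n, ps_true, delta, alpha=0.10):
    """Exact enumeration: Type I error rate if delta = 0, power if delta > 0."""
    z_alpha = stats.norm.ppf(1 - alpha)
    y = np.arange(n + 1)
    pmf_e = stats.binom.pmf(y, n, ps_true + delta)
    pmf_s = stats.binom.pmf(y, n, ps_true)
    freq_reject = bayes_reject = 0.0
    for ye in y:
        for ys in y:
            joint = pmf_e[ye] * pmf_s[ys]
            pe_hat, ps_hat = ye / n, ys / n
            se = np.sqrt((pe_hat*(1 - pe_hat) + ps_hat*(1 - ps_hat)) / n)
            if se > 0 and (pe_hat - ps_hat) / se >= z_alpha:
                freq_reject += joint
            if pop1(ye, ys, n) >= 1 - alpha:
                bayes_reject += joint
    return freq_reject, bayes_reject

# Type I error at p_E = p_S = 0.2 with n = 50 per arm (kept small for speed):
print(rejection_rate(50, 0.20, 0.0))
```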

As a numerical illustration, we set the Type I error rate at 10% and 5% with target power of 80% and 90% when $(p_S, p_E) = (0.2, 0.3)$ and $(p_S, p_E) = (0.2, 0.35)$, respectively. To achieve the desired power, the required sample size per arm is
$$n = (z_\alpha + z_\beta)^2\, \delta^{-2}\, \{p_E(1-p_E) + p_S(1-p_S)\},$$
where $\delta = p_E - p_S$. Under the Bayesian design, we assume noninformative prior distributions (e.g., Jeffreys' prior), $p_S \sim \mathrm{Beta}(0.5, 0.5)$ and $p_E \sim \mathrm{Beta}(0.5, 0.5)$. For comparison, we compute the Type I error rate and power for both the Bayesian test with $\eta = 1 - \alpha$ and the frequentist Z-test with critical value $z_\alpha$. As shown in Figure 1, both designs maintain the Type I error rate at the nominal level and the power at the target level. It is worth noting that because the endpoints are binary and the trial outcomes are discrete, exact calibration of the empirical Type I error rate to the nominal level is not possible, particularly when the sample size is small. When we adopt a larger sample size by setting the Type I error rate to 5% and the target power to 90%, the empirical Type I error rate is closer to the nominal level. Due to the discreteness of the Type I error rate formulation, near the boundary points, for example, where $p_E = p_S$ is close to 0 or 1, the Type I error rate might be subject to inflation.
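As a quick arithmetic check of the sample-size formula under the first configuration (a worked example we add here), with $\alpha = 0.10$, $\beta = 0.20$, and $(p_S, p_E) = (0.2, 0.3)$, we have $z_{0.10} = 1.2816$ and $z_{0.20} = 0.8416$, so
$$n = \frac{(1.2816 + 0.8416)^2}{0.1^2}\,\{0.3 \times 0.7 + 0.2 \times 0.8\} \approx \frac{4.508 \times 0.37}{0.01} \approx 166.8,$$
which rounds up to 167 patients per arm.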

Fig. 1 Comparison of the Type I error rate and power under the frequentist Z-test and Bayesian test based on the posterior probability for detecting treatment difference δ=0.1 (left) and δ=0.15 (right).


3 Hypothesis Test for Binary Data

Following the motivating example in the previous section, the frequentist Z-test for two proportions in (2) yields the one-sided p-value
$$\text{p-value}_1 = 1 - \Phi(Z),$$
where $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution. At the significance level $\alpha$, we reject the null hypothesis if the p-value is smaller than or equal to $\alpha$. In the Bayesian paradigm, based on the posterior distribution in (3), we reject the null hypothesis if the posterior probability of $H_0: p_E \le p_S$ is smaller than or equal to $\alpha$; that is,
$$\mathrm{PoP}_1 = \Pr(p_E \le p_S \mid y_E, y_S) \le \alpha.$$
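For a single hypothetical dataset, the two quantities can be compared directly; a minimal sketch (the counts, seed, and Monte Carlo size are our choices):

```python
import numpy as np
from scipy import stats

n, y_e, y_s = 100, 30, 20
pe_hat, ps_hat = y_e / n, y_s / n
z = (pe_hat - ps_hat) / np.sqrt((pe_hat*(1 - pe_hat) + ps_hat*(1 - ps_hat)) / n)
p_value_1 = 1 - stats.norm.cdf(z)                 # frequentist one-sided p-value

rng = np.random.default_rng(1)
pe = rng.beta(0.5 + y_e, 0.5 + n - y_e, 500_000)  # Jeffreys posterior draws
ps = rng.beta(0.5 + y_s, 0.5 + n - y_s, 500_000)
pop_1 = np.mean(pe <= ps)                         # PoP1 = Pr(p_E <= p_S | data)

print(round(p_value_1, 4), round(pop_1, 4))       # both close to 0.05
```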

In a numerical study, we set $n$ = 20, 50, 100, and 500, and enumerate all integers between 2 and $n-2$ as the values of $y_E$ and $y_S$ (the extreme cases with 0, 1, $n-1$, and $n$ are excluded because the p-values cannot be estimated well by the normal approximation). We take Jeffreys' prior for $p_E$ and $p_S$, that is, $p_E, p_S \sim \mathrm{Beta}(0.5, 0.5)$, a well-known noninformative prior distribution. For each configuration, we compute the posterior probability of the null hypothesis $\Pr(H_0 \mid y_E, y_S)$ and the p-value. As shown in Figure 2, all the paired values lie very close to the straight line $y = x$, indicating the equivalence between the p-value and the posterior probability of the null. Figure 3 shows the histograms of differences between p-values and posterior probabilities $\Pr(H_0 \mid y_E, y_S)$ under sample sizes of 20, 50, 100, and 500, respectively. As the sample size increases, the distribution of the differences becomes more concentrated around 0, further corroborating the asymptotic equivalence relationship.

Fig. 2 The relationship between p-value and the posterior probability of the null in one-sided hypothesis tests with binary outcomes under sample sizes of 20, 50, 100, and 500 per arm, respectively.


Fig. 3 Histograms of the differences between p-values and posterior probabilities of the null over 1000 replications in one-sided hypothesis tests with binary outcomes under sample sizes of 20, 50, 100, and 500, respectively.


For two-sided hypothesis tests, we are interested in examining whether there is any difference in the treatment effect between the experimental drug and the standard drug,
$$H_0: p_E = p_S \quad \text{versus} \quad H_1: p_E \ne p_S,$$
for which the two-sided p-value is given by
$$\text{p-value}_2 = 2\{1 - \Phi(|Z|)\} = 2[1 - \max\{\Phi(Z), \Phi(-Z)\}].$$

It is worth emphasizing that under the frequentist paradigm, the two-sided test can be viewed as a combination of two one-sided tests along opposite directions. Therefore, to construct an equivalent counterpart under the Bayesian paradigm, we may regard the problem as two opposite one-sided Bayesian tests and compute the posterior probabilities of the two opposite hypotheses. This approach to Bayesian hypothesis testing differs from those commonly adopted in the literature, where a prior probability mass is imposed on the point null (see, e.g., Berger and Delampady Citation1987; Berger and Sellke Citation1987; Berger Citation2003). If we define the two-sided posterior probability ($\mathrm{PoP}_2$) as
$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(p_E > p_S \mid y_E, y_S),\, \Pr(p_E < p_S \mid y_E, y_S)\}], \tag{4}$$
then its equivalence relationship with the p-value is similar to that under one-sided hypothesis testing, as shown in Figure 4.

Fig. 4 The relationship between p-value and the transformation of the posterior probabilities of the hypotheses in two-sided tests with binary outcomes under sample sizes of 20, 50, 100, and 500, respectively.


The $\mathrm{PoP}_2$ in (4) is a transformation of the posterior probabilities of two opposite events, and it does not correspond to a single event of interest. However, a sensible interpretation of $\mathrm{PoP}_2$ under the Bayesian framework emerges from a simple twist of the definition. Consider a Bayesian test with a point null hypothesis, $H_0: p_E = p_S$ versus $H_1: p_E \ne p_S$. If the null $H_0: p_E = p_S$ is not true, then either $p_E > p_S$ or $p_E < p_S$ is true. In the Bayesian framework, it is natural to compute $\Pr(p_E > p_S \mid y_E, y_S)$ and $\Pr(p_E < p_S \mid y_E, y_S)$, and define
$$\mathrm{PoP}_2^* = \max\{\Pr(p_E > p_S \mid y_E, y_S),\, \Pr(p_E < p_S \mid y_E, y_S)\},$$
which is compared with a threshold $c_T$ for decision making: if $\mathrm{PoP}_2^*$ is large enough, we reject the null hypothesis. Comparing with (4), it is easy to show that the decision rule $\mathrm{PoP}_2^* \ge c_T = 1 - \alpha/2$ is equivalent to $\mathrm{PoP}_2 \le \alpha$, either of which leads to rejection of the null hypothesis. In this case, $\mathrm{PoP}_2$ behaves exactly like the p-value in hypothesis testing. That is, by taking $c_T = 1 - \alpha/2$ and comparing the larger of $\Pr(p_E > p_S \mid y_E, y_S)$ and $\Pr(p_E < p_S \mid y_E, y_S)$ with $c_T$, we can construct a Bayesian test that has an exact correspondence to the frequentist test.
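The decision-rule equivalence above is easy to verify numerically; a sketch follows (the function name and settings are ours), using the fact that $\Pr(p_E < p_S \mid \text{data}) = 1 - \Pr(p_E > p_S \mid \text{data})$ for continuous posteriors:

```python
import numpy as np

def two_sided_test(y_e, y_s, n, alpha=0.05, a=0.5, b=0.5, draws=500_000, seed=2):
    rng = np.random.default_rng(seed)
    pe = rng.beta(a + y_e, b + n - y_e, draws)
    ps = rng.beta(a + y_s, b + n - y_s, draws)
    pr_gt = np.mean(pe > ps)            # Pr(p_E > p_S | data)
    pop2_star = max(pr_gt, 1 - pr_gt)   # larger of the two opposite events
    pop2 = 2 * (1 - pop2_star)          # definition (4)
    # The two rejection rules coincide: PoP2* >= 1 - alpha/2  <=>  PoP2 <= alpha.
    assert (pop2_star >= 1 - alpha / 2) == (pop2 <= alpha)
    return pop2

print(two_sided_test(32, 18, 100))      # small PoP2 -> reject the point null
```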

More importantly, the definition of PoP2 is uniquely distinct from the traditional Bayesian hypothesis testing of the two-sided test where H0 is a point null. Under the traditional Bayesian method, a point mass is typically assigned on the prior distribution of H0, while various approaches to defining the prior density under H1 have been discussed (Casella and Berger Citation1987; Berger and Delampady Citation1987; Berger and Sellke Citation1987). Using such an approach, it is difficult to reconcile the classic p-value and the posterior probability. In contrast, instead of assigning a prior probability mass on the point null H0, our approach to a two-sided hypothesis test takes the maximum of posterior probabilities of two opposite events, under a continuous prior distribution with no probability mass assigned on the point null.

The equivalence of the p-value and the posterior probability in the case of binary outcomes can be established by applying the Bayesian central limit theorem. For large sample sizes, the posterior distributions of $p_E$ and $p_S$ can be approximated as
$$p_g \mid y_g \sim N(\hat{p}_g,\; \hat{p}_g(1-\hat{p}_g)/n), \quad g = E, S.$$

As $y_E$ and $y_S$ are independent, the posterior distribution of $p_E - p_S$ can be derived as
$$p_E - p_S \mid y_E, y_S \sim N(\hat{p}_E - \hat{p}_S,\; \{\hat{p}_E(1-\hat{p}_E) + \hat{p}_S(1-\hat{p}_S)\}/n).$$

Therefore, the posterior probability of $H_0: p_E \le p_S$ is
$$\mathrm{PoP}_1 = \Pr(p_E \le p_S \mid y_E, y_S) \approx \Phi\!\left(-\frac{\hat{p}_E - \hat{p}_S}{[\{\hat{p}_E(1-\hat{p}_E) + \hat{p}_S(1-\hat{p}_S)\}/n]^{1/2}}\right) = \Phi(-Z),$$
which is equivalent to $\text{p-value}_1 = 1 - \Phi(Z) = \Phi(-Z)$. The equivalence relationship for a two-sided test can be derived along similar lines.

More generally, the equivalence relationship between the posterior probability and the p-value can be derived from the theoretical results in Dudley and Haughton (Citation2002), who studied the asymptotic normality of posterior probabilities of half-spaces. More specifically, a half-space $H$ is a set satisfying a linear inequality, $H = \{\theta : a^{\mathrm T}\theta \le b\}$, where $\theta \in \mathbb{R}^d$ is the parameter of interest, $a \in \mathbb{R}^d$, and $b$ is a scalar. Let $\Delta$ denote the log likelihood ratio statistic between the unrestricted maximum likelihood estimator (MLE) over the entire parameter space and the MLE restricted to the boundary hyperplane of $H$, $\partial H = \{\theta : a^{\mathrm T}\theta = b\}$. Dudley and Haughton (Citation2002) proved that under certain regularity conditions, the posterior probability of a half-space converges to the standard normal CDF transformation of the likelihood ratio test statistic, $\Phi(\sqrt{2\Delta})$. In our case, the half-space in the hypothesis testing context is $\{(p_E, p_S) : p_S - p_E \ge 0\}$ for a two-arm trial with binary endpoints. Based on the arguments in Dudley and Haughton (Citation2002), it can be shown that the posterior probability of the null is asymptotically equivalent to the p-value from the likelihood ratio test.
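The approximation can be checked numerically for the binary-endpoint example. In the sketch below (our code, not the authors'), $\Delta$ compares the unrestricted binomial MLEs against the pooled MLE on the boundary $p_E = p_S$; the sign adjustment by $\mathrm{sign}(\hat{p}_E - \hat{p}_S)$ is our reading of the half-space orientation.

```python
import numpy as np
from scipy import stats

def loglik(y, n, p):
    """Binomial log likelihood up to a constant."""
    return y * np.log(p) + (n - y) * np.log(1 - p)

n, ye, ys = 100, 30, 20
pe_hat, ps_hat, pooled = ye / n, ys / n, (ye + ys) / (2 * n)
delta = (loglik(ye, n, pe_hat) + loglik(ys, n, ps_hat)
         - loglik(ye, n, pooled) - loglik(ys, n, pooled))
# Dudley-Haughton approximation to Pr(p_E <= p_S | data):
approx_pop1 = stats.norm.cdf(-np.sign(pe_hat - ps_hat) * np.sqrt(2 * delta))

# Monte Carlo posterior probability under Jeffreys priors, for comparison:
rng = np.random.default_rng(3)
pe = rng.beta(0.5 + ye, 0.5 + n - ye, 500_000)
ps = rng.beta(0.5 + ys, 0.5 + n - ys, 500_000)
print(approx_pop1, np.mean(pe <= ps))   # the two values nearly agree
```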

4 Hypothesis Test for Normal Data

4.1 Hypothesis Test With Known Variance

In a two-arm randomized clinical trial with normal endpoints, we are interested in comparing the mean outcomes of the experimental and standard arms. Let $n$ denote the sample size per arm, and let $y_{Ei}$ and $y_{Si}$, $i = 1, \ldots, n$, denote the observations in the experimental and standard arms, respectively. We assume that $y_{Ei}$ and $y_{Si}$ are independent, with $y_{Ei} \sim N(\mu_E, \sigma^2)$ and $y_{Si} \sim N(\mu_S, \sigma^2)$, where the means $\mu_E$ and $\mu_S$ are unknown and the variance is known, $\sigma^2 = 1$, for simplicity. Let $\bar{y}_E = \sum_{i=1}^n y_{Ei}/n$ and $\bar{y}_S = \sum_{i=1}^n y_{Si}/n$ denote the sample means, and let $\theta = \mu_E - \mu_S$ and $\hat{\theta} = \bar{y}_E - \bar{y}_S$ denote the true and observed treatment differences, respectively.

4.1.1 Exact Equivalence

Considering a one-sided hypothesis test,
$$H_0: \theta \le 0 \quad \text{versus} \quad H_1: \theta > 0, \tag{5}$$
the frequentist test statistic is $\hat{\theta}\sqrt{n/2}$, which follows the standard normal distribution under the null hypothesis. Therefore, the p-value under the one-sided hypothesis test is given by
$$\text{p-value}_1 = \Pr(U \ge \hat{\theta}\sqrt{n/2} \mid H_0) = 1 - \Phi(\hat{\theta}\sqrt{n/2}), \tag{6}$$
where $U$ denotes a standard normal random variable.

Let $D$ denote the observed values of $y_{Ei}$ and $y_{Si}$, $i = 1, \ldots, n$. In the Bayesian paradigm, if we assume an improper flat prior distribution, $p(\theta) \propto 1$, then the posterior distribution of $\theta$ is
$$\theta \mid D \sim N(\hat{\theta},\; 2/n).$$

Therefore, the posterior probability of the null is
$$\mathrm{PoP}_1 = \Pr(\theta \le 0 \mid D) = 1 - \Phi(\hat{\theta}\sqrt{n/2}),$$
which is exactly the same as (6). Under such an improper prior distribution of $\theta$, we can establish an exact equivalence relationship between the p-value and $\Pr(H_0 \mid D)$.
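The exact identity is immediate to confirm numerically; a small sketch (the simulated data and seed are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 50
y_e = rng.normal(0.2, 1.0, n)    # sigma^2 = 1, known
y_s = rng.normal(0.0, 1.0, n)
theta_hat = y_e.mean() - y_s.mean()

p_value_1 = 1 - stats.norm.cdf(theta_hat * np.sqrt(n / 2))      # equation (6)
pop_1 = stats.norm.cdf(0, loc=theta_hat, scale=np.sqrt(2 / n))  # Pr(theta <= 0 | D), flat prior
print(p_value_1, pop_1)          # identical up to floating-point error
```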

Under a two-sided hypothesis test, $H_0: \theta = 0$ versus $H_1: \theta \ne 0$, the p-value is given by
$$\text{p-value}_2 = 2\left[1 - \max\{\Pr(U > \hat{\theta}\sqrt{n/2} \mid H_0),\, \Pr(U < \hat{\theta}\sqrt{n/2} \mid H_0)\}\right] = 2 - 2\max\{\Phi(\hat{\theta}\sqrt{n/2}),\, \Phi(-\hat{\theta}\sqrt{n/2})\}. \tag{7}$$

Correspondingly, the two-sided posterior probability is defined as
$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(\theta < 0 \mid D),\, \Pr(\theta > 0 \mid D)\}] = 2 - 2\max\{\Phi(\hat{\theta}\sqrt{n/2}),\, \Phi(-\hat{\theta}\sqrt{n/2})\},$$
which is exactly the same as $\text{p-value}_2$ in (7).

4.1.2 Asymptotic Equivalence

If we assume a normal prior distribution, $\theta \sim N(0, \sigma_0^2)$, then the posterior distribution of $\theta$ is $\theta \mid D \sim N(\tilde{\mu}, \tilde{\sigma}^2)$, where
$$\tilde{\mu} = \frac{n\hat{\theta}\sigma_0^2}{n\sigma_0^2 + 2}, \qquad \tilde{\sigma}^2 = \frac{2\sigma_0^2}{n\sigma_0^2 + 2}.$$

The posterior probability of the null under the one-sided hypothesis test in (5) is
$$\mathrm{PoP}_1 = \Pr(\theta \le 0 \mid D) = 1 - \Phi(\tilde{\mu}/\tilde{\sigma}) = 1 - \Phi\!\left(\frac{(n/2)\hat{\theta}}{\sqrt{n/2 + 1/\sigma_0^2}}\right).$$

Therefore, it is evident that as $\sigma_0 \to \infty$ (i.e., under noninformative priors), the posterior probability of the null converges to the p-value, that is, $\text{p-value}_1 = \lim_{\sigma_0^2 \to \infty} \Pr(\theta \le 0 \mid D)$. For two-sided hypothesis tests, asymptotic equivalence can be derived along similar lines. Moreover, it is worth noting that the same asymptotic equivalence holds as the sample size $n$ goes to infinity, in which case the prior has a negligible effect on the posterior and both $\text{p-value}_1$ and $\mathrm{PoP}_1$ converge to 0. For one-sided hypothesis testing problems, Casella and Berger (Citation1987) provided theoretical results reconciling the p-value and Bayesian posterior probability for symmetric distributions with a monotone likelihood ratio. The results under normal endpoints can be regarded as corroboration of the theoretical findings in Casella and Berger (Citation1987).
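A short sketch (the values of $\hat{\theta}$ and $n$ and the grid of prior variances are our choices) illustrates the convergence of $\mathrm{PoP}_1$ to $\text{p-value}_1$ as $\sigma_0^2$ grows:

```python
import numpy as np
from scipy import stats

n, theta_hat = 50, 0.25
p_value_1 = 1 - stats.norm.cdf(theta_hat * np.sqrt(n / 2))

for s0_sq in [0.1, 1.0, 10.0, 1000.0]:           # prior variance sigma_0^2
    mu_t = n * theta_hat * s0_sq / (n * s0_sq + 2)
    sd_t = np.sqrt(2 * s0_sq / (n * s0_sq + 2))
    pop_1 = 1 - stats.norm.cdf(mu_t / sd_t)      # Pr(theta <= 0 | D)
    print(s0_sq, pop_1, p_value_1)               # PoP1 -> p-value as s0_sq grows
```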

4.2 Hypothesis Test With Unknown Variance

In a more general setting, we consider the case where $\mu_E$, $\mu_S$, and $\sigma$ are all unknown parameters. For simplicity, we define $x_i = y_{Ei} - y_{Si}$, which follows the normal distribution $N(\theta, 2\sigma^2)$ under the independence assumption on $y_{Ei}$ and $y_{Si}$. As in a matched-pair study, the problem reduces to a one-sample test for ease of exposition. In the frequentist paradigm, Student's t-test statistic is
$$T = \frac{\hat{\theta}}{\sqrt{\sum_{i=1}^n (x_i - \hat{\theta})^2 / \{(n-1)n\}}},$$
where $\hat{\theta} = \bar{x} = \bar{y}_E - \bar{y}_S$. The p-value under the one-sided hypothesis test (5) is $\text{p-value}_1 = 1 - F_{t_{n-1}}(T)$, where $F_{t_{n-1}}(\cdot)$ denotes the CDF of Student's t distribution with $n - 1$ degrees of freedom.

In the Bayesian paradigm, for notational simplicity, we let $\nu = 2\sigma^2$ and model the joint posterior distribution of $\theta$ and $\nu$. Under Jeffreys' prior for $\theta$ and $\nu$, $f(\theta, \nu) \propto \nu^{-3/2}$, the corresponding posterior distribution is
$$f(\theta, \nu \mid D) \propto \nu^{-(n+3)/2} \exp\left\{-\frac{\sum_{i=1}^n (x_i - \hat{\theta})^2 + n(\hat{\theta} - \theta)^2}{2\nu}\right\},$$
which matches the normal-inverse-chi-square distribution,
$$(\theta, \nu) \mid D \sim \text{N-Inv-}\chi^2\!\left(\hat{\theta},\; n;\; n,\; \sum_{i=1}^n (x_i - \hat{\theta})^2 / n\right).$$

As a result, the one-sided posterior probability of the null hypothesis is $\mathrm{PoP}_1 = \Pr(\theta \le 0 \mid D)$.

To study the influence of prior distributions, we also consider a normal-inverse-gamma prior distribution for $\theta$ and $\nu$, $(\theta, \nu) \sim \mathrm{NIG}(\theta_0, \nu_0, \alpha, \beta)$.

The corresponding probability density function (PDF) can be written as the product of a normal density and an inverse-gamma density,
$$f(\theta, \nu) = f_N(\theta \mid \theta_0, \nu/\nu_0)\, f_{IG}(\nu \mid \alpha, \beta) = \sqrt{\frac{\nu_0}{2\pi\nu}}\, \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\nu}\right)^{\alpha+1} \exp\left(-\frac{2\beta + \nu_0(\theta - \theta_0)^2}{2\nu}\right),$$
where $f_N(\cdot \mid \theta_0, \nu/\nu_0)$ denotes the PDF of a normal distribution with mean $\theta_0$ and variance $\nu/\nu_0$, and $f_{IG}(\cdot \mid \alpha, \beta)$ denotes that of an inverse-gamma distribution with parameters $\alpha$ and $\beta$. By conjugacy, the corresponding posterior distribution is also normal-inverse-gamma; that is,
$$(\theta, \nu) \mid D \sim \mathrm{NIG}\left(\frac{\theta_0\nu_0 + n\hat{\theta}}{\nu_0 + n},\; \nu_0 + n,\; \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^n (x_i - \hat{\theta})^2 + \frac{n\nu_0}{\nu_0 + n}\,\frac{(\hat{\theta} - \theta_0)^2}{2}\right).$$
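Sampling from the conjugate posterior gives $\mathrm{PoP}_1$ directly. The sketch below (our code; hyperparameter defaults taken from the noninformative setting used in Section 4.3) draws $\nu$ from its inverse-gamma marginal, then $\theta \mid \nu$ from its normal conditional, and compares the result with the t-test p-value:

```python
import numpy as np
from scipy import stats

def pop1_unknown_var(x, theta0=0.0, nu0=100.0, alpha=0.01, beta=0.01,
                     draws=200_000, seed=5):
    """Pr(theta <= 0 | D) under the NIG(theta0, nu0, alpha, beta) prior,
    sampling from the conjugate NIG posterior derived above."""
    n, xbar = len(x), np.mean(x)
    ss = np.sum((x - xbar) ** 2)
    theta_n = (nu0 * theta0 + n * xbar) / (nu0 + n)
    nu_n = nu0 + n
    alpha_n = alpha + n / 2
    beta_n = beta + ss / 2 + n * nu0 / (nu0 + n) * (xbar - theta0) ** 2 / 2
    rng = np.random.default_rng(seed)
    nu = stats.invgamma.rvs(alpha_n, scale=beta_n, size=draws, random_state=rng)
    theta = rng.normal(theta_n, np.sqrt(nu / nu_n))
    return np.mean(theta <= 0)

rng = np.random.default_rng(6)
x = rng.normal(0.3, np.sqrt(2.0), 50)             # x_i = y_Ei - y_Si
t_stat = np.mean(x) / np.sqrt(np.sum((x - np.mean(x))**2) / (49 * 50))
print(1 - stats.t.cdf(t_stat, df=49), pop1_unknown_var(x))   # nearly equal
```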

For a two-sided hypothesis test, the p-value is
$$\text{p-value}_2 = 2 - 2F_{t_{n-1}}(|T|) = 2[1 - \max\{F_{t_{n-1}}(T),\, F_{t_{n-1}}(-T)\}].$$

Similarly, we can define the two-sided posterior probability as
$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(\theta > 0 \mid D),\, \Pr(\theta < 0 \mid D)\}].$$

4.3 Numerical Studies

As a numerical illustration, we simulate 1000 data replications, and for each replication we compute the p-value and the posterior probability of the null. We consider both Jeffreys' prior and the normal-inverse-gamma prior. The data are generated from normal distributions, that is, $x_i \sim N(\theta, \nu)$. To ensure that the p-values from the simulations cover the entire range of (0, 1), we generate values of $\theta$ from $N(0, 0.05)$ and $\nu$ from $N(1, 0.05)$ truncated above zero. To construct a noninformative normal-inverse-gamma prior distribution, we take $\theta_0 = 0$, $\nu_0 = 100$, and $\alpha = \beta = 0.01$. Under Jeffreys' prior and the noninformative normal-inverse-gamma prior distributions, Figure 5 shows the equivalence relationship between p-values and posterior probabilities of the null under both one- and two-sided tests with sample sizes of 20, 50, and 100, respectively.
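A single replication of this study might look as follows (our sketch; we read $N(0, 0.05)$ and $N(1, 0.05)$ as specifying variances, and `pop1_unknown_var` refers to the sketch in Section 4.2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 50
theta = rng.normal(0.0, np.sqrt(0.05))                 # theta ~ N(0, 0.05)
nu = stats.truncnorm.rvs(-1 / np.sqrt(0.05), np.inf,   # nu ~ N(1, 0.05), truncated at 0
                         loc=1.0, scale=np.sqrt(0.05), random_state=rng)
x = rng.normal(theta, np.sqrt(nu), n)                  # x_i ~ N(theta, nu)

t_stat = np.mean(x) / np.sqrt(np.sum((x - np.mean(x))**2) / ((n - 1) * n))
p_value_1 = 1 - stats.t.cdf(t_stat, df=n - 1)
# Pair p_value_1 with the posterior probability, e.g., pop1_unknown_var(x),
# and repeat 1000 times to reproduce one panel of Figure 5.
```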

Fig. 5 The relationship between p-value and the posterior probability over 1000 replications under one-sided and two-sided hypothesis tests with normal outcomes assuming Jeffreys’ prior and the non-informative normal-inverse-gamma prior under sample sizes of 20, 50, and 100, respectively.


In addition, we conduct sensitivity analyses to explore different data-generating distributions and informative priors. In particular, we generate $x_i$ from $\mathrm{Gamma}(2, 0.5)$, $\mathrm{Beta}(0.5, 0.5)$, and an equal-weight mixture of the normal distributions $N(-1, 1)$ and $N(1, 1)$. To allow the p-values to cover the entire range of (0, 1), the simulated values of $x_i$ are centered by subtracting the mean of the corresponding distribution. Under Jeffreys' prior, Figure 6 again exhibits the equivalence relationship between p-values and posterior probabilities of the null under one-sided tests with sample sizes of 20 and 50, respectively.

Fig. 6 The relationship between p-value and the posterior probability of the null over 1000 replications under one-sided hypothesis tests with outcomes generated from Gamma, Beta, and mixture normal distributions, assuming Jeffreys’ prior for the mean and variance parameters of normal distributions under sample sizes of 20 and 50, respectively.


To study the effect of an informative prior and the sample size on the relationship between the p-value and the posterior probability, we assign an informative prior distribution on $\theta$ by setting $\theta_0 = \theta + 0.01$, $\nu_0 = 0.01$ (a small prior variance), and $\alpha = \beta = 0.01$. The left panel of Figure 7 shows that under such an informative prior distribution the equivalence relationship between p-values and posterior probabilities of the null is lost, while it is gradually regained as the sample size increases. Moreover, we consider the case where the sample size is fixed at 1000 but the prior variance is increased by setting $\nu_0$ from 0.001 to 1, keeping $\theta_0 = \theta + 0.01$. The right panel of Figure 7 shows that as the prior distribution becomes less informative, the equivalence relationship becomes more evident. This is as expected: the p-value is obtained from the observed data alone without borrowing any prior information, so noninformative priors should be used to compute the posterior probability for a fair comparison.

Fig. 7 The relationship between p-value and the posterior probability of the null over 1000 replications under one-sided hypothesis tests with normal outcomes; left panel: assuming a fixed informative normal-inverse-gamma prior under increasing sample sizes of 1000, 10,000, and 100,000 (from top to bottom), right panel: assuming a fixed sample size of 1000 with an increasing prior variance of 0.001, 0.01, and 1 (from top to bottom).


5 Discussion

Berger and Sellke (Citation1987) studied the point null for two-sided hypothesis tests, and noted discrepancies between the frequentist test and the Bayesian test based on the posterior probability. The major difference between their work and ours lies in the specification of the prior distribution. Berger and Sellke (Citation1987) assumed a point-mass prior distribution at the point null hypothesis, which violates the regularity condition of continuity in Dudley and Haughton (Citation2002) and thus leads to the discrepancy between the posterior probability and the p-value. An underlying condition for our established equivalence is that the union of the parameter supports under the null and the alternative is the natural whole parameter space; for example, the natural whole space for a normal mean parameter is the real line, that for a probability parameter is (0, 1), and that for a variance parameter is the positive half-line $(0, \infty)$.

Casella and Berger (Citation1987) provided theoretical results attempting to reconcile the p-value and the Bayesian posterior probability in one-sided hypothesis testing problems. In particular, they showed that for certain distributional families the infimum of the Bayesian posterior probability can be reconciled with the p-value. Our established equivalence between the p-value and the Bayesian posterior probability for normal endpoints can be regarded as further corroboration of their theoretical results. Furthermore, we demonstrate a similar equivalence relationship for binary endpoints, which was not discussed in Casella and Berger (Citation1987). More importantly, for two-sided hypothesis tests, we establish the notion of the "two-sided posterior probability" by recasting the problem as a combination of two one-sided hypotheses along opposite directions, which reconnects with the two-sided p-value.

Acknowledgments

We thank the editor Professor Daniel R. Jeske, associate editor, and three referees, for their many constructive and insightful comments that have led to significant improvements in the article. We also thank Chenyang Zhang and Jiaqi Gu for helpful discussions.

Additional information

Funding

The research was supported by a grant (grant number 17307318) awarded to Yin by the Research Grants Council of Hong Kong.

References