Discussion Paper and Discussions

Prior-based Bayesian information criterion

Pages 2-13 | Received 24 Jun 2017, Accepted 10 Feb 2019, Published online: 14 Mar 2019

ABSTRACT

We present a new approach to model selection and Bayes factor determination, based on Laplace expansions (as in BIC), which we call the Prior-based Bayes Information Criterion (PBIC). In this approach, the Laplace expansion is only done with the likelihood function, and then a suitable prior distribution is chosen to allow exact computation of the (approximate) marginal likelihood arising from the Laplace approximation and the prior. The result is a closed-form expression similar to BIC, but one that now involves a term arising from the prior distribution (which BIC ignores) and also incorporates the idea that different parameters can have different effective sample sizes (whereas BIC only allows one overall sample size $n$). We also consider a modification of PBIC which is more favourable to complex models.

1. Background

1.1. The original BIC (Schwarz, 1978)

Suppose that we observe $X_i=(X_{i1},\ldots,X_{ip})\sim g(x_i\mid\theta)$ for $i=1,\ldots,n$. Here $\theta=(\theta_1,\ldots,\theta_p)$ is an unknown vector and, in Schwarz's derivation of BIC, $g(x\mid\theta)$ is an exponential family. Then the log-likelihood function is $l(\theta)=\log f(x\mid\theta)=\log\prod_{i=1}^n g(x_i\mid\theta)$, where $x=(x_1,\ldots,x_n)$. The goal of Schwarz (1978) is to find a simple approximation to the marginal density $m(x)=\int f(x\mid\theta)\pi(\theta)\,d\theta$, where $\pi(\theta)$ is a prior density for the unknown $\theta$, and to use the approximation for model comparison.

Result 1.1

Stone, 1979

Let $\hat\theta$ be the MLE of $\theta$. Then, under reasonable conditions and as $n\to\infty$,
\[ \mathrm{BIC}\equiv -2\,l(\hat\theta)+p\log n=-2\log m(x)+c+o(1), \]
where $c$ is a constant.

Schwarz then suggested comparing two models $M_1$ and $M_2$ using $\Delta \mathrm{BIC}=\mathrm{BIC}_2-\mathrm{BIC}_1$,

preferring $M_2$ ($M_1$) as this is negative (positive). Clearly this is equivalent to basing the model comparison on the Bayes factor (odds) of $M_2$ to $M_1$, with the approximation
\[ B_{21}\equiv\frac{m_2(x)}{m_1(x)}=\frac{\exp\big(-\frac12 \mathrm{BIC}_2\big)}{\exp\big(-\frac12 \mathrm{BIC}_1\big)}\exp\Big(\frac12(c_2-c_1)\Big)\times(1+o(1))\approx\frac{\exp\big(-\frac12 \mathrm{BIC}_2\big)}{\exp\big(-\frac12 \mathrm{BIC}_1\big)}. \tag{1} \]
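As a concrete illustration (ours, not from the paper), the following minimal sketch computes BIC for two simple normal-mean models and the resulting approximate Bayes factor from (1); the data, models and all numbers are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # n = 50 observations, known sigma = 1
n = x.size

# M1: X_i ~ N(0, 1); no free parameters, so the p*log(n) penalty vanishes.
loglik1 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(x**2)
bic1 = -2 * loglik1

# M2: X_i ~ N(theta, 1); the mle is the sample mean and p = 1.
theta_hat = x.mean()
loglik2 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - theta_hat) ** 2)
bic2 = -2 * loglik2 + np.log(n)

# Approximate Bayes factor B_21 from (1), ignoring the exp((c2 - c1)/2) term.
B21 = np.exp(-0.5 * (bic2 - bic1))
print(bic1, bic2, B21)
```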

1.2. Problems with general use of BIC

BIC is an excellent tool for the class of problems for which it was developed. Unfortunately, it is today used ubiquitously, for completely different classes of problems. We here outline some of the issues with using BIC inappropriately.

Problem 1. The term $\exp\big(\frac12(c_2-c_1)\big)$ in (1) is ignored by BIC.

This could have been a serious problem even with proper use of BIC, except that there happen to be pseudo-prior distributions that yield BIC itself (Raftery, 1999), i.e. for which the term $\exp\big(\frac12(c_2-c_1)\big)=1$. These pseudo-priors are not real priors, in that they are centred at the mle's of each model, which is a problematical double use of the data. Nevertheless, it is comforting that there is at least some type of prior distribution that yields BIC exactly.

Problem 2. What is $n$?

  1. A common mistake in specifying n: Note that, in Schwarz's setup, there are $n$ vector observations of dimension $p$, so that there are a total of $np$ real observations. It is common to mistakenly use $np$ as the sample size in BIC, rather than the correct $n$.

  2. Different parameters can have different $n$.

Example 1.2

Group means

For $i=1,\ldots,p$ and $l=1,\ldots,r$, suppose we observe $X_{il}=\mu_i+\varepsilon_{il}$, where $\varepsilon_{il}\sim N(0,\sigma^2)$. If $\sigma^2$ were known, this would be exactly the setup of Schwarz, and the sample size for $\mu=(\mu_1,\ldots,\mu_p)$ would be $r$. In effect, each $\mu_i$ has a sample size of $r$ associated with it. But, if $\sigma^2$ is unknown, the parameter is $\theta=(\mu_1,\ldots,\mu_p,\sigma^2)$ and it is not reasonable to also associate the sample size $r$ with $\sigma^2$, in that we know there are $p(r-1)$ degrees of freedom associated with the mle of $\sigma^2$.

An alternative argument is to note that the observed information matrix $\hat I=(\hat I_{jk})$, with $(j,k)$ entry $\hat I_{jk}=-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f(x\mid\theta)\big|_{\theta=\hat\theta}$, is given by
\[ \hat I=\begin{pmatrix}\frac{r}{\hat\sigma^2}I_{p\times p} & 0\\ 0 & \frac{pr}{2\hat\sigma^4}\end{pmatrix},\quad\text{where } \hat\sigma^2=\frac{1}{pr}\sum_{i=1}^p\sum_{l=1}^r(X_{il}-\bar X_i)^2. \]
The information matrix suggests that the effective sample size for each $\mu_i$ is $r$, while the effective sample size for $\sigma^2$ is $pr$. Whether we use $p(r-1)$ or $pr$ for the sample size associated with $\sigma^2$ will not typically make much difference, whereas the difference with using $r$, instead, will be quite large.
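For readers who want to verify the displayed information matrix numerically, here is a minimal sketch (our illustration, not the authors' code): it computes the Hessian of the negative log-likelihood at the mle by central differences and compares its diagonal with the closed-form entries $r/\hat\sigma^2$ and $pr/(2\hat\sigma^4)$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 3, 10
x = rng.normal(size=(p, r)) * 2.0 + np.arange(p)[:, None]  # toy group data

mu_hat = x.mean(axis=1)
sig2_hat = np.mean((x - mu_hat[:, None]) ** 2)
theta_hat = np.append(mu_hat, sig2_hat)        # theta = (mu_1,...,mu_p, sigma^2)

def negloglik(theta):
    mu, s2 = theta[:-1], theta[-1]
    return 0.5 * p * r * np.log(2 * np.pi * s2) + np.sum((x - mu[:, None]) ** 2) / (2 * s2)

def hessian(f, t, h=1e-4):
    # central-difference Hessian of f at t
    k = len(t)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(t + ei + ej) - f(t + ei - ej)
                       - f(t - ei + ej) + f(t - ei - ej)) / (4 * h * h)
    return H

I_hat = hessian(negloglik, theta_hat)
print(np.diag(I_hat))                            # numerical observed information
print(r / sig2_hat, p * r / (2 * sig2_hat**2))   # closed-form diagonal entries
```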

  3. Different observations can have different observed information content.

Example 1.3

Suppose each independent observation, $X_i,\ i=1,\ldots,n$, has probability 1/2 of arising from the $N(\theta,1)$ distribution and probability 1/2 of arising from the $N(\theta,1000)$ distribution. Clearly half the observations are essentially worthless, and the ‘effective sample size’ is $n/2$.

Example 1.4

Findley's BIC counterexample

One of the famous counterexamples against inappropriate use of BIC is in Findley (1991). Suppose the observations are
\[ X_i=\frac{1}{\sqrt{i}}\,\theta+\varepsilon_i,\quad\text{where } \varepsilon_i\sim N(0,1),\ i=1,\ldots,n, \tag{2} \]
and we are comparing the models $H_0:\theta=0$ and $H_1:\theta\neq0$. It turns out that the mle for $\theta$ is consistent under $H_1$ (a necessary condition to apply BIC), but that BIC is inconsistent if $0<|\theta|<1$, in that BIC will then declare $H_0$ to be the true model as $n\to\infty$. The problem here is that, even though the information about $\theta$ goes to $\infty$ as $n$ grows, it grows much more slowly than $n$ (actually, the information grows at roughly a $\log n$ rate), and BIC erroneously assigns the rate to be $n$.
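A small simulation sketch (ours) makes the inconsistency concrete: with $0<|\theta|<1$, $\Delta \mathrm{BIC}=\mathrm{BIC}_1-\mathrm{BIC}_0$ becomes positive for large $n$, so BIC selects $H_0$ even though $\theta\neq0$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.6                                    # true non-zero value, |theta| < 1
for n in [10**3, 10**5, 10**7]:
    i = np.arange(1, n + 1)
    x = theta / np.sqrt(i) + rng.normal(size=n)
    info = np.sum(1.0 / i)                     # information about theta, ~ log n
    theta_hat = np.sum(x / np.sqrt(i)) / info  # mle under H_1
    # -2*(l(theta_hat) - l(0)) = -theta_hat^2 * info; BIC adds log(n) for H_1
    delta_bic = -theta_hat**2 * info + np.log(n)
    print(n, round(delta_bic, 2))              # positive: BIC wrongly picks H_0
```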

Problem 3. What is $p$?

Just as $n$ is often not clearly defined for use in BIC, the parameter dimension $p$ is often not clearly defined (see also Pauler, 1998).

Example 1.5

Random effect group means

Consider hierarchical or random effect versions of the group means problem, where it is assumed that $\mu_i\sim N(\xi,\tau^2)$, with $\xi$ and $\tau^2$ being unknown. The number of parameters might appear to be $p+3$ (the $p$ means, along with $\sigma^2$, $\xi$ and $\tau^2$), but one could, alternatively, integrate out $\mu=(\mu_1,\ldots,\mu_p)$ (since it has a known distribution), obtaining
\[ f(x\mid\sigma^2,\xi,\tau^2)=\int f(x\mid\mu,\xi,\sigma^2)\,\pi(\mu\mid\xi,\tau^2)\,d\mu \propto \frac{1}{\sigma^{p(r-1)}}\exp\Big(-\frac{pr\hat\sigma^2}{2\sigma^2}\Big)\prod_{i=1}^p\Big(\frac{\sigma^2}{r}+\tau^2\Big)^{-1/2}\exp\bigg(-\frac{(\bar x_i-\xi)^2}{2(\sigma^2/r+\tau^2)}\bigg). \]
The marginal likelihood will be the integral of this with respect to a prior $\pi(\sigma^2,\xi,\tau^2)$, so that, if one is really viewing BIC as an approximation to the marginal likelihood, it would be correct to set $p=3$.

Problem 4. What if $p$ grows with $n$?

BIC is based on an asymptotic argument with $p$ fixed and $n$ growing, but often $p$ is growing with $n$; BIC then does not apply. If one were to erroneously apply BIC in such a situation, one could end up with inconsistency, as shown by Stone (1979) for the group means example, with known variance $\sigma^2=1$ for simplicity. Indeed, in comparing the models $H_0:\mu=0$ and $H_1:\mu\neq0$ for the group means problem with $r=2$,
\[ \Delta \mathrm{BIC}=\mathrm{BIC}_1-\mathrm{BIC}_0=-2\sum_{i=1}^p\bar x_i^2+p\log 2, \]
which, under $H_0$, behaves like $p(\log 2-1)$ as $p$ grows, thus incorrectly selecting model $H_1$.
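Again a simulation sketch (ours): under $H_0$, $\Delta \mathrm{BIC}$ drifts to $-\infty$ at the rate $p(\log 2-1)$, so BIC increasingly favours the (false) model $H_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
r = 2
for p in [10, 100, 10000]:
    xbar = rng.normal(scale=np.sqrt(1 / r), size=p)  # xbar_i | H_0 ~ N(0, 1/r)
    delta_bic = -2 * np.sum(xbar**2) + p * np.log(2)
    print(p, round(delta_bic, 1), round(p * (np.log(2) - 1), 1))
```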

1.3. Variants of BIC

Noting the limitations of BIC, researchers have proposed a host of generalisations, many of which perform better than BIC in specific scenarios. Many of these methods arise from retaining different numbers of terms in the Laplace approximation of the Bayes factor (Kass & Raftery, 1995). One variant, called HBIC (Haughton, 1988), retains the third term in the Laplace approximation of the Bayes factor. A simulation study by Haughton, Oud, and Jansen (1997) shows that HBIC performs better in model selection for structural equation models than does the usual BIC. Following HBIC, Bollen, Ray, Zavisca, and Harden (2012) developed a similar criterion, called the information matrix-based Bayesian information criterion (IBIC), which retains more terms in the Bayes factor approximation and outperforms BIC and HBIC in many scenarios. Bollen et al. (2012) also proposed another criterion, named the scaled unit information prior criterion (SPBIC), which generalises the interpretation of the unit information prior in the context of BIC. For approximation of Bayes factors as the model dimension grows, Berger, Ghosh, and Mukhopadhyay (2003) proposed another approximation, named GBIC. Following Berger et al. (2003), a generalisation of BIC for the general exponential family was proposed by Chakrabarti and Ghosh (2006), and a new BIC for change point analysis was proposed by Shen and Ghosh (2011). Other extensions of BIC include techniques for comparing graphical models (Foygel & Drton, 2010), singular models (Drton & Plummer, 2017), and sparse models (Zak-Szatkowska & Bogdan, 2011).

1.4. Overview of the paper

Section 2 presents a proposal to generalise BIC, in order to overcome the problems mentioned above. It is based on the use of a specific (robust) prior distribution in the computation of the approximate marginal likelihood of a model. Section 3 discusses a critical aspect of the definition of PBIC, namely the need to determine the ‘effective sample size’ corresponding to each parameter in a model. Section 4 presents an alternative called PBIC*. It employs an empirical Bayes prior in computation of the marginal likelihood approximation, resulting in answers more favourable to complex models. Section 5 illustrates the use of PBIC and PBIC* in the normal linear model; it is of interest that PBIC and PBIC* correspond to exact marginal likelihoods here. Illustrations in the section are simple linear regression, testing the equality of normal means with known unequal variances, Findley's counterexample, and the group means problem, where consistency results for PBIC and PBIC* are established as $p\to\infty$.

2. The PBIC solution

We propose a solution to these problems that depends only on software that can compute mle's and observed information matrices. The basis of the solution is a modified Laplace approximation to $m(x)$ for reasonable default priors.

2.1. Two important preliminaries

One should analytically integrate out any parameter that has a distribution given other parameters, if it is possible to do so. For example, in the hierarchical group means example, base the analysis on the marginal likelihood $f(x\mid\sigma^2,\xi,\tau^2)$, rather than the full likelihood.

We will be utilising the Laplace approximation, which is most accurate (Kass & Vaidyanathan, 1992; Tierney, Kass, & Kadane, 1989) if the parameter space is transformed to be all of $\mathbb{R}^p$. Transformation to $\mathbb{R}^p$ will also be necessary for the subsequent step of the analysis. As an illustration, in the (non-multilevel) group means example, transform to $\nu=\log\sigma^2$. Then $\theta=(\mu_1,\ldots,\mu_p,\nu)\in\mathbb{R}^{p+1}$. Note that one then works with the transformed mle $\log\hat\sigma^2$ and the transformed observed information matrix
\[ \hat I(\mu,\nu)=\begin{pmatrix}\frac{r}{\hat\sigma^2}I_{p\times p} & 0\\ 0 & \frac{pr}{2}\end{pmatrix}. \]
In the multilevel group means example, both $\sigma^2$ and $\tau^2$ would need to be transformed in this fashion.

2.2. PBIC and PBIC* definitions

Suppose $\theta=(\theta_{(1)},\theta_{(2)})$, where $\theta_{(2)}$ denotes the parameters that are common to all models under consideration (e.g. an intercept in linear regression). Changing notation, let $p$ denote the dimension of $\theta_{(1)}$ and $q$ denote the dimension of $\theta_{(2)}$; note that $p$ will typically vary from model to model, while $q$ is fixed. Partition the observed information matrix for a model accordingly, as
\[ \hat I=\begin{pmatrix}\hat I_{11} & \hat I_{12}\\ \hat I_{21} & \hat I_{22}\end{pmatrix},\quad\text{and define}\quad \Sigma^{-1}=\hat I_{11}-\hat I_{12}\hat I_{22}^{-1}\hat I_{12}^t. \tag{3} \]
(If there are no parameters common to all models, then $\Sigma=\hat I^{-1}$.) Change variables to $\xi=O\theta_{(1)}$, where $O$ is an orthogonal matrix such that $\Sigma=O^tDO$, with $D=\mathrm{diag}\{d_i\}$ for $i=1,\ldots,p$, and define $\hat\xi=O\hat\theta_{(1)}$ (the transformed mle). The choice of $O$ does not affect the definition below. For each transformed parameter $\xi_j$, let $n_j^e$ be the effective sample size corresponding to that parameter. This is the most difficult aspect of the construction, but it equals the intuitive choices of parameter sample size discussed in the earlier examples; formal definitions will be presented in Section 3. Then PBIC is defined as
\[ \mathrm{PBIC}\equiv -2\,l(\hat\theta)+\log|\hat I_{22}|+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg), \tag{4} \]
where $v_i=\hat\xi_i^2/[d_i(1+n_i^e)]$. For a certain natural prior distribution, PBIC will be shown to be accurate, as an approximation to $-2\log m(x)$, up to an $o(1)$ term as $n\to\infty$ (for fixed dimension $p$). Note that, if there are no parameters common to all models, then
\[ \mathrm{PBIC}=-2\,l(\hat\theta)+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg). \tag{5} \]
In the classic case considered by Schwarz, all $n_i^e$ would equal a common $n$, and the first two terms in this expression are then BIC (up to an $o(1)$ term); the ‘constant’ ignored by BIC is the final term in (5).

To summarise results in one place, here is the alternative version of the approximation, one which is more favourable to complex models; its development is given in Section 4:
\[ \mathrm{PBIC}^*\equiv -2\,l(\hat\theta)+\log|\hat I_{22}|+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\bigg). \tag{6} \]
Note that, if dealing with only normal mean parameters, PBIC and PBIC* are exact as approximations to $-2\log m(x)$, as discussed below. This means, for instance, that when dealing with $p\to\infty$, there would be no need to worry about the accuracy of the approximations.
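The quantities in (4)–(6) are all available from standard software output. The following sketch (ours, a direct transcription of the formulas as reconstructed above, not the authors' code) computes PBIC and PBIC* from the maximised log-likelihood, $\log|\hat I_{22}|$, and the per-parameter $d_i$, $\hat\xi_i$ and $n_i^e$.

```python
import numpy as np

def pbic(loglik_hat, logdet_I22, d, xi_hat, ne, star=False):
    """PBIC (4) or, with star=True, PBIC* (6); pass logdet_I22 = 0 when
    there are no parameters common to all models (giving (5))."""
    d, xi_hat, ne = map(np.asarray, (d, xi_hat, ne))
    v = xi_hat**2 / (d * (1 + ne))
    # note: as v -> 0, (1 - e^{-v})/(sqrt(2) v) -> 1/sqrt(2); guard if v can be 0
    if star:
        m = np.minimum(v, 1.3)
        last = np.log((1 - np.exp(-m)) / np.sqrt(2 * v * m))
    else:
        last = np.log((1 - np.exp(-v)) / (np.sqrt(2) * v))
    return -2 * loglik_hat + logdet_I22 + np.sum(np.log(1 + ne)) - 2 * np.sum(last)
```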

Here are the steps in the derivation of PBIC.

2.2.1. Laplace approximation

By a Taylor's series expansion of $l(\theta)$ about the mle $\hat\theta$, with $\nabla$ denoting the gradient and $\hat I$ being the observed information matrix as defined earlier,
\[ \begin{aligned} m(x)&=\int f(x\mid\theta)\pi(\theta)\,d\theta=\int e^{l(\theta)}\pi(\theta)\,d\theta\\ &=\int\exp\Big(l(\hat\theta)+(\theta-\hat\theta)^t\nabla l(\hat\theta)-\tfrac12(\theta-\hat\theta)^t\hat I(\theta-\hat\theta)\Big)\pi(\theta)\,d\theta\,(1+o(1))\\ &=e^{l(\hat\theta)}\int e^{-(1/2)(\theta-\hat\theta)^t\hat I(\theta-\hat\theta)}\pi(\theta)\,d\theta\,(1+o(1)), \end{aligned} \tag{7} \]
where $o(1)$ denotes a term that goes to zero as the sample size $n$ grows. Technical conditions for the validity of this Laplace approximation can be found in, e.g. Tierney et al. (1989) and Kass and Vaidyanathan (1992); the key assumption needed is that $\hat\theta$ occurs on the interior of the parameter space, so that $\nabla l(\hat\theta)=0$. (If this is not true, the analysis must proceed as in Dudley & Haughton, 1997; Haughton, 1991, 1993.) Also, the presence of $o(1)$ assumes that $p$ is fixed as $n$ grows. We will nevertheless use this approximation, even as $p$ grows with $n$, relying on the considerable evidence that the Laplace approximation is quite generally accurate.

Note that we do not use the more common version of the Laplace expansion, which involves $\pi(\theta)$ in the Taylor's expansion, because we will be choosing $\pi(\theta)$ so that the integral in this expression can be evaluated in closed form. In particular, this means that, if we are dealing with the situation where $\theta$ is the mean parameter of a normal model, then the computations herein will be entirely closed form, with no approximation being involved (and no need to then worry about $p$ growing with $n$).

2.2.2. Choosing a good prior $\pi(\theta)$

Assume that the transformations in Section 2.1 have been made.

Step 1. Recall that $\theta=(\theta_{(1)},\theta_{(2)})$, where $\theta_{(2)}$ denotes the parameters common to all models. We will utilise a prior distribution of the form $\pi(\theta)=(2\pi)^{-q/2}\,\pi(\theta_{(1)})$, where $\pi(\theta_{(1)})$ is defined later. The key point is that, since $\theta_{(2)}$ is common to all models, it can be assigned a constant prior density (see, e.g. Bayarri, Berger, Forte, & García-Donato, 2012; Berger, Pericchi, & Varshavsky, 1998); choosing the constant to be $(2\pi)^{-q/2}$ simplifies the resulting expression. With the definitions given in (3), integrating out $\theta_{(2)}$ results in the expression
\[ \begin{aligned} m(x)&=e^{l(\hat\theta)}\int\bigg[\int\exp\Big(-\tfrac12(\theta-\hat\theta)^t\hat I(\theta-\hat\theta)\Big)(2\pi)^{-q/2}\,d\theta_{(2)}\bigg]\pi(\theta_{(1)})\,d\theta_{(1)}\,(1+o(1))\\ &=e^{l(\hat\theta)}|\hat I|^{-1/2}\int\exp\Big(-\tfrac12(\theta_{(1)}-\hat\theta_{(1)})^t\Sigma^{-1}(\theta_{(1)}-\hat\theta_{(1)})\Big)|\Sigma|^{-1/2}\,\pi(\theta_{(1)})\,d\theta_{(1)}\,(1+o(1)). \end{aligned} \]

Step 2. Change variables to $\xi=O\theta_{(1)}$, where $O$ is an orthogonal matrix such that $\Sigma=O^tDO$, with $D=\mathrm{diag}(d_i)$ for $i=1,\ldots,p$. (The choice of $O$ does not matter in the following.) Note that $\hat\xi=O\hat\theta_{(1)}$.

For this model, we will utilise a prior distribution that is independent in the $\xi_i$, i.e. $\pi(\xi)=\prod_{i=1}^p\pi_i(\xi_i)$. Then we can write
\[ m(x)=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i}}\int e^{-(\xi_i-\hat\xi_i)^2/2d_i}\,\pi_i(\xi_i)\,d\xi_i\times(1+o(1)). \tag{8} \]
For $\pi_i(\xi_i)$, in a similar situation, Jeffreys (1961) recommended the Cauchy$(0,b_i)$ density $(1/\pi\sqrt{b_i})(1/(1+\xi_i^2/b_i))$, where $b_i$ is chosen to represent unit information for $\xi_i$ (see Kass & Wasserman, 1995; also to be discussed later). A prior that yields almost the same results is
\[ \pi_i^R(\xi_i)=\int_0^1 N\bigg(\xi_i\,\Big|\,0,\ \frac{1}{2\lambda_i}(d_i+b_i)-d_i\bigg)\frac{1}{2\sqrt{\lambda_i}}\,d\lambda_i, \tag{9} \]
which is well-defined if $b_i\geq d_i$. Interestingly, this prior is very similar to the Cauchy prior no matter what $d_i$ happens to be (as shown in the Appendix), so we will interpret this prior (and $b_i$) exactly as we would the Cauchy prior. The attraction of $\pi^R$ is that the ensuing computations can be done in closed form. That one can have all the advantages that Jeffreys pointed out are possessed by the Cauchy prior for model selection, while maintaining closed-form expressions, is a significant advantage when dealing with large model spaces. This prior was extensively discussed in Berger (1985), as a robust prior (hence the R label) for estimation problems, but its even greater value for model selection was not recognised. (This type of prior was first utilised in Strawderman (1971) in shrinkage estimation.) See also Bayarri et al. (2012), where a multivariate version of this prior is utilised for model selection in normal linear models.

With the prior in (9), the integral in (8) is straightforward to evaluate in closed form (first integrate over $\xi_i$, then over $\lambda_i$), yielding
\[ m(x)=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i+b_i}}\cdot\frac{1-e^{-\hat\xi_i^2/(d_i+b_i)}}{\sqrt2\,\hat\xi_i^2/(d_i+b_i)}\times(1+o(1)). \tag{10} \]

Step 3. Define the unit information, $b_i$, by
\[ b_i=n_i^e\,d_i,\quad\text{where } n_i^e=\text{effective sample size for } \xi_i,\ \text{and recall } v_i=\frac{\hat\xi_i^2}{d_i(1+n_i^e)}. \tag{11} \]
Definitions of the effective sample size will be given in Section 3. It will be the case that $n_i^e\geq1$, so that $b_i\geq d_i$ (the condition mentioned earlier for $\pi^R$ to be well defined). Then (10) becomes
\[ m(x)=e^{l(\hat\theta)}|\hat I|^{-1/2}|D|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{1+n_i^e}}\cdot\frac{1-e^{-v_i}}{\sqrt2\,v_i}\,(1+o(1)). \]
Since $|\hat I|=|\Sigma^{-1}||\hat I_{22}|=|D^{-1}||\hat I_{22}|$, we thus have that $-2\log m(x)=\mathrm{PBIC}+o(1)$, with PBIC defined in (4).
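As a sanity check on the closed form, the per-coordinate factor in (10) can be compared with direct numerical integration of the mixture representation (9). The sketch below (ours; it uses scipy quadrature and arbitrary illustrative values of $d$, $b$ and $\hat\xi$) substitutes $\lambda=t^2$ so that the mixing density becomes uniform in $t$.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

d, b, xi_hat = 0.5, 4.0, 1.7                    # illustrative values with b >= d

def integrand(t):
    # lambda = t^2; the xi-convolution N(xi_hat; xi, d) * N(xi; 0, (d+b)/(2 lam) - d)
    # has already been done analytically, giving N(xi_hat; 0, (d+b)/(2 lam)).
    return norm.pdf(xi_hat, 0.0, np.sqrt((d + b) / (2 * t * t)))

marginal, _ = quad(integrand, 1e-12, 1.0)
v = xi_hat**2 / (d + b)
closed = (1 - np.exp(-v)) / (np.sqrt(2) * v * np.sqrt(d + b))
print(np.sqrt(2 * np.pi) * marginal, closed)    # the two numbers should agree
```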

3. Defining ‘effective sample size’ $n_j^e$ for parameter $\xi_j$

The most difficult aspect of dealing with PBIC turns out to be defining the effective sample size corresponding to a parameter. We first present a solution for linear models, and then suggest a possible solution for the general case.

3.1. Effective sample sizes in linear models

Suppose that all models under consideration are linear models of the form
\[ Y=X^*\alpha+X\beta+\varepsilon,\quad\text{where } \varepsilon\sim N(0,\Gamma),\ \ \Gamma \text{ known}, \tag{12} \]
with dimensions $Y\,[n\times1]$, $X^*\,[n\times q]$, $\alpha\,[q\times1]$, $X\,[n\times p]$, $\beta\,[p\times1]$, $\varepsilon\,[n\times1]$ and $\Gamma\,[n\times n]$. Here $X^*\alpha$ is a common term present in all models (e.g. an intercept in linear regression), but $X\beta$ will differ from model to model. This fits into the framework for PBIC by defining $\theta_{(1)}=\beta$ and $\theta_{(2)}=\alpha$.

Since $\alpha$ will be integrated out in PBIC, only the effective sample size for linear functions of $\beta$ will be needed. The first step of the process is to orthogonalise the parameters by transforming $\alpha$ to $\alpha^*=\alpha+(X^{*t}\Gamma^{-1}X^*)^{-1}X^{*t}\Gamma^{-1}X\beta$ and defining
\[ \tilde X=\big(I-X^*(X^{*t}\Gamma^{-1}X^*)^{-1}X^{*t}\Gamma^{-1}\big)X. \tag{13} \]
Since $X^*\alpha^*+\tilde X\beta=X^*\alpha+X\beta$, the linear part of the model has not changed in this reparameterisation, but now $\tilde X^t\Gamma^{-1}X^*=0$, so that $\alpha^*$ and $\beta$ are orthogonal. There are two important aspects of this. First, since $X^*$ has not been altered, the new $\alpha^*$ can still be considered common parameters in each model, and will be integrated out in PBIC, so that their changed definition is irrelevant. Second, $\beta$ has not been transformed, which is crucial because we wish to define effective sample sizes for linear functions of $\beta$.

Write $\Gamma=\sigma R\sigma$, with $\sigma=\mathrm{diag}\{\sigma_1,\ldots,\sigma_n\}$, where $R$ is the correlation matrix, and define $C\,[p\times p]$ to be the diagonal matrix with entries $c_{ii}=\max_j\{|\tilde X_{ji}|/\sigma_j\}$. Berger, Bayarri, and Pericchi (2014) gave, as the general definition of the effective sample size (called TESS), for any scalar linear transformation $\xi=v\beta$ ($v$ is $[1\times p]$) of $\beta$,
\[ n^e=\frac{|v|^2}{vC(\tilde X^t\Gamma^{-1}\tilde X)^{-1}Cv^t}. \tag{14} \]
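In code, (14) is immediate once $\tilde X$, $\Gamma$ and $C$ are formed. The sketch below (our transcription) also checks it on a two-group design, where TESS for each group mean should equal that group's replicate count (cf. Example 3.1 below).

```python
import numpy as np

def tess(v, Xt, Gamma):
    """Effective sample size (14) for xi = v beta; Xt is the orthogonalised
    design X-tilde and Gamma the error covariance."""
    v = np.atleast_2d(v).astype(float)
    sig = np.sqrt(np.diag(Gamma))                        # per-observation sd's
    C = np.diag(np.max(np.abs(Xt) / sig[:, None], axis=0))
    M = np.linalg.inv(Xt.T @ np.linalg.inv(Gamma) @ Xt)
    return float(np.sum(v**2)) / (v @ C @ M @ C @ v.T).item()

# Two groups, r_1 = 3 and r_2 = 5 replicates, Gamma = sigma^2 I:
Xt = np.zeros((8, 2)); Xt[:3, 0] = 1.0; Xt[3:, 1] = 1.0
Gamma = 2.0 * np.eye(8)
print(tess([1, 0], Xt, Gamma), tess([0, 1], Xt, Gamma))  # 3.0 and 5.0
```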

Example 3.1

Group means example

Assume $Y_{ij}=\mu_i+\varepsilon_{ij}$ for $i=1,\ldots,p$ groups and $j=1,\ldots,r_i$ replicates in the $i$th group, and that the $\varepsilon_{ij}$ are i.i.d. $N(0,\sigma^2)$. Computation yields that TESS for $\mu_i$ is $n_i^e=r_i$, as is to be expected. Note that $r_i$ could be 1, which can be seen to be the lower bound on TESS for linear models when $\Gamma=\sigma^2 I$.

Example 3.2

Orthogonal and related designs

Assume that $X$ has orthogonal columns with entries $\pm a_i\neq0$, and that $\Gamma=\sigma^2 I$. Simple computation here shows that $n_i^e=n$ for each $\beta_i$.

Note that the effective sample size here is $n$, in contrast to the group means problem where the effective sample size can be as low as $r_i=1$. Indeed, it can be shown that, when $\Gamma=\sigma^2 I$, TESS will always be between 1 and $n$, with both limits attainable.

Example 3.3

Heteroscedastic independent observations

Assume $Y_i=\mu+\varepsilon_i$, with the $\varepsilon_i$ independent and $\varepsilon_i\sim N(0,\sigma_i^2)$, $i=1,\ldots,n$. Here the effective sample size is
\[ n^e=\frac{\sum_{i=1}^n 1/\sigma_i^2}{\max_i\{1/\sigma_i^2\}}. \]
Consider the particular case where, for $i=1,\ldots,n_1$, we have $Y_i\sim N(\mu,\sigma_1^2)$, whereas for the remaining $n_2=n-n_1$ observations, $Y_i\sim N(\mu,\sigma_2^2)$, where $\sigma_2^2$ is much larger than $\sigma_1^2$; thus, intuitively, only the first $n_1$ observations count. Then, unless $n_2$ is very large,
\[ n^e=\frac{n_1/\sigma_1^2+n_2/\sigma_2^2}{1/\sigma_1^2}=n_1+n_2\frac{\sigma_1^2}{\sigma_2^2}\approx n_1. \]
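Numerically (an illustration of ours, with $n_1$ precise and $n_2$ very noisy observations):

```python
import numpy as np

n1, n2, s1, s2 = 20, 30, 1.0, 30.0               # sigma_2^2 >> sigma_1^2
prec = np.r_[np.full(n1, 1 / s1**2), np.full(n2, 1 / s2**2)]
ne = prec.sum() / prec.max()                     # = n1 + n2 * s1^2 / s2^2
print(ne)                                        # ~ 20.03, essentially n1
```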

3.2. A general definition of effective sample size

Suppose one has independent observations $(x_1,\ldots,x_n)$. A possible general definition of the ‘effective sample size’ follows from considering the information associated with observation $x_i$ arising from the single-observation expected information matrix $I_i=O(I_{i,jk})O^t$, where
\[ I_{i,jk}=E\bigg[-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f_i(x_i\mid\theta)\bigg]\bigg|_{\theta=\hat\theta}. \]
Since $I_{jj}=\sum_{i=1}^n I_{i,jj}$ is the expected information about $\xi_j$, a reasonable way to define the effective sample size, $n_j^e$, is to

  • define information weights $w_{ij}=I_{i,jj}/\sum_{k=1}^n I_{k,jj}$;

  • define the effective sample size for $\xi_j$ as
\[ n_j^e=\frac{I_{jj}}{\sum_{i=1}^n w_{ij}I_{i,jj}}=\frac{I_{jj}^2}{\sum_{i=1}^n I_{i,jj}^2}. \]

Intuitively, $\sum_i w_{ij}I_{i,jj}$ is a weighted measure of the information ‘per observation’, and dividing the total information about $\xi_j$ by this information per case seems plausible as an effective sample size.

Unfortunately, this does not seem to always be an effective definition; for instance, it does not reduce to TESS for all linear models. This should thus be viewed as primarily a starting point for future investigations of effective sample size in nonlinear models.
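As a starting point nonetheless, here is the definition in code (a sketch of ours), applied to the mixture intuition of Example 1.3 by treating half the observations as carrying per-observation information 1 about $\theta$ and half as carrying information $1/1000$.

```python
import numpy as np

n = 100
I_i = np.r_[np.ones(n // 2), np.full(n // 2, 1e-3)]  # per-observation I_{i,jj}
ne = I_i.sum()**2 / np.sum(I_i**2)                   # n_j^e = I_jj^2 / sum_i I_{i,jj}^2
print(ne)                                            # ~ 50.1, close to n/2
```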

4. PBIC*: a version more favourable to complex models

Recall, from Raftery (1999), that BIC can be thought of as arising from unit information priors for each model that are centred at the model likelihood. This choice of prior seems highly favourable to more complex models, since the prior gives virtually all of its mass to a modest neighbourhood of the likelihood for each model.

In contrast, PBIC utilises unit information priors that are centred at 0 and, hence, can give little mass to the region of high model likelihood. The fat tails of the prior do result in reasonable answers (cf. Bayarri et al., 2012; Jeffreys, 1961), but it is of interest to investigate an intermediate solution.

The intermediate solution is to keep the prior centred at 0, but choose the scales of the prior, $b_i$, so that the prior will extend out to the likelihood. In our setup, this can be implemented by choosing the $b_i$ so as to maximise $m(x)$ in (10); thus we are effectively choosing the prior in our class that is most favourable to each model. Clearly this does allow the prior to give more mass to the region of high model likelihood, but does not allow complete concentration of mass in this region.

Since this prior is maximising the marginal likelihood among the given class, it can be viewed as the empirical Bayes prior in the class. It was also a choice popularised in the ‘robust Bayes’ literature (cf. Berger & Sellke, 1987), and was used in Bollen et al. (2012) to develop a related generalisation of BIC.

The $b_i$ that maximises (10) can easily be seen to be
\[ \hat b_i=\max\Big\{d_i,\ \frac{\hat\xi_i^2}{w}-d_i\Big\},\quad\text{with } w \text{ s.t. } e^{w}-1=2w,\ \text{or } w\approx1.3. \]
Unfortunately, the resulting version of BIC has serious problems; in particular, it will typically not be consistent as $n\to\infty$, in that, if $\xi_i$ is zero, the prior will concentrate about zero at such a fast rate that the models with and without $\xi_i$ are essentially equivalent (and one will fail to select the model without $\xi_i$ with probability approaching 1). This same lack of consistency afflicts the developments in Bollen et al. (2012) and the robust Bayesian choices.

The obvious solution is simply to prevent $\hat b_i$ from becoming too small, and the obvious constraint is to restrict it to the region $[n_i^e d_i,\infty)$. This yields the recommended choice
\[ b_i^*=\max\Big\{n_i^e d_i,\ \frac{\hat\xi_i^2}{1.3}-d_i\Big\}. \tag{15} \]
This will avoid inconsistency as $n\to\infty$ in that, as long as $b_i^*\to c$ with $c$ a non-zero constant, the resulting prior behaves asymptotically when $\xi_i=0$ as a fixed prior, and fixed priors will yield consistency as $n\to\infty$. (Consistency when the effective sample sizes do not grow is a more delicate issue, discussed in Section 5.5.)

Replacing $b_i$ with $b_i^*$, (10) becomes
\[ \begin{aligned} m(x)&=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i(1+n_i^e)\max\{1,v_i/1.3\}}}\cdot\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt2\,\min\{v_i,1.3\}}\,(1+o(1))\\ &=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i(1+n_i^e)}}\cdot\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\,(1+o(1)). \end{aligned} \]
The resulting approximation to $-2\log m(x)$ is given in (6).

5. PBIC and PBIC* for the linear model

5.1. The expressions

Consider the normal linear model framework in (12) and assume the orthogonalisation discussed there has been carried out. This does not change PBIC, but is more convenient because we can ignore the common orthogonal parameter $\alpha^*$ (see Bayarri et al., 2012; Berger et al., 1998; Jeffreys, 1961 for justification) and focus only on the other parameters $\beta$, with the associated model
\[ Y=\tilde X\beta+\varepsilon,\quad\text{where } \varepsilon\sim N(0,\Gamma),\ \Gamma \text{ known}, \tag{16} \]
with $\tilde X$ given by (13).

Following the PBIC algorithm, note that $\Sigma^{-1}=\tilde X^t\Gamma^{-1}\tilde X$. Change variables to $\xi=O\beta$, where $O$ is an orthogonal matrix such that $\Sigma=O^tDO$, with $D=\mathrm{diag}(d_i)$ for $i=1,\ldots,p$. Then, for each $\xi_j=O_j\beta$, define $n_j^e$ using (14) with $v=O_j$, and let $\hat\xi_j=O_j\hat\beta$, where $\hat\beta=(\tilde X^t\Gamma^{-1}\tilde X)^{-1}\tilde X^t\Gamma^{-1}Y$. Finally, recalling that $v_i=\hat\xi_i^2/[d_i(1+n_i^e)]$, PBIC and PBIC* are given by
\[ \mathrm{PBIC}=S^2+C+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg), \tag{17} \]
\[ \mathrm{PBIC}^*=S^2+C+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\bigg), \tag{18} \]
where $S^2$ is the usual residual sum of squares corresponding to (12) and $C=\log(|\Gamma|)+n\log(2\pi)$. Note that $C$ is the same constant in any model under consideration, and hence it can be ignored in comparing models or determining Bayes factors.

In what follows, we describe some important linear model examples. There are more, including correlated observations and autoregressive models, in Berger et al. (2014).

5.2. Simple linear regression

Let $Y_i=\alpha+X_i\beta+\varepsilon_i$, with the $\varepsilon_i$ i.i.d. $N(0,\sigma^2)$, so that
\[ Y=\begin{pmatrix}1 & X_1\\ \vdots & \vdots\\ 1 & X_n\end{pmatrix}\begin{pmatrix}\alpha\\ \beta\end{pmatrix}+\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix},\quad\text{where } \varepsilon\sim N(0,\sigma^2 I). \]
Suppose we are considering the two models $M_0:\beta=0$ and $M_1:\beta\neq0$. Computation under $M_1$ yields $\tilde X=(X_1-\bar X,\ldots,X_n-\bar X)^t$, so that $\Sigma=\sigma^2/s_x^2$, where $s_x^2=\sum_{i=1}^n(X_i-\bar X)^2$. Also, from (14),
\[ n^e=\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\max_i\{(X_i-\bar X)^2\}}=\frac{s_x^2}{\max_i\{(X_i-\bar X)^2\}}. \tag{19} \]
Finally, $d=\Sigma=\sigma^2/s_x^2$, $v=\hat\beta^2/[d(1+n^e)]$, and
\[ S^2=\frac{1}{\sigma^2}\bigg[|Y|^2-\frac{\big(\sum_{i=1}^n(x_i-\bar x)y_i\big)^2}{\sum_{i=1}^n(x_i-\bar x)^2}\bigg]=\frac{1}{\sigma^2}\big(|Y|^2-s_x^2\hat\beta^2\big) \]
complete the terms needed to define PBIC and PBIC* under $M_1$. Under $M_0$, we only need $S^2=(1/\sigma^2)|Y|^2$; thus, with $v=\hat\beta^2/\big[\sigma^2\big(s_x^{-2}+(\max_i\{(X_i-\bar X)^2\})^{-1}\big)\big]$,
\[ \Delta \mathrm{PBIC}=-\frac{s_x^2\hat\beta^2}{\sigma^2}+\log\bigg(1+\frac{s_x^2}{\max_i\{(X_i-\bar X)^2\}}\bigg)-2\log\bigg(\frac{1-e^{-v}}{\sqrt2\,v}\bigg). \]
$\Delta \mathrm{PBIC}^*$ is the obvious modification of this.
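Putting the pieces together, here is an end-to-end sketch (ours, on simulated data) of $\Delta \mathrm{PBIC}$ for simple linear regression, using (19) and the expressions above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, beta_true = 40, 1.0, 0.5
X = rng.uniform(0, 10, size=n)
Y = 1.0 + beta_true * X + rng.normal(scale=np.sqrt(sigma2), size=n)

xc = X - X.mean()                      # X-tilde after orthogonalising the intercept
sx2 = np.sum(xc**2)
beta_hat = np.sum(xc * Y) / sx2
ne = sx2 / np.max(xc**2)               # effective sample size (19)
d = sigma2 / sx2
v = beta_hat**2 / (d * (1 + ne))
delta_pbic = (-sx2 * beta_hat**2 / sigma2 + np.log(1 + ne)
              - 2 * np.log((1 - np.exp(-v)) / (np.sqrt(2) * v)))
print(delta_pbic)                      # strongly negative: evidence for M_1
```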

5.3. Testing equality of two means with unequal variances

Consider comparing two normal means via the test $H_0:\mu_1=\mu_2$ versus $H_1:\mu_1\neq\mu_2$, where the associated known variances, $\sigma_1^2$ and $\sigma_2^2$, are not equal. The linear model is thus
\[ Y=X\mu+\varepsilon=\begin{pmatrix}1 & 0\\ \vdots & \vdots\\ 1 & 0\\ 0 & 1\\ \vdots & \vdots\\ 0 & 1\end{pmatrix}\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}+\begin{pmatrix}\varepsilon_{11}\\ \vdots\\ \varepsilon_{2n_2}\end{pmatrix},\quad \varepsilon\sim N\big(0,\ \mathrm{diag}\{\sigma_1^2,\ldots,\sigma_1^2,\sigma_2^2,\ldots,\sigma_2^2\}\big), \]
with $n_1$ replicates in the first group and $n_2$ in the second. Defining $\alpha=(\mu_1+\mu_2)/2$ and $\beta=(\mu_1-\mu_2)/2$ places this in the linear model comparison framework, where we are comparing $M_0:\beta=0$ versus $M_1:\beta\neq0$ with the covariate matrix
\[ B=X\begin{pmatrix}1 & 1\\ 1 & -1\end{pmatrix}=\begin{pmatrix}1 & 1\\ \vdots & \vdots\\ 1 & 1\\ 1 & -1\\ \vdots & \vdots\\ 1 & -1\end{pmatrix}. \]
Under $M_1$, computation yields
\[ \tilde X=\Big(\frac{n_2 n^*}{\sigma_2^2},\ldots,\frac{n_2 n^*}{\sigma_2^2},\ -\frac{n_1 n^*}{\sigma_1^2},\ldots,-\frac{n_1 n^*}{\sigma_1^2}\Big)^t,\quad\text{with } n^*=\Big(\frac{n_1}{\sigma_1^2}+\frac{n_2}{\sigma_2^2}\Big)^{-1}, \]
so that $d=\Sigma=\sigma_1^2/n_1+\sigma_2^2/n_2$. Also, from (14),
\[ n^e=\frac{\sigma_1^2/n_1+\sigma_2^2/n_2}{\max\{\sigma_1^2/n_1^2,\ \sigma_2^2/n_2^2\}}=\min\Big\{\frac{n_1^2}{\sigma_1^2},\frac{n_2^2}{\sigma_2^2}\Big\}\Big(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\Big), \]
and $v=\hat\beta^2/[d(1+n^e)]$.

A special case is the standard test of equality of means when $\sigma_1^2=\sigma_2^2=\sigma^2$. Then
\[ n^e=\min\Big\{n_1\Big(1+\frac{n_1}{n_2}\Big),\ n_2\Big(1+\frac{n_2}{n_1}\Big)\Big\}. \]
While this may look unusual, looking at the extremes indicates why it is reasonable. Indeed, as, say, $n_1\to\infty$, note that $n^e\to n_2$. In this scenario, we perfectly learn $\mu_1$, so the test of mean equality is really just a test that $\mu_2$ equals this known mean, based on $n_2$ observations. Attempting to utilise BIC with an ad hoc choice of $n$, such as $(n_1+n_2)/2$, would clearly be a disaster here.

5.4. Findley's counterexample to BIC

For the simple linear model in (2), computation yields that, under $H_1:\theta\neq0$,
\[ d=\Sigma=\bigg(\sum_{i=1}^n\frac{1}{i}\bigg)^{-1},\quad n^e=\sum_{i=1}^n\frac{1}{i},\quad S^2=|Y|^2-\hat\theta^2\sum_{i=1}^n\frac{1}{i}. \]
It follows that
\[ \Delta \mathrm{PBIC}=-\hat\theta^2\sum_{i=1}^n\frac{1}{i}+\log\bigg(1+\sum_{i=1}^n\frac{1}{i}\bigg)-2\log\bigg(\frac{1-e^{-v}}{\sqrt2\,v}\bigg),\quad v=\frac{\hat\theta^2}{d(1+n^e)}. \]
Since $\sum_{i=1}^n(1/i)=\log n+O(1)$ and $\hat\theta^2\to\theta^2$ (because the mle is consistent here),
\[ \Delta \mathrm{PBIC}=-\theta^2(\log n+O(1))+\log(\log n)-2\log\bigg(\frac{1-e^{-\theta^2}}{\sqrt2\,\theta^2}\bigg)+o(1). \]
Under $H_0:\theta=0$, $\Delta \mathrm{PBIC}=\log(\log n)+\log2+o(1)\to\infty$ and, under $H_1:\theta\neq0$, $\Delta \mathrm{PBIC}\to-\infty$. Thus PBIC is consistent as $n\to\infty$. Essentially the same argument shows that PBIC* is consistent.

5.5. Consistency of PBIC and PBIC* as p in the group means problem

Bayes model selection rules for fixed priors and fixed $p$ are virtually always consistent as the sample size $n\to\infty$. This type of consistency transfers over to rules such as PBIC and PBIC* because the priors from which they arise converge to fixed priors as $n\to\infty$ with $p$ fixed.

There is nothing within Bayesian theory, however, that guarantees consistency of Bayes rules when the dimension p also grows. Indeed, it turns out that consistency is then a very delicate property, that can easily be violated by even standard Bayes rules. The group means problem provides a simple illustration.

Example 5.1

Consider the group means problem with known $\sigma^2=1$ and effective sample sizes $n_i^e=r$ fixed, and reduce to the sufficient statistics $\bar X_i\sim N(\mu_i,1/r)$ for $i=1,\ldots,p$. Consider comparison of the null model $M_0:\mu_1=\cdots=\mu_p=0$ with the full model $M_1$: all $\mu_i$ non-zero. Suppose the $\mu_i$ are independently assigned $N(0,\tau_i^2)$ priors. Then it is easy to show that consistency obtains under $M_1$ as $p\to\infty$ if and only if $V\equiv\lim_{p\to\infty}(1/p)\sum_i^p\bar X_i^2$ satisfies $V>\lim_{p\to\infty}(1/p)\sum_i^p\tau_i^2$, assuming the limits exist. (This example was brought to our attention by J. K. Ghosh.)

After reflecting upon this, it might seem surprising that any prior could achieve consistency as $p\to\infty$. However, Berger et al. (2003) computed Laplace approximations to the marginal density, for this problem, that produced consistent Bayes factors when $p$ grows with $n$. They used a multivariate Cauchy prior, which does not result in a closed-form Bayes factor, as arises with PBIC and PBIC*. The next theorem indicates the situation involving consistency for PBIC and PBIC*.

Theorem 5.2

For the group means problem with fixed $r$, PBIC and PBIC* are consistent under $M_0$ as $p\to\infty$. Under $M_1$, and assuming that $\tau^2=\lim_{p\to\infty}(1/p)\sum_i^p\mu_i^2$ exists, PBIC and PBIC* are
\[ \text{consistent if } \tau^2>\frac{\log2+\log(1+r)+1}{r};\qquad\text{inconsistent if } \tau^2<\frac{\log2+\log(1+r)-1}{r}. \tag{20} \]

Proof.

We utilise (17) and (18) as the definitions of PBIC and PBIC*, but will ignore $C$ since it is common to all models. Note that the $n_i^e=r$, $S_1^2=\sum_{i=1}^p\sum_{j=1}^r(x_{ij}-\bar x_i)^2$, $S_0^2=S_1^2+r\sum_{i=1}^p\bar x_i^2$, and $v_i=r\bar x_i^2/(r+1)$ under $M_1$ while $v_i=0$ under $M_0$. Thus PBIC and PBIC* become, with subscripts denoting the model,
\[ \begin{aligned} \mathrm{PBIC}_0&=\mathrm{PBIC}^*_0=S_0^2=S_1^2+r\sum_{i=1}^p\bar x_i^2,\\ \mathrm{PBIC}_1&=S_1^2+p\log(r+1)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg)=S_1^2+p\log[2(r+1)]-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{v_i}\bigg),\\ \mathrm{PBIC}^*_1&=S_1^2+p\log(r+1)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\bigg)=S_1^2+p\log[2(r+1)]-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{v_i\min\{v_i,1.3\}}}\bigg). \end{aligned} \]
It is straightforward to show that
\[ \frac{1-e^{-v_i}}{v_i}<1\quad\text{and}\quad\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{v_i\min\{v_i,1.3\}}}<1, \]
so that $\Delta \mathrm{PBIC}=\mathrm{PBIC}_1-\mathrm{PBIC}_0$ and $\Delta \mathrm{PBIC}^*=\mathrm{PBIC}^*_1-\mathrm{PBIC}^*_0$ satisfy
\[ \Delta \mathrm{PBIC}\ (\Delta \mathrm{PBIC}^*)>p\log[2(r+1)]-r\sum_{i=1}^p\bar x_i^2\equiv A(p). \]
Under $M_0$, $r\sum_{i=1}^p\bar x_i^2\sim\chi^2_p$, so that
\[ A(p)=p\log[2(r+1)]-p\bigg(1+O\bigg(\frac{1}{\sqrt p}\bigg)\bigg)\to\infty \]
as $p\to\infty$, establishing consistency under $M_0$.

To show inconsistency under $M_1$, note that $r\sum_{i=1}^p\bar x_i^2\sim\chi^2_p(\lambda_p)$, with non-centrality parameter $\lambda_p=r\sum_{i=1}^p\mu_i^2$. Thus
\[ A(p)=p\log[2(r+1)]-(p+\lambda_p)\bigg(1+O\bigg(\frac{1}{\sqrt{p+\lambda_p}}\bigg)\bigg)\to\infty \]
if $\tau^2=\lim_{p\to\infty}\lambda_p/[rp]<(\log[2(1+r)]-1)/r$, establishing the inconsistency result.

To investigate consistency of PBIC and PBIC* under $M_1$, note that
\[ \frac{1-e^{-\min\{v,1.3\}}}{\sqrt{v\,\min\{v,1.3\}}}\geq\frac{1-e^{-v}}{v}\geq\frac{1}{1+v}. \tag{21} \]
Also, because of concavity, $E[\log(1+v)]\leq\log(1+E[v])$. Thus
\[ E\bigg[-\log\bigg(\frac{1-e^{-v_i}}{v_i}\bigg)\bigg]\leq E[\log(1+v_i)]\leq\log(1+E[v_i])=\log\bigg(1+\frac{1+r\mu_i^2}{1+r}\bigg). \]
Using this inequality and the fact that $\big(\prod_i\omega_i\big)^{1/p}\leq\big(\sum_i\omega_i\big)/p$, it follows that
\[ -\frac{2}{p}\,E\bigg[\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{v_i}\bigg)\bigg]\leq 2\log\bigg[\prod_{i=1}^p\bigg(1+\frac{1+r\mu_i^2}{1+r}\bigg)\bigg]^{1/p}\leq 2\log\bigg[\frac{1}{p}\sum_{i=1}^p\bigg(1+\frac{1+r\mu_i^2}{1+r}\bigg)\bigg]=2\log\bigg(\frac{2+r+\lambda_p/p}{1+r}\bigg). \]
Hence, by the law of large numbers,
\[ \lim_{p\to\infty}\frac{1}{p}\Delta \mathrm{PBIC}\leq\log[2(r+1)]+2\log\bigg(\frac{2+r+r\tau^2}{1+r}\bigg)-\lim_{p\to\infty}\bigg(1+\frac{\lambda_p}{p}\bigg)\bigg(1+O\bigg(\frac{1}{\sqrt{p+\lambda_p}}\bigg)\bigg)=\log[2(r+1)]+2\log\bigg(\frac{2+r+r\tau^2}{1+r}\bigg)-(1+r\tau^2). \]
Let $B(r,\tau^2)$ denote the right-hand side above. If $B(r,\tau^2)<0$, then $\Delta \mathrm{PBIC}$ goes to $-\infty$ as $p\to\infty$, and we have consistency.

Differentiating with respect to $\tau^2$ shows that $B(r,\tau^2)$ is decreasing in $\tau^2$, so that, if we can find a value of $\tau^2$ for which $B(r,\tau^2)<0$, then any larger value of $\tau^2$ will also work. As a candidate, consider $\tau_c^2=[c+\log(1+r)]/r$. Then
\[ B(r,\tau_c^2)=\log[2(r+1)]+2\log\bigg(\frac{2+r+c+\log(1+r)}{1+r}\bigg)-\big(1+c+\log(1+r)\big). \]
Differentiating this with respect to $r$ shows that it is decreasing in $r$, so that all we need to show is that $\tau_c^2$ works for $r=1$. Indeed,
\[ B(1,\tau_c^2)=\log4+2\log\bigg(\frac{3+c+\log2}{2}\bigg)-(1+c+\log2)<0, \]
if $c>1.67$. Since $1+\log2=1.693>1.67$, the condition for consistency of PBIC in the theorem is established. And because of (21), the same condition ensures that PBIC* is consistent.
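The final bound is easy to check numerically; this sketch (ours) evaluates $B(r,\tau_c^2)$ at $c=1+\log2$ for several values of $r$, confirming that it is negative throughout, as the proof requires.

```python
import numpy as np

def B(r, tau2):
    return (np.log(2 * (r + 1))
            + 2 * np.log((2 + r + r * tau2) / (1 + r))
            - (1 + r * tau2))

c = 1 + np.log(2)                       # = 1.693... > 1.67
for r in [1, 2, 5, 10, 100]:
    tau2c = (c + np.log(1 + r)) / r
    print(r, round(B(r, tau2c), 3))     # all negative
```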

Note that, if $r$ is moderately large, PBIC and PBIC* are consistent under $M_1$ unless $\tau^2$ is extremely close to 0, i.e. unless the non-zero means are extremely close to 0; it is not surprising that it is difficult to distinguish between $M_1$ and $M_0$ in this situation.

There is a gap in the theorem between the consistency and inconsistency conditions under M1. The gap is quite large for small r, but shrinks as r grows. A more refined analysis would reduce the gap, but the theorem does convey the basic messages about consistency.

More generally, $M_0$ could be a group means model containing some zero and some non-zero means. If $M_0$ is nested in $M_1$ and the number of additional non-zero means in $M_1$ goes to $\infty$, then the theorem still applies, since the common non-zero means will be integrated out at the beginning and will not affect the analysis.

Acknowledgements

This research was begun under the auspices of the 2004–2005 SAMSI program on Latent Variables in the Social Sciences.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

M. J. Bayarri's research was supported by the Spanish Ministry of Education and Science [grant number MTM2010-19528]; James Berger's research was supported by USA National Science Foundation [grant numbers DMS-1007773 and DMS-1407775]; Woncheol Jang's research was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIP), No. 2014R1A4A1007895 and No. 2017R1A2B2012816; Luis Pericchi's research was supported by grant CA096297/CA096300 from the USA National Cancer Institute of the National Institutes of Health.

References

  • Bayarri, M. J., Berger, J. O., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40(3), 1550–1577. doi: 10.1214/12-AOS1013
  • Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer-Verlag.
  • Berger, J. O., Bayarri, M. J., & Pericchi, L. R. (2014). The effective sample size. Econometric Reviews, 33, 197–217. doi: 10.1080/07474938.2013.807157
  • Berger, J. O., Ghosh, J. K., & Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference, 112, 241–258. doi: 10.1016/S0378-3758(02)00336-1
  • Berger, J. O., Pericchi, L. R., & Varshavsky, J. A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya: The Indian Journal of Statistics. Series A, 60, 307–321.
  • Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American statistical Association, 82, 112–122.
  • Bollen, K. A., Ray, S., Zavisca, J., & Harden, J. J. (2012). A comparison of Bayes factor approximation methods including two new methods. Sociological Methods and Research, 41, 294–324. doi: 10.1177/0049124112452393
  • Chakrabarti, A., & Ghosh, J. K. (2006). A generalization of BIC for the general exponential family. Journal of Statistical Planning and Inference, 136(9), 2847–2872. doi: 10.1016/j.jspi.2005.01.005
  • Drton, M., & Plummer, M. (2017). A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2), 323–380. doi: 10.1111/rssb.12187
  • Dudley, R. M., & Haughton, D. (1997). Information criteria for multiple data sets and restricted parameters. Statistica Sinica, 7, 265–284.
  • Findley, D. F. (1991). Counterexamples to parsimony and BIC. Annals of the Institute of Statistical Mathematics, 43, 505–514. doi: 10.1007/BF00053369
  • Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Advances in neural information processing systems (pp. 604–612).
  • Haughton, D. (1988). On the choice of a model to fit data from an exponential family. The Annals of Statistics, 16(1), 342–355. doi: 10.1214/aos/1176350709
  • Haughton, D. (1991). Consistency of a class of information criteria for model selection in non-linear regression. Communications in Statistics: Theory and Methods, 20, 1619–1629. doi: 10.1080/03610929108830587
  • Haughton, D. (1993). Consistency of a class of information criteria for model selection in nonlinear regression. Theory of Probability and its Applications, 37, 47–53. doi: 10.1137/1137009
  • Haughton, D., Oud, J., & Jansen, R. (1997). Information and other criteria in structural equation model selection. Communications in Statistics, Part B – Simulation and Computation, 26(4), 1477–1516. doi: 10.1080/03610919708813451
  • Jeffreys, H. (1961). Theory of probability. London: Oxford University Press.
  • Kass, R. E., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. doi: 10.1080/01621459.1995.10476572
  • Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society, Series B, 54, 129–144.
  • Kass, R. E., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928–934. doi: 10.1080/01621459.1995.10476592
  • Pauler, D. (1998). The Schwarz criterion and related methods for normal linear models. Biometrika, 85, 13–27. doi: 10.1093/biomet/85.1.13
  • Raftery, A. E. (1999). Bayes factors and BIC – comment on ‘A critique of the Bayesian information criterion for model selection’. Sociological Methods and Research, 27, 411–427. doi: 10.1177/0049124199027003005
  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. doi: 10.1214/aos/1176344136
  • Shen, G., & Ghosh, J. K. (2011). Developing a new BIC for detecting change-points. Journal of Statistical Planning and Inference, 141(4), 1436–1447. doi: 10.1016/j.jspi.2010.10.017
  • Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society, Series B, 41, 276–278.
  • Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, 42(1), 385–388. doi: 10.1214/aoms/1177693528
  • Tierney, L., Kass, R. E., & Kadane, J. B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407), 710–716. doi: 10.1080/01621459.1989.10478824
  • Zak-Szatkowska, M., & Bogdan, M. (2011). Modified versions of Bayesian information criterion for sparse generalized linear models. Computational Statistics and Data Analysis, 55, 2908–2924. doi: 10.1016/j.csda.2011.04.016

Appendix

To see that the prior in (9) is almost the same as $\pi_C$, the Cauchy$(0,b)$ prior (we drop the $i$ subscripts in this appendix), consider the extremes.

Theorem A.1

For $b\geq d$,
\[ \lim_{|\xi|\to\infty}\frac{\pi_C(\xi)}{\pi_R(\xi)}=\frac{2\sqrt b}{\sqrt{\pi(b+d)}}\in(0.80,1.13),\qquad \frac{\pi_C(0)}{\pi_R(0)}=\frac{2d}{\sqrt{\pi b}\,\big(\sqrt{b+d}-\sqrt{b-d}\big)}\in(0.80,1.13). \]

Proof.

Note that
\[ \pi_R(0)=\frac{1}{\sqrt{2\pi}}\int_0^1\frac{1}{\sqrt{\frac{d+b}{2\lambda}-d}}\cdot\frac{1}{2\sqrt\lambda}\,d\lambda=\frac{\sqrt{b+d}-\sqrt{b-d}}{2d\sqrt\pi}. \]
Hence
\[ \frac{\pi_C(0)}{\pi_R(0)}=\frac{2d}{\sqrt{\pi b}\,\big(\sqrt{b+d}-\sqrt{b-d}\big)}. \]
It is straightforward to show that $\sqrt b\,\big(\sqrt{b+d}-\sqrt{b-d}\big)$ is decreasing in $b\geq d$, with a maximum of $\sqrt2\,d$ and minimum of $d$. Thus $\sqrt{2/\pi}\leq\pi_C(0)/\pi_R(0)\leq\sqrt{4/\pi}$, which (to 2 decimal places) is the result above.

To prove the result as $|\xi|\to\infty$, separately integrate over $\Gamma_1=(0,|\xi|^{-3/2})$ and $\Gamma_2=(|\xi|^{-3/2},1)$ in (9). For $\lambda\in\Gamma_1$, note that $(d+b-2\lambda d)^{-1}=(d+b)^{-1}+O(|\xi|^{-3/2})$, so that
\[ (d+b-2\lambda d)^{-1/2}=(d+b)^{-1/2}+O(|\xi|^{-3/2}),\qquad \exp\bigg(-\frac{\xi^2\lambda}{d+b-2\lambda d}\bigg)=\exp\bigg(-\frac{\xi^2\lambda}{d+b}\bigg)\big(1+O(|\xi|^{-1})\big). \]
Hence the integral over $\Gamma_1$ is
\[ \frac{1}{2\sqrt\pi}\int_0^{|\xi|^{-3/2}}\bigg[\frac{1}{\sqrt{d+b}}+O(|\xi|^{-3/2})\bigg]\exp\bigg(-\frac{\xi^2\lambda}{d+b}\bigg)\big(1+O(|\xi|^{-1})\big)\,d\lambda=\frac{\sqrt{d+b}}{2\sqrt\pi\,\xi^2}\bigg(1-\exp\bigg(-\frac{\sqrt{|\xi|}}{d+b}\bigg)\bigg)\big(1+O(|\xi|^{-1})\big). \]

Noting that $\exp(-\xi^2\lambda/[d+b-2\lambda d])$ is decreasing in $\lambda$, it is immediate that the integral over $\Gamma_2$ is bounded above by
\[ \frac{\exp(-\sqrt{|\xi|}/[d+b])}{2\sqrt\pi}\int_{|\xi|^{-3/2}}^1\frac{1}{\sqrt{d+b-2\lambda d}}\,d\lambda=\frac{\exp(-\sqrt{|\xi|}/[d+b])}{2\sqrt\pi\,d}\Big(\sqrt{d+b-2|\xi|^{-3/2}d}-\sqrt{b-d}\Big)=o(|\xi|^{-2}). \]

It follows that
\[ \lim_{|\xi|\to\infty}\frac{\pi_C(\xi)}{\pi_R(\xi)}=\lim_{|\xi|\to\infty}\frac{\pi^{-1}\xi^{-2}\sqrt b\,\big(1+O(|\xi|^{-2})\big)}{\frac{\sqrt{d+b}}{2\sqrt\pi\,\xi^2}\big(1-\exp(-\sqrt{|\xi|}/[d+b])\big)\big(1+O(|\xi|^{-1})\big)+o(|\xi|^{-2})}=\frac{2\sqrt b}{\sqrt{\pi(b+d)}}. \]

It is straightforward to show that $\sqrt{2/\pi}\leq 2\sqrt b/\sqrt{\pi(b+d)}\leq\sqrt{4/\pi}$, yielding (to two decimal places) the conclusion.
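The closeness asserted by the theorem is easy to see numerically; the sketch below (ours, using scipy) evaluates $\pi_R$ by quadrature in the mixing variable (substituting $\lambda=t^2$) and prints the ratio $\pi_C/\pi_R$ on a grid, which stays within roughly $(0.80,1.13)$.

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad

d, b = 1.0, 3.0                                    # illustrative values, b >= d

def pi_R(xi):
    f = lambda t: norm.pdf(xi, 0.0, np.sqrt((d + b) / (2 * t * t) - d))
    return quad(f, 1e-12, 1.0)[0]                  # lambda = t^2 substitution

def pi_C(xi):
    return cauchy.pdf(xi, scale=np.sqrt(b))        # Cauchy(0, b) density

for xi in [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]:
    print(xi, round(pi_C(xi) / pi_R(xi), 3))
```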
