Discussion Paper and Discussions

Prior-based Bayesian information criterion

Pages 2-13 | Received 24 Jun 2017, Accepted 10 Feb 2019, Published online: 14 Mar 2019

ABSTRACT

We present a new approach to model selection and Bayes factor determination, based on Laplace expansions (as in BIC), which we call the Prior-based Bayes Information Criterion (PBIC). In this approach, the Laplace expansion is only done with the likelihood function, and then a suitable prior distribution is chosen to allow exact computation of the (approximate) marginal likelihood arising from the Laplace approximation and the prior. The result is a closed-form expression similar to BIC, but one that now involves a term arising from the prior distribution (which BIC ignores) and also incorporates the idea that different parameters can have different effective sample sizes (whereas BIC only allows one overall sample size $n$). We also consider a modification of PBIC which is more favourable to complex models.

1. Background

1.1. The original BIC (Schwarz, 1978)

Suppose that we observe $X_i=(X_{i1},\ldots,X_{ip})\sim g(x_i\mid\theta)$ for $i=1,\ldots,n$. Here $\theta=(\theta_1,\ldots,\theta_p)$ is an unknown vector and, in Schwarz's derivation of BIC, $g(x\mid\theta)$ is an exponential family. Then the log-likelihood function is $l(\theta)=\log f(x\mid\theta)=\log\prod_{i=1}^n g(x_i\mid\theta)$, where $x=(x_1,\ldots,x_n)$. The goal of Schwarz (1978) is to find a simple approximation to the marginal density $m(x)=\int f(x\mid\theta)\pi(\theta)\,d\theta$, where $\pi(\theta)$ is a prior density for the unknown $\theta$, and to use the approximation for model comparison.

Result 1.1

Stone, 1979

Let $\hat\theta$ be the MLE of $\theta$. Then, under reasonable conditions and as $n\to\infty$,
\[ \mathrm{BIC}\equiv -2\,l(\hat\theta)+p\log n=-2\log m(x)+c+o(1), \]
where $c$ is a constant.

Schwarz then suggested comparing two models $M_1$ and $M_2$ using $\Delta \mathrm{BIC}=\mathrm{BIC}_2-\mathrm{BIC}_1$,

preferring $M_2$ ($M_1$) as this is negative (positive). Clearly this is equivalent to basing the model comparison on the Bayes factor (odds) of $M_2$ to $M_1$, with the approximation
\[ B_{21}\equiv\frac{m_2(x)}{m_1(x)}=\frac{\exp\big(-\frac12 \mathrm{BIC}_2\big)}{\exp\big(-\frac12 \mathrm{BIC}_1\big)}\exp\Big(\frac12(c_2-c_1)\Big)\times(1+o(1))\approx\frac{\exp\big(-\frac12 \mathrm{BIC}_2\big)}{\exp\big(-\frac12 \mathrm{BIC}_1\big)}. \tag{1} \]
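As a concrete illustration (ours, not from the paper), the following minimal sketch computes BIC for two simple normal-mean models and the resulting approximate Bayes factor from (1); the data, models and all numbers are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # n = 50 observations, known sigma = 1
n = x.size

# M1: X_i ~ N(0, 1); no free parameters, so the p*log(n) penalty vanishes.
loglik1 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(x**2)
bic1 = -2 * loglik1

# M2: X_i ~ N(theta, 1); the mle is the sample mean and p = 1.
theta_hat = x.mean()
loglik2 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - theta_hat) ** 2)
bic2 = -2 * loglik2 + np.log(n)

# Approximate Bayes factor B_21 from (1), ignoring the exp((c2 - c1)/2) term.
B21 = np.exp(-0.5 * (bic2 - bic1))
print(bic1, bic2, B21)
```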

1.2. Problems with general use of BIC

BIC is an excellent tool for the class of problems for which it was developed. Unfortunately, it is today used ubiquitously, for completely different classes of problems. We here outline some of the issues with using BIC inappropriately.

Problem 1. The term $\exp\big(\frac12(c_2-c_1)\big)$ in (1) is ignored by BIC.

This could have been a serious problem even with proper use of BIC, except that there happen to be pseudo-prior distributions that yield BIC itself (Raftery, 1999), i.e. for which the term $\exp\big(\frac12(c_2-c_1)\big)=1$. These pseudo-priors are not real priors, in that they are centred at the mle's of each model, which is a problematical double use of the data. Nevertheless, it is comforting that there is at least some type of prior distribution that yields BIC exactly.

Problem 2. What is $n$?

  1. A common mistake in specifying n: Note that, in Schwarz's setup, there are $n$ vector observations of dimension $p$, so that there are a total of $np$ real observations. It is common to mistakenly use $np$ as the sample size in BIC, rather than the correct $n$.

  2. Different parameters can have different $n$.

Example 1.2

Group means

For $i=1,\ldots,p$ and $l=1,\ldots,r$, suppose we observe $X_{il}=\mu_i+\varepsilon_{il}$, where $\varepsilon_{il}\sim N(0,\sigma^2)$. If $\sigma^2$ were known, this would be exactly the setup of Schwarz, and the sample size for $\mu=(\mu_1,\ldots,\mu_p)$ would be $r$. In effect, each $\mu_i$ has a sample size of $r$ associated with it. But, if $\sigma^2$ is unknown, the parameter is $\theta=(\mu_1,\ldots,\mu_p,\sigma^2)$ and it is not reasonable to also associate the sample size $r$ with $\sigma^2$, in that we know there are $p(r-1)$ degrees of freedom associated with the mle of $\sigma^2$.

An alternative argument is to note that the observed information matrix $\hat I=(\hat I_{jk})$, with $(j,k)$ entry $\hat I_{jk}=-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f(x\mid\theta)\big|_{\theta=\hat\theta}$, is given by
\[ \hat I=\begin{pmatrix}\frac{r}{\hat\sigma^2}I_{p\times p} & 0\\ 0 & \frac{pr}{2\hat\sigma^4}\end{pmatrix},\quad\text{where } \hat\sigma^2=\frac{1}{pr}\sum_{i=1}^p\sum_{l=1}^r(X_{il}-\bar X_i)^2. \]
The information matrix suggests that the effective sample size for each $\mu_i$ is $r$, while the effective sample size for $\sigma^2$ is $pr$. Whether we use $p(r-1)$ or $pr$ for the sample size associated with $\sigma^2$ will not typically make much difference, whereas the difference with using $r$, instead, will be quite large.
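For readers who want to verify the displayed information matrix numerically, here is a minimal sketch (our illustration, not the authors' code): it computes the Hessian of the negative log-likelihood at the mle by central differences and compares its diagonal with the closed-form entries $r/\hat\sigma^2$ and $pr/(2\hat\sigma^4)$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 3, 10
x = rng.normal(size=(p, r)) * 2.0 + np.arange(p)[:, None]  # toy group data

mu_hat = x.mean(axis=1)
sig2_hat = np.mean((x - mu_hat[:, None]) ** 2)
theta_hat = np.append(mu_hat, sig2_hat)        # theta = (mu_1,...,mu_p, sigma^2)

def negloglik(theta):
    mu, s2 = theta[:-1], theta[-1]
    return 0.5 * p * r * np.log(2 * np.pi * s2) + np.sum((x - mu[:, None]) ** 2) / (2 * s2)

def hessian(f, t, h=1e-4):
    # central-difference Hessian of f at t
    k = len(t)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(t + ei + ej) - f(t + ei - ej)
                       - f(t - ei + ej) + f(t - ei - ej)) / (4 * h * h)
    return H

I_hat = hessian(negloglik, theta_hat)
print(np.diag(I_hat))                            # numerical observed information
print(r / sig2_hat, p * r / (2 * sig2_hat**2))   # closed-form diagonal entries
```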

  3. Different observations can have different observed information content.

Example 1.3

Suppose each independent observation, $X_i,\ i=1,\ldots,n$, has probability 1/2 of arising from the $N(\theta,1)$ distribution and probability 1/2 of arising from the $N(\theta,1000)$ distribution. Clearly half the observations are essentially worthless, and the ‘effective sample size’ is $n/2$.

Example 1.4

Findley's BIC counterexample

One of the famous counterexamples against inappropriate use of BIC is in Findley (1991). Suppose the observations are
\[ X_i=\frac{1}{\sqrt{i}}\,\theta+\varepsilon_i,\quad\text{where } \varepsilon_i\sim N(0,1),\ i=1,\ldots,n, \tag{2} \]
and we are comparing the models $H_0:\theta=0$ and $H_1:\theta\neq0$. It turns out that the mle for $\theta$ is consistent under $H_1$ (a necessary condition to apply BIC), but that BIC is inconsistent if $0<|\theta|<1$, in that BIC will then declare $H_0$ to be the true model as $n\to\infty$. The problem here is that, even though the information about $\theta$ goes to $\infty$ as $n$ grows, it grows much more slowly than $n$ (actually, the information grows at roughly a $\log n$ rate), and BIC erroneously assigns the rate to be $n$.
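A small simulation sketch (ours) makes the inconsistency concrete: with $0<|\theta|<1$, $\Delta \mathrm{BIC}=\mathrm{BIC}_1-\mathrm{BIC}_0$ becomes positive for large $n$, so BIC selects $H_0$ even though $\theta\neq0$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.6                                    # true non-zero value, |theta| < 1
for n in [10**3, 10**5, 10**7]:
    i = np.arange(1, n + 1)
    x = theta / np.sqrt(i) + rng.normal(size=n)
    info = np.sum(1.0 / i)                     # information about theta, ~ log n
    theta_hat = np.sum(x / np.sqrt(i)) / info  # mle under H_1
    # -2*(l(theta_hat) - l(0)) = -theta_hat^2 * info; BIC adds log(n) for H_1
    delta_bic = -theta_hat**2 * info + np.log(n)
    print(n, round(delta_bic, 2))              # positive: BIC wrongly picks H_0
```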

Problem 3. What is $p$?

Just as $n$ is often not clearly defined for use in BIC, the parameter dimension $p$ is often not clearly defined (see also Pauler, 1998).

Example 1.5

Random effect group means

Consider hierarchical or random effect versions of the group means problem, where it is assumed that $\mu_i\sim N(\xi,\tau^2)$, with $\xi$ and $\tau^2$ being unknown. The number of parameters might appear to be $p+3$ (the $p$ means, along with $\sigma^2$, $\xi$ and $\tau^2$), but one could, alternatively, integrate out $\mu=(\mu_1,\ldots,\mu_p)$ (since it has a known distribution), obtaining
\[ f(x\mid\sigma^2,\xi,\tau^2)=\int f(x\mid\mu,\xi,\sigma^2)\,\pi(\mu\mid\xi,\tau^2)\,d\mu \propto \frac{1}{\sigma^{p(r-1)}}\exp\Big(-\frac{pr\hat\sigma^2}{2\sigma^2}\Big)\prod_{i=1}^p\Big(\frac{\sigma^2}{r}+\tau^2\Big)^{-1/2}\exp\bigg(-\frac{(\bar x_i-\xi)^2}{2(\sigma^2/r+\tau^2)}\bigg). \]
The marginal likelihood will be the integral of this with respect to a prior $\pi(\sigma^2,\xi,\tau^2)$, so that, if one is really viewing BIC as an approximation to the marginal likelihood, it would be correct to set $p=3$.

Problem 4. What if $p$ grows with $n$?

BIC is based on an asymptotic argument with $p$ fixed and $n$ growing, but often $p$ is growing with $n$; BIC then does not apply. If one were to erroneously apply BIC in such a situation, one could end up with inconsistency, as shown by Stone (1979) for the group means example, with known variance $\sigma^2=1$ for simplicity. Indeed, in comparing the models $H_0:\mu=0$ and $H_1:\mu\neq0$ for the group means problem with $r=2$,
\[ \Delta \mathrm{BIC}=\mathrm{BIC}_1-\mathrm{BIC}_0=-2\sum_{i=1}^p\bar x_i^2+p\log 2, \]
which, under $H_0$, behaves like $p(\log 2-1)$ as $p$ grows, thus incorrectly selecting model $H_1$.
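Again a simulation sketch (ours): under $H_0$, $\Delta \mathrm{BIC}$ drifts to $-\infty$ at the rate $p(\log 2-1)$, so BIC increasingly favours the (false) model $H_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
r = 2
for p in [10, 100, 10000]:
    xbar = rng.normal(scale=np.sqrt(1 / r), size=p)  # xbar_i | H_0 ~ N(0, 1/r)
    delta_bic = -2 * np.sum(xbar**2) + p * np.log(2)
    print(p, round(delta_bic, 1), round(p * (np.log(2) - 1), 1))
```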

1.3. Variants of BIC

Noting the limitations of BIC, researchers have proposed a host of generalisations, many of which perform better than BIC in specific scenarios. Many of these methods arise from retaining different numbers of terms in the Laplace approximation of the Bayes factor (Kass & Raftery, 1995). One variant, called HBIC (Haughton, 1988), retains the third term in the Laplace approximation of the Bayes factor. A simulation study by Haughton, Oud, and Jansen (1997) shows that HBIC performs better in model selection for structural equation models than does the usual BIC. Following HBIC, Bollen, Ray, Zavisca, and Harden (2012) developed a similar criterion, called the information matrix-based Bayesian information criterion (IBIC), which retains more terms in the Bayes factor approximation and outperforms BIC and HBIC in many scenarios. Bollen et al. (2012) also proposed another criterion, named the scaled unit information prior criterion (SPBIC), which generalises the interpretation of the unit information prior in the context of BIC. For approximation of Bayes factors as the model dimension grows, Berger, Ghosh, and Mukhopadhyay (2003) proposed another approximation, named GBIC. Following Berger et al. (2003), a generalisation of BIC for the general exponential family was proposed by Chakrabarti and Ghosh (2006), and a new BIC for change point analysis was proposed by Shen and Ghosh (2011). Other extensions of BIC include techniques for comparing graphical models (Foygel & Drton, 2010), singular models (Drton & Plummer, 2017), and sparse models (Zak-Szatkowska & Bogdan, 2011).

1.4. Overview of the paper

Section 2 presents a proposal to generalise BIC, in order to overcome the problems mentioned above. It is based on the use of a specific (robust) prior distribution in the computation of the approximate marginal likelihood of a model. Section 3 discusses a critical aspect of the definition of PBIC, namely the need to determine the ‘effective sample size’ corresponding to each parameter in a model. Section 4 presents an alternative called PBIC*. It employs an empirical Bayes prior in computation of the marginal likelihood approximation, resulting in answers more favourable to complex models. Section 5 illustrates the use of PBIC and PBIC* in the normal linear model; it is of interest that PBIC and PBIC* correspond to exact marginal likelihoods here. Illustrations in the section are simple linear regression, testing the equality of normal means with known unequal variances, Findley's counterexample, and the group means problem, where consistency results for PBIC and PBIC* are established as $p\to\infty$.

2. The PBIC solution

We propose a solution to these problems that depends only on software that can compute mle's and observed information matrices. The basis of the solution is a modified Laplace approximation to $m(x)$ for reasonable default priors.

2.1. Two important preliminaries

One should analytically integrate out any parameter that has a distribution given other parameters, if it is possible to do so. For example, in the hierarchical group means example, base the analysis on the marginal likelihood $f(x\mid\sigma^2,\xi,\tau^2)$, rather than the full likelihood.

We will be utilising the Laplace approximation, which is most accurate (Kass & Vaidyanathan, 1992; Tierney, Kass, & Kadane, 1989) if the parameter space is transformed to be all of $\mathbb{R}^p$. Transformation to $\mathbb{R}^p$ will also be necessary for the subsequent step of the analysis. As an illustration, in the (non-multilevel) group means example, transform to $\nu=\log\sigma^2$. Then $\theta=(\mu_1,\ldots,\mu_p,\nu)\in\mathbb{R}^{p+1}$. Note that one then works with the transformed mle $\log\hat\sigma^2$ and the transformed observed information matrix
\[ \hat I(\mu,\nu)=\begin{pmatrix}\frac{r}{\hat\sigma^2}I_{p\times p} & 0\\ 0 & \frac{pr}{2}\end{pmatrix}. \]
In the multilevel group means example, both $\sigma^2$ and $\tau^2$ would need to be transformed in this fashion.

2.2. PBIC and PBIC* definitions

Suppose $\theta=(\theta_{(1)},\theta_{(2)})$, where $\theta_{(2)}$ denotes the parameters that are common to all models under consideration (e.g. an intercept in linear regression). Changing notation, let $p$ denote the dimension of $\theta_{(1)}$ and $q$ denote the dimension of $\theta_{(2)}$; note that $p$ will typically vary from model to model, while $q$ is fixed. Partition the observed information matrix for a model accordingly, as
\[ \hat I=\begin{pmatrix}\hat I_{11} & \hat I_{12}\\ \hat I_{21} & \hat I_{22}\end{pmatrix},\quad\text{and define}\quad \Sigma^{-1}=\hat I_{11}-\hat I_{12}\hat I_{22}^{-1}\hat I_{12}^t. \tag{3} \]
(If there are no parameters common to all models, then $\Sigma=\hat I^{-1}$.) Change variables to $\xi=O\theta_{(1)}$, where $O$ is an orthogonal matrix such that $\Sigma=O^tDO$, with $D=\mathrm{diag}\{d_i\}$ for $i=1,\ldots,p$, and define $\hat\xi=O\hat\theta_{(1)}$ (the transformed mle). The choice of $O$ does not affect the definition below. For each transformed parameter $\xi_j$, let $n_j^e$ be the effective sample size corresponding to that parameter. This is the most difficult aspect of the construction, but it equals the intuitive choices of parameter sample size discussed in the earlier examples; formal definitions will be presented in Section 3. Then PBIC is defined as
\[ \mathrm{PBIC}\equiv -2\,l(\hat\theta)+\log|\hat I_{22}|+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg), \tag{4} \]
where $v_i=\hat\xi_i^2/[d_i(1+n_i^e)]$. For a certain natural prior distribution, PBIC will be shown to be accurate, as an approximation to $-2\log m(x)$, up to an $o(1)$ term as $n\to\infty$ (for fixed dimension $p$). Note that, if there are no parameters common to all models, then
\[ \mathrm{PBIC}=-2\,l(\hat\theta)+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg). \tag{5} \]
In the classic case considered by Schwarz, all $n_i^e$ would equal a common $n$, and the first two terms in this expression are then BIC (up to an $o(1)$ term); the ‘constant’ ignored by BIC is the final term in (5).

To summarise results in one place, here is the alternative version of the approximation, one which is more favourable to complex models; its development is given in Section 4:
\[ \mathrm{PBIC}^*\equiv -2\,l(\hat\theta)+\log|\hat I_{22}|+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\bigg). \tag{6} \]
Note that, if dealing with only normal mean parameters, PBIC and PBIC* are exact as approximations to $-2\log m(x)$, as discussed below. This means, for instance, that when dealing with $p\to\infty$, there would be no need to worry about the accuracy of the approximations.
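The quantities in (4)–(6) are all available from standard software output. The following sketch (ours, a direct transcription of the formulas as reconstructed above, not the authors' code) computes PBIC and PBIC* from the maximised log-likelihood, $\log|\hat I_{22}|$, and the per-parameter $d_i$, $\hat\xi_i$ and $n_i^e$.

```python
import numpy as np

def pbic(loglik_hat, logdet_I22, d, xi_hat, ne, star=False):
    """PBIC (4) or, with star=True, PBIC* (6); pass logdet_I22 = 0 when
    there are no parameters common to all models (giving (5))."""
    d, xi_hat, ne = map(np.asarray, (d, xi_hat, ne))
    v = xi_hat**2 / (d * (1 + ne))
    # note: as v -> 0, (1 - e^{-v})/(sqrt(2) v) -> 1/sqrt(2); guard if v can be 0
    if star:
        m = np.minimum(v, 1.3)
        last = np.log((1 - np.exp(-m)) / np.sqrt(2 * v * m))
    else:
        last = np.log((1 - np.exp(-v)) / (np.sqrt(2) * v))
    return -2 * loglik_hat + logdet_I22 + np.sum(np.log(1 + ne)) - 2 * np.sum(last)
```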

Here are the steps in the derivation of PBIC.

2.2.1. Laplace approximation

By a Taylor's series expansion of $l(\theta)$ about the mle $\hat\theta$, with $\nabla$ denoting the gradient and $\hat I$ being the observed information matrix as defined earlier,
\[ \begin{aligned} m(x)&=\int f(x\mid\theta)\pi(\theta)\,d\theta=\int e^{l(\theta)}\pi(\theta)\,d\theta\\ &=\int\exp\Big(l(\hat\theta)+(\theta-\hat\theta)^t\nabla l(\hat\theta)-\tfrac12(\theta-\hat\theta)^t\hat I(\theta-\hat\theta)\Big)\pi(\theta)\,d\theta\,(1+o(1))\\ &=e^{l(\hat\theta)}\int e^{-(1/2)(\theta-\hat\theta)^t\hat I(\theta-\hat\theta)}\pi(\theta)\,d\theta\,(1+o(1)), \end{aligned} \tag{7} \]
where $o(1)$ denotes a term that goes to zero as the sample size $n$ grows. Technical conditions for the validity of this Laplace approximation can be found in, e.g. Tierney et al. (1989) and Kass and Vaidyanathan (1992); the key assumption needed is that $\hat\theta$ occurs on the interior of the parameter space, so that $\nabla l(\hat\theta)=0$. (If this is not true, the analysis must proceed as in Dudley & Haughton, 1997; Haughton, 1991, 1993.) Also, the presence of $o(1)$ assumes that $p$ is fixed as $n$ grows. We will nevertheless use this approximation, even as $p$ grows with $n$, relying on the considerable evidence that the Laplace approximation is quite generally accurate.

Note that we do not use the more common version of the Laplace expansion, which involves $\pi(\theta)$ in the Taylor's expansion, because we will be choosing $\pi(\theta)$ so that the integral in this expression can be evaluated in closed form. In particular, this means that, if we are dealing with the situation where $\theta$ is the mean parameter of a normal model, then the computations herein will be entirely closed form, with no approximation being involved (and no need to then worry about $p$ growing with $n$).

2.2.2. Choosing a good prior $\pi(\theta)$

Assume that the transformations in Section 2.1 have been made.

Step 1. Recall that $\theta=(\theta_{(1)},\theta_{(2)})$, where $\theta_{(2)}$ denotes the parameters common to all models. We will utilise a prior distribution of the form $\pi(\theta)=(2\pi)^{-q/2}\,\pi(\theta_{(1)})$, where $\pi(\theta_{(1)})$ is defined later. The key point is that, since $\theta_{(2)}$ is common to all models, it can be assigned a constant prior density (see, e.g. Bayarri, Berger, Forte, & García-Donato, 2012; Berger, Pericchi, & Varshavsky, 1998); choosing the constant to be $(2\pi)^{-q/2}$ simplifies the resulting expression. With the definitions given in (3), integrating out $\theta_{(2)}$ results in the expression
\[ \begin{aligned} m(x)&=e^{l(\hat\theta)}\int\bigg[\int\exp\Big(-\tfrac12(\theta-\hat\theta)^t\hat I(\theta-\hat\theta)\Big)(2\pi)^{-q/2}\,d\theta_{(2)}\bigg]\pi(\theta_{(1)})\,d\theta_{(1)}\,(1+o(1))\\ &=e^{l(\hat\theta)}|\hat I|^{-1/2}\int\exp\Big(-\tfrac12(\theta_{(1)}-\hat\theta_{(1)})^t\Sigma^{-1}(\theta_{(1)}-\hat\theta_{(1)})\Big)|\Sigma|^{-1/2}\,\pi(\theta_{(1)})\,d\theta_{(1)}\,(1+o(1)). \end{aligned} \]

Step 2. Change variables to $\xi=O\theta_{(1)}$, where $O$ is an orthogonal matrix such that $\Sigma=O^tDO$, with $D=\mathrm{diag}(d_i)$ for $i=1,\ldots,p$. (The choice of $O$ does not matter in the following.) Note that $\hat\xi=O\hat\theta_{(1)}$.

For this model, we will utilise a prior distribution that is independent in the $\xi_i$, i.e. $\pi(\xi)=\prod_{i=1}^p\pi_i(\xi_i)$. Then we can write
\[ m(x)=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i}}\int e^{-(\xi_i-\hat\xi_i)^2/2d_i}\,\pi_i(\xi_i)\,d\xi_i\times(1+o(1)). \tag{8} \]
For $\pi_i(\xi_i)$, in a similar situation, Jeffreys (1961) recommended the Cauchy$(0,b_i)$ density $(1/\pi\sqrt{b_i})(1/(1+\xi_i^2/b_i))$, where $b_i$ is chosen to represent unit information for $\xi_i$ (see Kass & Wasserman, 1995; also to be discussed later). A prior that yields almost the same results is
\[ \pi_i^R(\xi_i)=\int_0^1 N\bigg(\xi_i\,\Big|\,0,\ \frac{1}{2\lambda_i}(d_i+b_i)-d_i\bigg)\frac{1}{2\sqrt{\lambda_i}}\,d\lambda_i, \tag{9} \]
which is well-defined if $b_i\geq d_i$. Interestingly, this prior is very similar to the Cauchy prior no matter what $d_i$ happens to be (as shown in the Appendix), so we will interpret this prior (and $b_i$) exactly as we would the Cauchy prior. The attraction of $\pi^R$ is that the ensuing computations can be done in closed form. That one can have all the advantages that Jeffreys pointed out are possessed by the Cauchy prior for model selection, while maintaining closed-form expressions, is a significant advantage when dealing with large model spaces. This prior was extensively discussed in Berger (1985), as a robust prior (hence the R label) for estimation problems, but its even greater value for model selection was not recognised. (This type of prior was first utilised in Strawderman (1971) in shrinkage estimation.) See also Bayarri et al. (2012), where a multivariate version of this prior is utilised for model selection in normal linear models.

With the prior in (9), the integral in (8) is straightforward to evaluate in closed form (first integrate over $\xi_i$, then over $\lambda_i$), yielding
\[ m(x)=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i+b_i}}\cdot\frac{1-e^{-\hat\xi_i^2/(d_i+b_i)}}{\sqrt2\,\hat\xi_i^2/(d_i+b_i)}\times(1+o(1)). \tag{10} \]

Step 3. Define the unit information, $b_i$, by
\[ b_i=n_i^e\,d_i,\quad\text{where } n_i^e=\text{effective sample size for } \xi_i,\ \text{and recall } v_i=\frac{\hat\xi_i^2}{d_i(1+n_i^e)}. \tag{11} \]
Definitions of the effective sample size will be given in Section 3. It will be the case that $n_i^e\geq1$, so that $b_i\geq d_i$ (the condition mentioned earlier for $\pi^R$ to be well defined). Then (10) becomes
\[ m(x)=e^{l(\hat\theta)}|\hat I|^{-1/2}|D|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{1+n_i^e}}\cdot\frac{1-e^{-v_i}}{\sqrt2\,v_i}\,(1+o(1)). \]
Since $|\hat I|=|\Sigma^{-1}||\hat I_{22}|=|D^{-1}||\hat I_{22}|$, we thus have that $-2\log m(x)=\mathrm{PBIC}+o(1)$, with PBIC defined in (4).
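As a sanity check on the closed form, the per-coordinate factor in (10) can be compared with direct numerical integration of the mixture representation (9). The sketch below (ours; it uses scipy quadrature and arbitrary illustrative values of $d$, $b$ and $\hat\xi$) substitutes $\lambda=t^2$ so that the mixing density becomes uniform in $t$.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

d, b, xi_hat = 0.5, 4.0, 1.7                    # illustrative values with b >= d

def integrand(t):
    # lambda = t^2; the xi-convolution N(xi_hat; xi, d) * N(xi; 0, (d+b)/(2 lam) - d)
    # has already been done analytically, giving N(xi_hat; 0, (d+b)/(2 lam)).
    return norm.pdf(xi_hat, 0.0, np.sqrt((d + b) / (2 * t * t)))

marginal, _ = quad(integrand, 1e-12, 1.0)
v = xi_hat**2 / (d + b)
closed = (1 - np.exp(-v)) / (np.sqrt(2) * v * np.sqrt(d + b))
print(np.sqrt(2 * np.pi) * marginal, closed)    # the two numbers should agree
```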

3. Defining ‘effective sample size’ $n_j^e$ for parameter $\xi_j$

The most difficult aspect of dealing with PBIC turns out to be defining the effective sample size corresponding to a parameter. We first present a solution for linear models, and then suggest a possible solution for the general case.

3.1. Effective sample sizes in linear models

Suppose that all models under consideration are linear models of the form
\[ Y=X^*\alpha+X\beta+\varepsilon,\quad\text{where } \varepsilon\sim N(0,\Gamma),\ \ \Gamma \text{ known}, \tag{12} \]
with dimensions $Y\,[n\times1]$, $X^*\,[n\times q]$, $\alpha\,[q\times1]$, $X\,[n\times p]$, $\beta\,[p\times1]$, $\varepsilon\,[n\times1]$ and $\Gamma\,[n\times n]$. Here $X^*\alpha$ is a common term present in all models (e.g. an intercept in linear regression), but $X\beta$ will differ from model to model. This fits into the framework for PBIC by defining $\theta_{(1)}=\beta$ and $\theta_{(2)}=\alpha$.

Since $\alpha$ will be integrated out in PBIC, only the effective sample size for linear functions of $\beta$ will be needed. The first step of the process is to orthogonalise the parameters by transforming $\alpha$ to $\alpha^*=\alpha+(X^{*t}\Gamma^{-1}X^*)^{-1}X^{*t}\Gamma^{-1}X\beta$ and defining
\[ \tilde X=\big(I-X^*(X^{*t}\Gamma^{-1}X^*)^{-1}X^{*t}\Gamma^{-1}\big)X. \tag{13} \]
Since $X^*\alpha^*+\tilde X\beta=X^*\alpha+X\beta$, the linear part of the model has not changed in this reparameterisation, but now $\tilde X^t\Gamma^{-1}X^*=0$, so that $\alpha^*$ and $\beta$ are orthogonal. There are two important aspects of this. First, since $X^*$ has not been altered, the new $\alpha^*$ can still be considered common parameters in each model, and will be integrated out in PBIC, so that their changed definition is irrelevant. Second, $\beta$ has not been transformed, which is crucial because we wish to define effective sample sizes for linear functions of $\beta$.

Write $\Gamma=\sigma R\sigma$, with $\sigma=\mathrm{diag}\{\sigma_1,\ldots,\sigma_n\}$, where $R$ is the correlation matrix, and define $C\,[p\times p]$ to be the diagonal matrix with entries $c_{ii}=\max_j\{|\tilde X_{ji}|/\sigma_j\}$. Berger, Bayarri, and Pericchi (2014) gave, as the general definition of the effective sample size (called TESS), for any scalar linear transformation $\xi=v\beta$ ($v$ is $[1\times p]$) of $\beta$,
\[ n^e=\frac{|v|^2}{vC(\tilde X^t\Gamma^{-1}\tilde X)^{-1}Cv^t}. \tag{14} \]
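In code, (14) is immediate once $\tilde X$, $\Gamma$ and $C$ are formed. The sketch below (our transcription) also checks it on a two-group design, where TESS for each group mean should equal that group's replicate count (cf. Example 3.1 below).

```python
import numpy as np

def tess(v, Xt, Gamma):
    """Effective sample size (14) for xi = v beta; Xt is the orthogonalised
    design X-tilde and Gamma the error covariance."""
    v = np.atleast_2d(v).astype(float)
    sig = np.sqrt(np.diag(Gamma))                        # per-observation sd's
    C = np.diag(np.max(np.abs(Xt) / sig[:, None], axis=0))
    M = np.linalg.inv(Xt.T @ np.linalg.inv(Gamma) @ Xt)
    return float(np.sum(v**2)) / (v @ C @ M @ C @ v.T).item()

# Two groups, r_1 = 3 and r_2 = 5 replicates, Gamma = sigma^2 I:
Xt = np.zeros((8, 2)); Xt[:3, 0] = 1.0; Xt[3:, 1] = 1.0
Gamma = 2.0 * np.eye(8)
print(tess([1, 0], Xt, Gamma), tess([0, 1], Xt, Gamma))  # 3.0 and 5.0
```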

Example 3.1

Group means example

Assume $Y_{ij}=\mu_i+\varepsilon_{ij}$ for $i=1,\ldots,p$ groups and $j=1,\ldots,r_i$ replicates in the $i$th group, and that the $\varepsilon_{ij}$ are i.i.d. $N(0,\sigma^2)$. Computation yields that TESS for $\mu_i$ is $n_i^e=r_i$, as is to be expected. Note that $r_i$ could be 1, which can be seen to be the lower bound on TESS for linear models when $\Gamma=\sigma^2 I$.

Example 3.2

Orthogonal and related designs

Assume that $X$ has orthogonal columns with entries $\pm a_i\neq0$, and that $\Gamma=\sigma^2 I$. Simple computation here shows that $n_i^e=n$ for each $\beta_i$.

Note that the effective sample size here is $n$, in contrast to the group means problem where the effective sample size can be as low as $r_i=1$. Indeed, it can be shown that, when $\Gamma=\sigma^2 I$, TESS will always be between 1 and $n$, with both limits attainable.

Example 3.3

Heteroscedastic independent observations

Assume $Y_i=\mu+\varepsilon_i$, with the $\varepsilon_i$ independent and $\varepsilon_i\sim N(0,\sigma_i^2)$, $i=1,\ldots,n$. Here the effective sample size is
\[ n^e=\frac{\sum_{i=1}^n 1/\sigma_i^2}{\max_i\{1/\sigma_i^2\}}. \]
Consider the particular case where, for $i=1,\ldots,n_1$, we have $Y_i\sim N(\mu,\sigma_1^2)$, whereas for the remaining $n_2=n-n_1$ observations, $Y_i\sim N(\mu,\sigma_2^2)$, where $\sigma_2^2$ is much larger than $\sigma_1^2$; thus, intuitively, only the first $n_1$ observations count. Then, unless $n_2$ is very large,
\[ n^e=\frac{n_1/\sigma_1^2+n_2/\sigma_2^2}{1/\sigma_1^2}=n_1+n_2\frac{\sigma_1^2}{\sigma_2^2}\approx n_1. \]
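Numerically (an illustration of ours, with $n_1$ precise and $n_2$ very noisy observations):

```python
import numpy as np

n1, n2, s1, s2 = 20, 30, 1.0, 30.0               # sigma_2^2 >> sigma_1^2
prec = np.r_[np.full(n1, 1 / s1**2), np.full(n2, 1 / s2**2)]
ne = prec.sum() / prec.max()                     # = n1 + n2 * s1^2 / s2^2
print(ne)                                        # ~ 20.03, essentially n1
```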

3.2. A general definition of effective sample size

Suppose one has independent observations $(x_1,\ldots,x_n)$. A possible general definition of the ‘effective sample size’ follows from considering the information associated with observation $x_i$ arising from the single-observation expected information matrix $I_i=O(I_{i,jk})O^t$, where
\[ I_{i,jk}=E\bigg[-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f_i(x_i\mid\theta)\bigg]\bigg|_{\theta=\hat\theta}. \]
Since $I_{jj}=\sum_{i=1}^n I_{i,jj}$ is the expected information about $\xi_j$, a reasonable way to define the effective sample size, $n_j^e$, is to

  • define information weights $w_{ij}=I_{i,jj}/\sum_{k=1}^n I_{k,jj}$;

  • define the effective sample size for $\xi_j$ as
\[ n_j^e=\frac{I_{jj}}{\sum_{i=1}^n w_{ij}I_{i,jj}}=\frac{I_{jj}^2}{\sum_{i=1}^n I_{i,jj}^2}. \]

Intuitively, $\sum_i w_{ij}I_{i,jj}$ is a weighted measure of the information ‘per observation’, and dividing the total information about $\xi_j$ by this information per case seems plausible as an effective sample size.

Unfortunately, this does not seem to always be an effective definition; for instance, it does not reduce to TESS for all linear models. This should thus be viewed as primarily a starting point for future investigations of effective sample size in nonlinear models.
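As a starting point nonetheless, here is the definition in code (a sketch of ours), applied to the mixture intuition of Example 1.3 by treating half the observations as carrying per-observation information 1 about $\theta$ and half as carrying information $1/1000$.

```python
import numpy as np

n = 100
I_i = np.r_[np.ones(n // 2), np.full(n // 2, 1e-3)]  # per-observation I_{i,jj}
ne = I_i.sum()**2 / np.sum(I_i**2)                   # n_j^e = I_jj^2 / sum_i I_{i,jj}^2
print(ne)                                            # ~ 50.1, close to n/2
```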

4. PBIC*: a version more favourable to complex models

Recall, from Raftery (1999), that BIC can be thought of as arising from unit information priors for each model that are centred at the model likelihood. This choice of prior seems highly favourable to more complex models, since the prior gives virtually all of its mass to a modest neighbourhood of the likelihood for each model.

In contrast, PBIC utilises unit information priors that are centred at 0 and, hence, can give little mass to the region of high model likelihood. The fat tails of the prior do result in reasonable answers (cf. Bayarri et al., 2012; Jeffreys, 1961), but it is of interest to investigate an intermediate solution.

The intermediate solution is to keep the prior centred at 0, but choose the scales of the prior, $b_i$, so that the prior will extend out to the likelihood. In our setup, this can be implemented by choosing the $b_i$ so as to maximise $m(x)$ in (10); thus we are effectively choosing the prior in our class that is most favourable to each model. Clearly this does allow the prior to give more mass to the region of high model likelihood, but does not allow complete concentration of mass in this region.

Since this prior is maximising the marginal likelihood among the given class, it can be viewed as the empirical Bayes prior in the class. It was also a choice popularised in the ‘robust Bayes’ literature (cf. Berger & Sellke, 1987), and was used in Bollen et al. (2012) to develop a related generalisation of BIC.

The $b_i$ that maximises (10) can easily be seen to be
\[ \hat b_i=\max\Big\{d_i,\ \frac{\hat\xi_i^2}{w}-d_i\Big\},\quad\text{with } w \text{ s.t. } e^{w}-1=2w,\ \text{or } w\approx1.3. \]
Unfortunately, the resulting version of BIC has serious problems; in particular, it will typically not be consistent as $n\to\infty$, in that, if $\xi_i$ is zero, the prior will concentrate about zero at such a fast rate that the models with and without $\xi_i$ are essentially equivalent (and one will fail to select the model without $\xi_i$ with probability approaching 1). This same lack of consistency afflicts the developments in Bollen et al. (2012) and the robust Bayesian choices.

The obvious solution is simply to prevent $\hat b_i$ from becoming too small, and the obvious constraint is to restrict it to the region $[n_i^e d_i,\infty)$. This yields the recommended choice
\[ b_i^*=\max\Big\{n_i^e d_i,\ \frac{\hat\xi_i^2}{1.3}-d_i\Big\}. \tag{15} \]
This will avoid inconsistency as $n\to\infty$ in that, as long as $b_i^*\to c$ with $c$ a non-zero constant, the resulting prior behaves asymptotically when $\xi_i=0$ as a fixed prior, and fixed priors will yield consistency as $n\to\infty$. (Consistency when the effective sample sizes do not grow is a more delicate issue, discussed in Section 5.5.)

Replacing $b_i$ with $b_i^*$, (10) becomes
\[ \begin{aligned} m(x)&=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i(1+n_i^e)\max\{1,v_i/1.3\}}}\cdot\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt2\,\min\{v_i,1.3\}}\,(1+o(1))\\ &=e^{l(\hat\theta)}|\hat I|^{-1/2}\prod_{i=1}^p\frac{1}{\sqrt{d_i(1+n_i^e)}}\cdot\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\,(1+o(1)). \end{aligned} \]
The resulting approximation to $-2\log m(x)$ is given in (6).

5. PBIC and PBIC* for the linear model

5.1. The expressions

Consider the normal linear model framework in (12) and assume the orthogonalisation discussed there has been carried out. This does not change PBIC, but is more convenient because we can ignore the common orthogonal parameter $\alpha^*$ (see Bayarri et al., 2012; Berger et al., 1998; Jeffreys, 1961 for justification) and focus only on the other parameters $\beta$, with the associated model
\[ Y=\tilde X\beta+\varepsilon,\quad\text{where } \varepsilon\sim N(0,\Gamma),\ \Gamma \text{ known}, \tag{16} \]
with $\tilde X$ given by (13).

Following the PBIC algorithm, note that $\Sigma^{-1}=\tilde X^t\Gamma^{-1}\tilde X$. Change variables to $\xi=O\beta$, where $O$ is an orthogonal matrix such that $\Sigma=O^tDO$, with $D=\mathrm{diag}(d_i)$ for $i=1,\ldots,p$. Then, for each $\xi_j=O_j\beta$, define $n_j^e$ using (14) with $v=O_j$, and let $\hat\xi_j=O_j\hat\beta$, where $\hat\beta=(\tilde X^t\Gamma^{-1}\tilde X)^{-1}\tilde X^t\Gamma^{-1}Y$. Finally, recalling that $v_i=\hat\xi_i^2/[d_i(1+n_i^e)]$, PBIC and PBIC* are given by
\[ \mathrm{PBIC}=S^2+C+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg), \tag{17} \]
\[ \mathrm{PBIC}^*=S^2+C+\sum_{i=1}^p\log(1+n_i^e)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\bigg), \tag{18} \]
where $S^2$ is the usual residual sum of squares corresponding to (12) and $C=\log(|\Gamma|)+n\log(2\pi)$. Note that $C$ is the same constant in any model under consideration, and hence it can be ignored in comparing models or determining Bayes factors.

In what follows, we describe some important linear model examples. There are more, including correlated observations and autoregressive models, in Berger et al. (2014).

5.2. Simple linear regression

Let $Y_i=\alpha+X_i\beta+\varepsilon_i$, with the $\varepsilon_i$ i.i.d. $N(0,\sigma^2)$, so that
\[ Y=\begin{pmatrix}1 & X_1\\ \vdots & \vdots\\ 1 & X_n\end{pmatrix}\begin{pmatrix}\alpha\\ \beta\end{pmatrix}+\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix},\quad\text{where } \varepsilon\sim N(0,\sigma^2 I). \]
Suppose we are considering the two models $M_0:\beta=0$ and $M_1:\beta\neq0$. Computation under $M_1$ yields $\tilde X=(X_1-\bar X,\ldots,X_n-\bar X)^t$, so that $\Sigma=\sigma^2/s_x^2$, where $s_x^2=\sum_{i=1}^n(X_i-\bar X)^2$. Also, from (14),
\[ n^e=\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\max_i\{(X_i-\bar X)^2\}}=\frac{s_x^2}{\max_i\{(X_i-\bar X)^2\}}. \tag{19} \]
Finally, $d=\Sigma=\sigma^2/s_x^2$, $v=\hat\beta^2/[d(1+n^e)]$, and
\[ S^2=\frac{1}{\sigma^2}\bigg[|Y|^2-\frac{\big(\sum_{i=1}^n(x_i-\bar x)y_i\big)^2}{\sum_{i=1}^n(x_i-\bar x)^2}\bigg]=\frac{1}{\sigma^2}\big(|Y|^2-s_x^2\hat\beta^2\big) \]
complete the terms needed to define PBIC and PBIC* under $M_1$. Under $M_0$, we only need $S^2=(1/\sigma^2)|Y|^2$; thus, with $v=\hat\beta^2/\big[\sigma^2\big(s_x^{-2}+(\max_i\{(X_i-\bar X)^2\})^{-1}\big)\big]$,
\[ \Delta \mathrm{PBIC}=-\frac{s_x^2\hat\beta^2}{\sigma^2}+\log\bigg(1+\frac{s_x^2}{\max_i\{(X_i-\bar X)^2\}}\bigg)-2\log\bigg(\frac{1-e^{-v}}{\sqrt2\,v}\bigg). \]
$\Delta \mathrm{PBIC}^*$ is the obvious modification of this.
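Putting the pieces together, here is an end-to-end sketch (ours, on simulated data) of $\Delta \mathrm{PBIC}$ for simple linear regression, using (19) and the expressions above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, beta_true = 40, 1.0, 0.5
X = rng.uniform(0, 10, size=n)
Y = 1.0 + beta_true * X + rng.normal(scale=np.sqrt(sigma2), size=n)

xc = X - X.mean()                      # X-tilde after orthogonalising the intercept
sx2 = np.sum(xc**2)
beta_hat = np.sum(xc * Y) / sx2
ne = sx2 / np.max(xc**2)               # effective sample size (19)
d = sigma2 / sx2
v = beta_hat**2 / (d * (1 + ne))
delta_pbic = (-sx2 * beta_hat**2 / sigma2 + np.log(1 + ne)
              - 2 * np.log((1 - np.exp(-v)) / (np.sqrt(2) * v)))
print(delta_pbic)                      # strongly negative: evidence for M_1
```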

5.3. Testing equality of two means with unequal variances

Consider comparing two normal means via the test $H_0:\mu_1=\mu_2$ versus $H_1:\mu_1\neq\mu_2$, where the associated known variances, $\sigma_1^2$ and $\sigma_2^2$, are not equal. The linear model is thus
\[ Y=X\mu+\varepsilon=\begin{pmatrix}1 & 0\\ \vdots & \vdots\\ 1 & 0\\ 0 & 1\\ \vdots & \vdots\\ 0 & 1\end{pmatrix}\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}+\begin{pmatrix}\varepsilon_{11}\\ \vdots\\ \varepsilon_{2n_2}\end{pmatrix},\quad \varepsilon\sim N\big(0,\ \mathrm{diag}\{\sigma_1^2,\ldots,\sigma_1^2,\sigma_2^2,\ldots,\sigma_2^2\}\big), \]
with $n_1$ replicates in the first group and $n_2$ in the second. Defining $\alpha=(\mu_1+\mu_2)/2$ and $\beta=(\mu_1-\mu_2)/2$ places this in the linear model comparison framework, where we are comparing $M_0:\beta=0$ versus $M_1:\beta\neq0$ with the covariate matrix
\[ B=X\begin{pmatrix}1 & 1\\ 1 & -1\end{pmatrix}=\begin{pmatrix}1 & 1\\ \vdots & \vdots\\ 1 & 1\\ 1 & -1\\ \vdots & \vdots\\ 1 & -1\end{pmatrix}. \]
Under $M_1$, computation yields
\[ \tilde X=\Big(\frac{n_2 n^*}{\sigma_2^2},\ldots,\frac{n_2 n^*}{\sigma_2^2},\ -\frac{n_1 n^*}{\sigma_1^2},\ldots,-\frac{n_1 n^*}{\sigma_1^2}\Big)^t,\quad\text{with } n^*=\Big(\frac{n_1}{\sigma_1^2}+\frac{n_2}{\sigma_2^2}\Big)^{-1}, \]
so that $d=\Sigma=\sigma_1^2/n_1+\sigma_2^2/n_2$. Also, from (14),
\[ n^e=\frac{\sigma_1^2/n_1+\sigma_2^2/n_2}{\max\{\sigma_1^2/n_1^2,\ \sigma_2^2/n_2^2\}}=\min\Big\{\frac{n_1^2}{\sigma_1^2},\frac{n_2^2}{\sigma_2^2}\Big\}\Big(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\Big), \]
and $v=\hat\beta^2/[d(1+n^e)]$.

A special case is the standard test of equality of means when $\sigma_1^2=\sigma_2^2=\sigma^2$. Then
\[ n^e=\min\Big\{n_1\Big(1+\frac{n_1}{n_2}\Big),\ n_2\Big(1+\frac{n_2}{n_1}\Big)\Big\}. \]
While this may look unusual, looking at the extremes indicates why it is reasonable. Indeed, as, say, $n_1\to\infty$, note that $n^e\to n_2$. In this scenario, we perfectly learn $\mu_1$, so the test of mean equality is really just a test that $\mu_2$ equals this known mean, based on $n_2$ observations. Attempting to utilise BIC with an ad hoc choice of $n$, such as $(n_1+n_2)/2$, would clearly be a disaster here.

5.4. Findley's counterexample to BIC

For the simple linear model in (2), computation yields that, under $H_1:\theta\neq0$,
\[ d=\Sigma=\bigg(\sum_{i=1}^n\frac{1}{i}\bigg)^{-1},\quad n^e=\sum_{i=1}^n\frac{1}{i},\quad S^2=|Y|^2-\hat\theta^2\sum_{i=1}^n\frac{1}{i}. \]
It follows that
\[ \Delta \mathrm{PBIC}=-\hat\theta^2\sum_{i=1}^n\frac{1}{i}+\log\bigg(1+\sum_{i=1}^n\frac{1}{i}\bigg)-2\log\bigg(\frac{1-e^{-v}}{\sqrt2\,v}\bigg),\quad v=\frac{\hat\theta^2}{d(1+n^e)}. \]
Since $\sum_{i=1}^n(1/i)=\log n+O(1)$ and $\hat\theta^2\to\theta^2$ (because the mle is consistent here),
\[ \Delta \mathrm{PBIC}=-\theta^2(\log n+O(1))+\log(\log n)-2\log\bigg(\frac{1-e^{-\theta^2}}{\sqrt2\,\theta^2}\bigg)+o(1). \]
Under $H_0:\theta=0$, $\Delta \mathrm{PBIC}=\log(\log n)+\log2+o(1)\to\infty$ and, under $H_1:\theta\neq0$, $\Delta \mathrm{PBIC}\to-\infty$. Thus PBIC is consistent as $n\to\infty$. Essentially the same argument shows that PBIC* is consistent.

5.5. Consistency of PBIC and PBIC* as p in the group means problem

Bayes model selection rules for fixed priors and fixed $p$ are virtually always consistent as the sample size $n\to\infty$. This type of consistency transfers over to rules such as PBIC and PBIC* because the priors from which they arise converge to fixed priors as $n\to\infty$ with $p$ fixed.

There is nothing within Bayesian theory, however, that guarantees consistency of Bayes rules when the dimension p also grows. Indeed, it turns out that consistency is then a very delicate property, that can easily be violated by even standard Bayes rules. The group means problem provides a simple illustration.

Example 5.1

Consider the group means problem with known $\sigma^2=1$ and effective sample sizes $n_i^e=r$ fixed, and reduce to the sufficient statistics $\bar X_i\sim N(\mu_i,1/r)$ for $i=1,\ldots,p$. Consider comparison of the null model $M_0:\mu_1=\cdots=\mu_p=0$ with the full model $M_1$: all $\mu_i$ non-zero. Suppose the $\mu_i$ are independently assigned $N(0,\tau_i^2)$ priors. Then it is easy to show that consistency obtains under $M_1$ as $p\to\infty$ if and only if $V\equiv\lim_{p\to\infty}(1/p)\sum_i^p\bar X_i^2$ satisfies $V>\lim_{p\to\infty}(1/p)\sum_i^p\tau_i^2$, assuming the limits exist. (This example was brought to our attention by J. K. Ghosh.)

After reflecting upon this, it might seem surprising that any prior could achieve consistency as $p\to\infty$. However, Berger et al. (2003) computed Laplace approximations to the marginal density, for this problem, that produced consistent Bayes factors when $p$ grows with $n$. They used a multivariate Cauchy prior, which does not result in a closed-form Bayes factor, as arises with PBIC and PBIC*. The next theorem indicates the situation involving consistency for PBIC and PBIC*.

Theorem 5.2

For the group means problem with fixed $r$, PBIC and PBIC* are consistent under $M_0$ as $p\to\infty$. Under $M_1$, and assuming that $\tau^2=\lim_{p\to\infty}(1/p)\sum_i^p\mu_i^2$ exists, PBIC and PBIC* are
\[ \text{consistent if } \tau^2>\frac{\log2+\log(1+r)+1}{r};\qquad\text{inconsistent if } \tau^2<\frac{\log2+\log(1+r)-1}{r}. \tag{20} \]

Proof.

We utilise (17) and (18) as the definitions of PBIC and PBIC*, but will ignore $C$ since it is common to all models. Note that the $n_i^e=r$, $S_1^2=\sum_{i=1}^p\sum_{j=1}^r(x_{ij}-\bar x_i)^2$, $S_0^2=S_1^2+r\sum_{i=1}^p\bar x_i^2$, and $v_i=r\bar x_i^2/(r+1)$ under $M_1$ while $v_i=0$ under $M_0$. Thus PBIC and PBIC* become, with subscripts denoting the model,
\[ \begin{aligned} \mathrm{PBIC}_0&=\mathrm{PBIC}^*_0=S_0^2=S_1^2+r\sum_{i=1}^p\bar x_i^2,\\ \mathrm{PBIC}_1&=S_1^2+p\log(r+1)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{\sqrt2\,v_i}\bigg)=S_1^2+p\log[2(r+1)]-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{v_i}\bigg),\\ \mathrm{PBIC}^*_1&=S_1^2+p\log(r+1)-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{2\,v_i\min\{v_i,1.3\}}}\bigg)=S_1^2+p\log[2(r+1)]-2\sum_{i=1}^p\log\bigg(\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{v_i\min\{v_i,1.3\}}}\bigg). \end{aligned} \]
It is straightforward to show that
\[ \frac{1-e^{-v_i}}{v_i}<1\quad\text{and}\quad\frac{1-e^{-\min\{v_i,1.3\}}}{\sqrt{v_i\min\{v_i,1.3\}}}<1, \]
so that $\Delta \mathrm{PBIC}=\mathrm{PBIC}_1-\mathrm{PBIC}_0$ and $\Delta \mathrm{PBIC}^*=\mathrm{PBIC}^*_1-\mathrm{PBIC}^*_0$ satisfy
\[ \Delta \mathrm{PBIC}\ (\Delta \mathrm{PBIC}^*)>p\log[2(r+1)]-r\sum_{i=1}^p\bar x_i^2\equiv A(p). \]
Under $M_0$, $r\sum_{i=1}^p\bar x_i^2\sim\chi^2_p$, so that
\[ A(p)=p\log[2(r+1)]-p\bigg(1+O\bigg(\frac{1}{\sqrt p}\bigg)\bigg)\to\infty \]
as $p\to\infty$, establishing consistency under $M_0$.

To show inconsistency under $M_1$, note that $r\sum_{i=1}^p\bar x_i^2\sim\chi^2_p(\lambda_p)$, with non-centrality parameter $\lambda_p=r\sum_{i=1}^p\mu_i^2$. Thus
\[ A(p)=p\log[2(r+1)]-(p+\lambda_p)\bigg(1+O\bigg(\frac{1}{\sqrt{p+\lambda_p}}\bigg)\bigg)\to\infty \]
if $\tau^2=\lim_{p\to\infty}\lambda_p/[rp]<(\log[2(1+r)]-1)/r$, establishing the inconsistency result.

To investigate consistency of PBIC and PBIC* under $M_1$, note that
\[ \frac{1-e^{-\min\{v,1.3\}}}{\sqrt{v\,\min\{v,1.3\}}}\geq\frac{1-e^{-v}}{v}\geq\frac{1}{1+v}. \tag{21} \]
Also, because of concavity, $E[\log(1+v)]\leq\log(1+E[v])$. Thus
\[ E\bigg[-\log\bigg(\frac{1-e^{-v_i}}{v_i}\bigg)\bigg]\leq E[\log(1+v_i)]\leq\log(1+E[v_i])=\log\bigg(1+\frac{1+r\mu_i^2}{1+r}\bigg). \]
Using this inequality and the fact that $\big(\prod_i\omega_i\big)^{1/p}\leq\big(\sum_i\omega_i\big)/p$, it follows that
\[ -\frac{2}{p}\,E\bigg[\sum_{i=1}^p\log\bigg(\frac{1-e^{-v_i}}{v_i}\bigg)\bigg]\leq 2\log\bigg[\prod_{i=1}^p\bigg(1+\frac{1+r\mu_i^2}{1+r}\bigg)\bigg]^{1/p}\leq 2\log\bigg[\frac{1}{p}\sum_{i=1}^p\bigg(1+\frac{1+r\mu_i^2}{1+r}\bigg)\bigg]=2\log\bigg(\frac{2+r+\lambda_p/p}{1+r}\bigg). \]
Hence, by the law of large numbers,
\[ \lim_{p\to\infty}\frac{1}{p}\Delta \mathrm{PBIC}\leq\log[2(r+1)]+2\log\bigg(\frac{2+r+r\tau^2}{1+r}\bigg)-\lim_{p\to\infty}\bigg(1+\frac{\lambda_p}{p}\bigg)\bigg(1+O\bigg(\frac{1}{\sqrt{p+\lambda_p}}\bigg)\bigg)=\log[2(r+1)]+2\log\bigg(\frac{2+r+r\tau^2}{1+r}\bigg)-(1+r\tau^2). \]
Let $B(r,\tau^2)$ denote the right-hand side above. If $B(r,\tau^2)<0$, then $\Delta \mathrm{PBIC}$ goes to $-\infty$ as $p\to\infty$, and we have consistency.

Differentiating with respect to $\tau^2$ shows that $B(r,\tau^2)$ is decreasing in $\tau^2$, so that, if we can find a value of $\tau^2$ for which $B(r,\tau^2)<0$, then any larger value of $\tau^2$ will also work. As a candidate, consider $\tau_c^2=[c+\log(1+r)]/r$. Then
\[ B(r,\tau_c^2)=\log[2(r+1)]+2\log\bigg(\frac{2+r+c+\log(1+r)}{1+r}\bigg)-\big(1+c+\log(1+r)\big). \]
Differentiating this with respect to $r$ shows that it is decreasing in $r$, so that all we need to show is that $\tau_c^2$ works for $r=1$. Indeed,
\[ B(1,\tau_c^2)=\log4+2\log\bigg(\frac{3+c+\log2}{2}\bigg)-(1+c+\log2)<0, \]
if $c>1.67$. Since $1+\log2=1.693>1.67$, the condition for consistency of PBIC in the theorem is established. And because of (21), the same condition ensures that PBIC* is consistent.
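The final bound is easy to check numerically; this sketch (ours) evaluates $B(r,\tau_c^2)$ at $c=1+\log2$ for several values of $r$, confirming that it is negative throughout, as the proof requires.

```python
import numpy as np

def B(r, tau2):
    return (np.log(2 * (r + 1))
            + 2 * np.log((2 + r + r * tau2) / (1 + r))
            - (1 + r * tau2))

c = 1 + np.log(2)                       # = 1.693... > 1.67
for r in [1, 2, 5, 10, 100]:
    tau2c = (c + np.log(1 + r)) / r
    print(r, round(B(r, tau2c), 3))     # all negative
```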

Note that, if $r$ is moderately large, PBIC and PBIC* are consistent under $M_1$ unless $\tau^2$ is extremely close to 0, i.e. unless the non-zero means are extremely close to 0; it is not surprising that it is difficult to distinguish between $M_1$ and $M_0$ in this situation.

There is a gap in the theorem between the consistency and inconsistency conditions under M1. The gap is quite large for small r, but shrinks as r grows. A more refined analysis would reduce the gap, but the theorem does convey the basic messages about consistency.

More generally, $M_0$ could be a group means model containing some zero and some non-zero means. If $M_0$ is nested in $M_1$ and the number of additional non-zero means in $M_1$ goes to $\infty$, then the theorem still applies, since the common non-zero means will be integrated out at the beginning and will not affect the analysis.

Acknowledgements

This research was begun under the auspices of the 2004–2005 SAMSI program on Latent Variables in the Social Sciences.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

M. J. Bayarri's research was supported by the Spanish Ministry of Education and Science [grant number MTM2010-19528]; James Berger's research was supported by USA National Science Foundation [grant numbers DMS-1007773 and DMS-1407775]; Woncheol Jang's research was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIP), No. 2014R1A4A1007895 and No. 2017R1A2B2012816; Luis Pericchi's research was supported by grant CA096297/CA096300 from the USA National Cancer Institute of the National Institutes of Health.

References

  • Bayarri, M. J., Berger, J. O., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40(3), 1550–1577. doi: 10.1214/12-AOS1013
  • Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer-Verlag.
  • Berger, J. O., Bayarri, M. J., & Pericchi, L. R. (2014). The effective sample size. Econometric Reviews, 33, 197–217. doi: 10.1080/07474938.2013.807157
  • Berger, J. O., Ghosh, J. K., & Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference, 112, 241–258. doi: 10.1016/S0378-3758(02)00336-1
  • Berger, J. O., Pericchi, L. R., & Varshavsky, J. A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya: The Indian Journal of Statistics. Series A, 60, 307–321.
  • Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American statistical Association, 82, 112–122.
  • Bollen, K. A., Ray, S., Zavisca, J., & Harden, J. J. (2012). A comparison of Bayes factor approximation methods including two new methods. Sociological Methods and Research, 41, 294–324. doi: 10.1177/0049124112452393
  • Chakrabarti, A., & Ghosh, J. K. (2006). A generalization of BIC for the general exponential family. Journal of Statistical Planning and Inference, 136(9), 2847–2872. doi: 10.1016/j.jspi.2005.01.005
  • Drton, M., & Plummer, M. (2017). A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2), 323–380. doi: 10.1111/rssb.12187
  • Dudley, R. M., & Haughton, D. (1997). Information criteria for multiple data sets and restricted parameters. Statistica Sinica, 7, 265–284.
  • Findley, D. F. (1991). Counterexamples to parsimony and BIC. Annals of the Institute of Statistical Mathematics, 43, 505–514. doi: 10.1007/BF00053369
  • Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Advances in neural information processing systems (pp. 604–612).
  • Haughton, D. (1988). On the choice of a model to fit data from an exponential family. The Annals of Statistics, 16(1), 342–355. doi: 10.1214/aos/1176350709
  • Haughton, D. (1991). Consistency of a class of information criteria for model selection in non-linear regression. Communications in Statistics: Theory and Methods, 20, 1619–1629. doi: 10.1080/03610929108830587
  • Haughton, D. (1993). Consistency of a class of information criteria for model selection in nonlinear regression. Theory of Probability and its Applications, 37, 47–53. doi: 10.1137/1137009
  • Haughton, D., Oud, J., & Jansen, R. (1997). Information and other criteria in structural equation model selection. Communications in Statistics, Part B – Simulation and Computation, 26(4), 1477–1516. doi: 10.1080/03610919708813451
  • Jeffreys, H. (1961). Theory of probability. London: Oxford University Press.
  • Kass, R. E., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. doi: 10.1080/01621459.1995.10476572
  • Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society, Series B, 54, 129–144.
  • Kass, R. E., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928–934. doi: 10.1080/01621459.1995.10476592
  • Pauler, D. (1998). The Schwarz criterion and related methods for normal linear models. Biometrika, 85, 13–27. doi: 10.1093/biomet/85.1.13
  • Raftery, A. E. (1999). Bayes factors and BIC – comment on ‘A critique of the Bayesian information criterion for model selection’. Sociological Methods and Research, 27, 411–427. doi: 10.1177/0049124199027003005
  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. doi: 10.1214/aos/1176344136
  • Shen, G., & Ghosh, J. K. (2011). Developing a new BIC for detecting change-points. Journal of Statistical Planning and Inference, 141(4), 1436–1447. doi: 10.1016/j.jspi.2010.10.017
  • Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society, Series B, 41, 276–278.
  • Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, 42(1), 385–388. doi: 10.1214/aoms/1177693528
  • Tierney, L., Kass, R. E., & Kadane, J. B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407), 710–716. doi: 10.1080/01621459.1989.10478824
  • Zak-Szatkowska, M., & Bogdan, M. (2011). Modified versions of Bayesian information criterion for sparse generalized linear models. Computational Statistics and Data Analysis, 55, 2908–2924. doi: 10.1016/j.csda.2011.04.016

Appendix

To see that the prior in (9) is almost the same as $\pi_C$, the Cauchy$(0,b)$ prior (we drop the $i$ subscripts in this appendix), consider the extremes.

Theorem A.1

For $b\geq d$,
\[ \lim_{|\xi|\to\infty}\frac{\pi_C(\xi)}{\pi_R(\xi)}=\frac{2\sqrt b}{\sqrt{\pi(b+d)}}\in(0.80,1.13),\qquad \frac{\pi_C(0)}{\pi_R(0)}=\frac{2d}{\sqrt{\pi b}\,\big(\sqrt{b+d}-\sqrt{b-d}\big)}\in(0.80,1.13). \]

Proof.

Note that
\[ \pi_R(0)=\frac{1}{\sqrt{2\pi}}\int_0^1\frac{1}{\sqrt{\frac{d+b}{2\lambda}-d}}\cdot\frac{1}{2\sqrt\lambda}\,d\lambda=\frac{\sqrt{b+d}-\sqrt{b-d}}{2d\sqrt\pi}. \]
Hence
\[ \frac{\pi_C(0)}{\pi_R(0)}=\frac{2d}{\sqrt{\pi b}\,\big(\sqrt{b+d}-\sqrt{b-d}\big)}. \]
It is straightforward to show that $\sqrt b\,\big(\sqrt{b+d}-\sqrt{b-d}\big)$ is decreasing in $b\geq d$, with a maximum of $\sqrt2\,d$ and minimum of $d$. Thus $\sqrt{2/\pi}\leq\pi_C(0)/\pi_R(0)\leq\sqrt{4/\pi}$, which (to 2 decimal places) is the result above.

To prove the result as $|\xi|\to\infty$, separately integrate over $\Gamma_1=(0,|\xi|^{-3/2})$ and $\Gamma_2=(|\xi|^{-3/2},1)$ in (9). For $\lambda\in\Gamma_1$, note that $(d+b-2\lambda d)^{-1}=(d+b)^{-1}+O(|\xi|^{-3/2})$, so that
\[ (d+b-2\lambda d)^{-1/2}=(d+b)^{-1/2}+O(|\xi|^{-3/2}),\qquad \exp\bigg(-\frac{\xi^2\lambda}{d+b-2\lambda d}\bigg)=\exp\bigg(-\frac{\xi^2\lambda}{d+b}\bigg)\big(1+O(|\xi|^{-1})\big). \]
Hence the integral over $\Gamma_1$ is
\[ \frac{1}{2\sqrt\pi}\int_0^{|\xi|^{-3/2}}\bigg[\frac{1}{\sqrt{d+b}}+O(|\xi|^{-3/2})\bigg]\exp\bigg(-\frac{\xi^2\lambda}{d+b}\bigg)\big(1+O(|\xi|^{-1})\big)\,d\lambda=\frac{\sqrt{d+b}}{2\sqrt\pi\,\xi^2}\bigg(1-\exp\bigg(-\frac{\sqrt{|\xi|}}{d+b}\bigg)\bigg)\big(1+O(|\xi|^{-1})\big). \]

Noting that $\exp(-\xi^2\lambda/[d+b-2\lambda d])$ is decreasing in $\lambda$, it is immediate that the integral over $\Gamma_2$ is bounded above by
\[ \frac{\exp(-\sqrt{|\xi|}/[d+b])}{2\sqrt\pi}\int_{|\xi|^{-3/2}}^1\frac{1}{\sqrt{d+b-2\lambda d}}\,d\lambda=\frac{\exp(-\sqrt{|\xi|}/[d+b])}{2\sqrt\pi\,d}\Big(\sqrt{d+b-2|\xi|^{-3/2}d}-\sqrt{b-d}\Big)=o(|\xi|^{-2}). \]

It follows that
\[ \lim_{|\xi|\to\infty}\frac{\pi_C(\xi)}{\pi_R(\xi)}=\lim_{|\xi|\to\infty}\frac{\pi^{-1}\xi^{-2}\sqrt b\,\big(1+O(|\xi|^{-2})\big)}{\frac{\sqrt{d+b}}{2\sqrt\pi\,\xi^2}\big(1-\exp(-\sqrt{|\xi|}/[d+b])\big)\big(1+O(|\xi|^{-1})\big)+o(|\xi|^{-2})}=\frac{2\sqrt b}{\sqrt{\pi(b+d)}}. \]

It is straightforward to show that $\sqrt{2/\pi}\leq 2\sqrt b/\sqrt{\pi(b+d)}\leq\sqrt{4/\pi}$, yielding (to two decimal places) the conclusion.
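The closeness asserted by the theorem is easy to see numerically; the sketch below (ours, using scipy) evaluates $\pi_R$ by quadrature in the mixing variable (substituting $\lambda=t^2$) and prints the ratio $\pi_C/\pi_R$ on a grid, which stays within roughly $(0.80,1.13)$.

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad

d, b = 1.0, 3.0                                    # illustrative values, b >= d

def pi_R(xi):
    f = lambda t: norm.pdf(xi, 0.0, np.sqrt((d + b) / (2 * t * t) - d))
    return quad(f, 1e-12, 1.0)[0]                  # lambda = t^2 substitution

def pi_C(xi):
    return cauchy.pdf(xi, scale=np.sqrt(b))        # Cauchy(0, b) density

for xi in [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]:
    print(xi, round(pi_C(xi) / pi_R(xi), 3))
```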
