ABSTRACT
We present a new approach to model selection and Bayes factor determination, based on Laplace expansions (as in BIC), which we call Prior-based Bayes Information Criterion (PBIC). In this approach, the Laplace expansion is only done with the likelihood function, and then a suitable prior distribution is chosen to allow exact computation of the (approximate) marginal likelihood arising from the Laplace approximation and the prior. The result is a closed-form expression similar to BIC, but now involves a term arising from the prior distribution (which BIC ignores) and also incorporates the idea that different parameters can have different effective sample sizes (whereas BIC only allows one overall sample size n). We also consider a modification of PBIC which is more favourable to complex models.
1. Background
1.1. The original BIC (Schwarz, 1978)
Suppose that we observe for
. Here
is an unknown vector and, in Schwarz's derivation of BIC,
is an exponential family. Then the log-likelihood function is
where
. The goal of Schwarz (1978) is to find a simple approximation to the marginal density
where
is a prior density for the unknown
, and use the approximation for model comparison.
Result 1.1
Stone, 1979
Let be the MLE of
. Then, under reasonable conditions and as
,
where c is a constant.
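The display in Result 1.1 was lost in extraction; from the surrounding text (the mle, the dimension p of the parameter, and the constant c), the standard statement of this expansion is, as a hedged reconstruction:

```latex
\log m(x) \;=\; \log f(x \mid \hat{\theta}) \;-\; \frac{p}{2}\,\log n \;+\; c \;+\; o(1),
\qquad n \to \infty .
```

Multiplying by −2 and dropping c gives the familiar BIC = −2 log f(x | θ̂) + p log n.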
Schwarz then suggested comparing two models and
, using
preferring (
) as this is negative (positive). Clearly this is equivalent to basing the model comparison on the Bayes factor (odds) of
to
, with the approximation
(1)
1.2. Problems with general use of BIC
BIC is an excellent tool for the class of problems for which it was developed. Unfortunately, it is today used ubiquitously, for completely different classes of problems. We here outline some of the issues with using BIC inappropriately.
Problem 1. The term in (1) is ignored by BIC.
This could have been a serious problem even with proper use of BIC, except that there happen to be pseudo-prior distributions that yield BIC itself (Raftery, 1999), i.e. for which the term . These pseudo-priors are not real priors, in that they are centred at the mle's of each model, which is a problematic double use of the data. Nevertheless, it is comforting that there is at least some type of prior distribution that yields BIC exactly.
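As a numerical illustration of Raftery's observation in the simplest case (a sketch under assumptions not in the text: a N(θ, σ²) model with σ known, with the unit-information pseudo-prior taken to be N(θ̂, σ²), centred at the mle), the resulting marginal likelihood reproduces BIC up to a vanishing log((n+1)/n) term:

```python
import numpy as np

# Data: n observations from N(theta, sigma^2) with sigma known.
rng = np.random.default_rng(0)
n, sigma = 200, 1.0
x = rng.normal(0.3, sigma, size=n)
xbar = x.mean()  # MLE of theta

def loglik(theta):
    """log f(x | theta), vectorised over a grid of theta values."""
    rss = np.sum((x[:, None] - np.atleast_1d(theta)) ** 2, axis=0)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - rss / (2 * sigma**2)

lhat = loglik(xbar)[0]

# Unit-information pseudo-prior: N(xbar, sigma^2), centred at the MLE.
grid = np.linspace(xbar - 8 * sigma, xbar + 8 * sigma, 20001)
prior = np.exp(-(grid - xbar) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
integrand = np.exp(loglik(grid) - lhat) * prior
log_m = lhat + np.log(np.sum((integrand[1:] + integrand[:-1]) / 2) * (grid[1] - grid[0]))

bic = -2 * lhat + 1 * np.log(n)  # p = 1 parameter
print(-2 * log_m - bic)          # log((n+1)/n) ~ 0.005: the neglected term is o(1)
```

Here the identity −2 log m = −2 log f(x | θ̂) + log(n+1) holds exactly for any data, so the agreement with BIC does not depend on the simulated sample.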
Problem 2. What is n?
A common mistake in specifying n: Note that, in Schwarz's setup, there are n vector observations of dimension p, so that there are a total of np real observations. It is common to mistakenly use
as the sample size in BIC, rather than the correct n.
Different parameters can have different n.
Example 1.2
Group means
For and
, suppose we observe
where
. If
were known, this would be exactly the setup of Schwarz, and the sample size for
would be r. In effect, each
has a sample size of r associated with it. But, if
is unknown, the parameter is
and it is not reasonable to also associate the sample size of r to
, in that we know there are
degrees of freedom associated with the mle of
.
An alternative argument is to note that the observed information matrix , with
entry
is given by
where
. The information matrix suggests that the effective sample size for each
is r, while the effective sample size for
is pr. Whether we use
or pr for the sample size associated with
will not typically make much difference, whereas the difference with using r, instead, will be quite large.
Different observations can have different observed information content.
Example 1.3
Suppose each independent observation, , has probability 1/2 of arising from the
distribution and probability 1/2 of arising from the
distribution. Clearly half the observations are essentially worthless, and the ‘effective sample size’ is n/2.
Example 1.4
Findley's BIC counterexample
One of the famous counterexamples to inappropriate use of BIC is in Findley (1991). Suppose the observations are
(2)
and we are comparing the models
and
. It turns out that the mle for θ is consistent under
(a necessary condition to apply BIC), but that BIC is inconsistent if
, in that BIC will then declare
to be the true model as
. The problem here is that, even though the information about θ goes to ∞ as n grows, it grows much more slowly than n (actually, the information grows at roughly
rate), and BIC erroneously assigns the rate to be n.
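Equation (2) itself did not survive extraction; in Findley (1991) the design is a regression through the covariate 1/√t with unit error variance, which is the assumed form in the sketch below. The Fisher information about θ is then Σ_t 1/t, which diverges, but far more slowly than the rate n that BIC implicitly assigns:

```python
import math

# Findley-type design (assumed form of the elided model (2)):
# regressor d_t = 1/sqrt(t), unit error variance, so the Fisher
# information about theta after n observations is sum_t d_t^2 = sum_t 1/t.
for n in (10**2, 10**4, 10**6):
    info = sum(1.0 / t for t in range(1, n + 1))
    print(n, info, math.log(n))
# sum_t 1/t = log n + 0.5772... + o(1): the information tracks log n,
# while the rate n assumed by BIC outgrows it without bound.
```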
Problem 3. What is p?
Just as n is often not clearly defined for use in BIC, the parameter dimension p is often not clearly defined (see also Pauler, 1998).
Example 1.5
Random effect group means
Consider hierarchical or random effect versions of the group means problem, where it is assumed that
with ξ and
being unknown. The number of parameters might appear to be p+3 (the means, along with
, ξ and
), but one could, alternatively, integrate out
(since it has a known distribution) obtaining
The marginal likelihood will be the integral of this, with respect to a prior
, so that, if one is really viewing BIC as an approximation to the marginal likelihood, it would be correct to set p=3.
Problem 4. What if p grows with n?
BIC is based on an asymptotic argument with p fixed and n growing, but often p grows with n; BIC then does not apply. If one were to erroneously apply BIC in such a situation, one could end up with inconsistency, as shown by Stone (1979) for the group means example, with known variance for simplicity. Indeed, in comparing models
and
for the group means problem with r=2,
which, under
, behaves like
as p grows, thus incorrectly selecting model
.
1.3. Variants of BIC
Noting the limitations of BIC, researchers have proposed a host of generalisations, many of which perform better than BIC in specific scenarios. Many of these methods arise from retaining different numbers of terms in the Laplace approximation of the Bayes factor (Kass & Raftery, 1995). One variant, called HBIC (Haughton, 1988), retains the third term in the Laplace approximation of the Bayes factor. A simulation study by Haughton, Oud, and Jansen (1997) shows that HBIC performs better than the usual BIC in model selection for structural equation models. Following HBIC, Bollen, Ray, Zavisca, and Harden (2012) developed a similar criterion, called the information matrix-based Bayesian information criterion (IBIC), which retains more terms in the Bayes factor approximation and outperforms BIC and HBIC in many scenarios. Bollen et al. (2012) also proposed another criterion, based on a scaled unit information prior (SPBIC), which generalises the interpretation of the unit information prior in the context of BIC. For approximation of Bayes factors as the model dimension grows, Berger, Ghosh, and Mukhopadhyay (2003) proposed another approximation, named GBIC. Following Berger et al. (2003), a generalisation of BIC for the general exponential family was proposed by Chakrabarti and Ghosh (2006), and a new BIC for change-point analysis was proposed by Shen and Ghosh (2011). Other extensions of BIC include techniques for comparing graphical models (Foygel & Drton, 2010), singular models (Drton & Plummer, 2017), and sparse models (Zak-Szatkowska & Bogdan, 2011).
1.4. Overview of the paper
Section 2 presents a proposal to generalise BIC, in order to overcome the problems mentioned above. It is based on use of a specific (robust) prior distribution in the computation of the approximate marginal likelihood of a model. Section 3 discusses a critical aspect of the definition of PBIC, namely the need to determine the ‘effective sample size’ corresponding to each parameter in a model. Section 4 presents an alternative called PBIC*. It employs an empirical Bayes prior in computation of the marginal likelihood approximation, resulting in answers more favourable to complex models. Section 5 illustrates the use of PBIC and PBIC* in the normal linear model; it is of interest that PBIC and PBIC* correspond to exact marginal likelihoods here. Illustrations in the section are simple linear regression, testing the equality of normal means with known unequal variances, Findley's counterexample, and the group means problem, where consistency results for PBIC and PBIC* are established as the number of means grows.
2. The PBIC solution
We propose a solution to these problems that depends only on software that can compute mle's and observed information matrices. The basis of the solution is a modified Laplace approximation to for reasonable default priors.
2.1. Two important preliminaries
One should analytically integrate out any parameter that has a distribution given other parameters, if it is possible to do so. For example, in the hierarchical group means example, base the analysis on the marginal likelihood , rather than the full likelihood.
We will be utilising the Laplace approximation, which is most accurate (Kass & Vaidyanathan, 1992; Tierney, Kass, & Kadane, 1989) if the parameter space is transformed to be all of . Transformation to
will also be necessary for the subsequent step of the analysis. As an illustration, in the (non-multilevel) group means example, transform to
. Then
. Note that one then works with the transformed mle
and the transformed observed information matrix
In the multilevel group means example, both
and
would need to be transformed in this fashion.
2.2. PBIC and PBIC* definitions
Suppose , where
denotes the parameters that are common to all models under consideration (e.g. an intercept in linear regression). Changing notation, let p denote the dimension of
and q denote the dimension of
; note that p will typically vary from model to model, while q is fixed. Partition the observed information matrix for a model accordingly, as
(3)
(3) (If there are no common parameters to all models, then
.) Change variables to
, where
is an orthogonal matrix such that
, with
for
, and define
(the transformed mle). The choice of
does not affect the definition below. For each transformed parameter
, let
be the effective sample size corresponding to that parameter. This is the most difficult aspect of the construction, but equals the intuitive choices of parameter sample size discussed in the earlier examples; formal definitions will be presented in Section 3. Then PBIC is defined as
(4)
(4) where
. For a certain natural prior distribution, PBIC will be shown to be accurate, as an approximation to
, up to a
term as
(for fixed dimension p). Note that, if there are no common parameters to all models, then
(5)
In the classic case considered by Schwarz, all
would equal a common n, and the first two terms in this expression are then BIC (up to an o(1) term); the ‘constant’ ignored by BIC is the final term in (5).
To summarise results in one place, here is an alternative version of the approximation, one which is more favourable to complex models; its development is given in Section 4:
(6)
Note that, if dealing with only normal mean parameters, PBIC and PBIC* are exact as approximations to
, as discussed below. This would mean, for instance, that when dealing with
, there would be no need to worry about the accuracy of the approximations.
Here are the steps in the derivation of PBIC.
2.2.1. Laplace approximation
By a Taylor series expansion of about the mle
, with ∇ denoting the gradient and
being the observed information matrix as defined earlier,
(7)
where
denotes a term that goes to zero as the sample size n grows. Technical conditions for the validity of this Laplace approximation can be found in, e.g. Tierney et al. (1989) and Kass and Vaidyanathan (1992); the key assumption needed is that
occurs on the interior of the parameter space, so that
. (If not true, the analysis must proceed as in Dudley & Haughton, 1997; Haughton, 1991, 1993). Also, the presence of
assumes that p is fixed as n grows. We will nevertheless use this approximation, even as p grows with n, relying on the considerable evidence that the Laplace approximation is quite generally accurate.
Note that we do not use the more common version of the Laplace expansion, which involves in the Taylor expansion, because we will be choosing
so that the integral in this expression can be evaluated in closed form. In particular, this means that, if we are dealing with the situation where
is the mean parameter of a normal model, then the computations herein will be entirely closed form, with no approximation being involved (and no need to then worry about p growing with n).
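As a quick numerical check of the accuracy of this form of Laplace expansion (a sketch with assumptions not in the text: a Binomial(n, θ) likelihood with a Uniform(0, 1) prior, whose exact marginal likelihood is 1/(n+1)):

```python
import math

def laplace_marginal(n, k):
    """Laplace approximation m ~ f(x|mle) * pi(mle) * sqrt(2*pi / I(mle)),
    expanding the likelihood alone as in (7), for Binomial(n, theta) data
    with k successes and a Uniform(0,1) prior (so pi(mle) = 1)."""
    that = k / n                                   # MLE of theta
    loglik = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              + k * math.log(that) + (n - k) * math.log(1 - that))
    info = n / (that * (1 - that))                 # observed information at the MLE
    return math.exp(loglik) * math.sqrt(2 * math.pi / info)

for n in (20, 200, 2000):
    exact = 1.0 / (n + 1)
    print(n, laplace_marginal(n, n // 3) / exact)  # ratio -> 1 as n grows
```

The example uses the parameter space (0, 1) directly for brevity; the preliminaries of Section 2.1 would first transform it to the real line.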
2.2.2. Choosing a good prior
Assume that the transformations in Section 2.1 have been made.
Step 1. Recall that , where
denotes the common parameters to all models. We will utilise a prior distribution
where
is defined later. The key point is that, since
is common to all models, it can be assigned a constant prior density (see, e.g. Bayarri, Berger, Forte, & García-Donato, 2012; Berger, Pericchi, & Varshavsky, 1998); choosing the constant to be
is to simplify the resulting expression. With the definitions given in (3), integrating out
results in the expression
Step 2. Change variables to , where
is an orthogonal matrix such that
, with
for
. (The choice of
does not matter in the following.) Note that
.
For this model, we will utilise a prior distribution that is independent in the , i.e.
. Then we can write
(8)
For
, in a similar situation, Jeffreys (1961) recommended the Cauchy
density
, where
is chosen to represent unit information for
(see Kass & Wasserman, 1995; also to be discussed later). A prior that yields almost the same results is
(9)
which is well-defined if
. Interestingly, this prior is very similar to the Cauchy prior no matter what
happens to be (as shown in the Appendix), so we will interpret this prior (and
) exactly as we would with the Cauchy prior. The attraction of
is that the ensuing computations can be done in closed form. That one can have all the advantages that Jeffreys showed the Cauchy prior to possess for model selection, while maintaining closed-form expressions, is a significant advantage when dealing with large model spaces. This prior was extensively discussed in Berger (1985) as a robust prior (hence the R label) for estimation problems, but its even greater value for model selection was not recognised. (This type of prior was first utilised by Strawderman (1971) in shrinkage estimation.) See also Bayarri et al. (2012), where a multivariate version of this prior is utilised for model selection in normal linear models.
With the prior in (9), the integral in (8) is straightforward to evaluate in closed form (first integrate over
, then over
) yielding
(10)
Step 3. Define the unit information, , by
(11)
Definitions of the effective sample size will be given in Section 3. It will be the case that
so that
(the condition mentioned earlier for
to be well defined). Then (10) becomes
Since
, we thus have that
with PBIC defined in (4).
3. Defining ‘effective sample size’ for each parameter
The most difficult aspect of dealing with PBIC turns out to be defining the effective sample size corresponding to a parameter. We first present a solution for linear models, and then suggest a possible solution for the general case.
3.1. Effective sample sizes in linear models
Suppose that all models under consideration are linear models of the form
(12)
with dimensions
and
. Here
is a common term present in all models (e.g. an intercept in linear regression), but
will differ from model to model. This fits into the framework for PBIC by defining
and
.
Since will be integrated out in PBIC, only the effective sample size for linear functions of
will be needed. The first step of the process is to orthogonalise the parameters by transforming
to
and defining
(13)
Since
, the linear part of the model has not changed in this reparameterisation, but now
, so that
and
are orthogonal. There are two important aspects of this. First, since
has not been altered, the new
can still be considered common parameters in each model, and will be integrated out in PBIC, so that their changed definition is irrelevant. Second,
has not been transformed, crucial because we wish effective sample sizes for linear functions of
Write , with
, where
is the correlation matrix, and define
to be the diagonal matrix with entries
. Berger, Bayarri, and Pericchi (2014) gave, as the general definition of the effective sample size (called TESS), for any scalar linear transformation
(
is
) of
,
(14)
Example 3.1
Group means example
Assume for
groups, and
replicates in the ith group, and that the
are i.i.d.
. Computation yields that TESS for
is
, as is to be expected. Note that
could be 1, which can be seen to be the lower bound on TESS for linear models when
.
Example 3.2
Orthogonal and related designs
Assume that has orthogonal columns with entries
, and that
. Simple computation here shows that
for each
.
Note that the effective sample size here is n, in contrast to the group means problem where the effective sample size can be as low as 1. Indeed, it can be shown that, when
, TESS will always be between 1 and n, with both limits attainable.
Example 3.3
Heteroscedastic independent observations
Assume ,
independent,
. Here the effective sample size is
Consider the particular case where, for
, we have
, whereas for the remaining
observations,
, where
is much larger than
; thus, intuitively, only the first
observations count. Then, unless
is large,
3.2. A general definition of effective sample size
Suppose one has independent observations . A possible general definition for the ‘effective sample size’ follows from considering the information associated with observation
arising from the single-observation expected information matrix
, where
Since
is the expected information about
, a reasonable way to define the effective sample size,
, is
define information weights
define the effective sample size for
as
Intuitively, is a weighted measure of the information ‘per observation’, and dividing the total information about
by this information per case seems plausible as an effective sample size.
Unfortunately, this does not seem to always be an effective definition; for instance, it does not reduce to TESS for all linear models. This should thus be viewed as primarily a starting point for future investigations of effective sample size in nonlinear models.
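A sketch of this construction in the scalar-parameter case. The weight choice is an assumed reading of the elided display: weights w_i = I_i/Σ_j I_j, so that the per-observation information is Ī = Σ_i w_i I_i and n_e = (Σ_i I_i)/Ī = (Σ I_i)²/Σ I_i². This reading reproduces the stated answers in Examples 1.3 and 3.3:

```python
def effective_sample_size(info):
    """n_e = (total information) / (information-weighted mean per-observation
    information) = (sum I_i)^2 / sum I_i^2, for scalar informations I_i.
    The weight choice w_i = I_i / total is an assumed reading of the
    elided display in Section 3.2."""
    total = sum(info)
    ibar = sum(i * i for i in info) / total  # sum_i w_i I_i, w_i = I_i / total
    return total / ibar

# Example 1.3: half of n = 100 observations carry essentially no information.
print(effective_sample_size([1.0] * 50 + [1e-12] * 50))   # ~ 50 = n/2

# Example 3.3: k = 10 precise observations, the remaining 90 very noisy.
print(effective_sample_size([1.0] * 10 + [1e-6] * 90))    # ~ 10 = k
```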
4. PBIC*: a version more favourable to complex models
Recall, from Raftery (1999), that BIC can be thought of as arising from unit information priors for each model that are centred at the maximum of the model likelihood. This choice of prior seems highly favourable to more complex models, since the prior gives virtually all of its mass to a modest neighbourhood of the likelihood for each model.
In contrast, PBIC utilises unit information priors that are centred at 0 and, hence, can give little mass to the region of high model likelihood. The fat tails of the prior do result in reasonable answers (cf. Bayarri et al., 2012; Jeffreys, 1961), but it is of interest to investigate an intermediate solution.
The intermediate solution is to keep the prior centred at 0, but choose the scales of the prior, , so that the prior will extend out to the likelihood. In our setup, this can be implemented by choosing the
so as to maximise
in (10); thus we are effectively choosing the prior in our class that is most favourable to each model. Clearly this does allow the prior to give more mass to the region of high model likelihood, but does not allow complete concentration of mass in this region.
Since this prior maximises the marginal likelihood within the given class, it can be viewed as the empirical Bayes prior in the class. It was also a choice popularised in the ‘robust Bayes’ literature (cf. Berger & Sellke, 1987), and was used in Bollen et al. (2012) to develop a related generalisation of BIC.
The scale that maximises (10) can easily be seen to be
Unfortunately, the resulting version of BIC has serious problems; in particular it will typically not be consistent as
in that, if
is zero, the prior will concentrate about zero at such a fast rate that the models with and without
are essentially equivalent (and one will fail to select the model without
with probability approaching 1). This same lack of consistency afflicts the developments in Bollen et al. (2012) and the robust Bayesian choices.
The obvious solution is simply to prevent from becoming too small, and the obvious constraint is to restrict it to the region
. This yields the recommended choice
(15)
This will avoid inconsistency as
in that, as long as
with c a non-zero constant, the resulting prior behaves asymptotically when
as a fixed prior, and fixed priors will yield consistency as
. (Consistency when the effective sample sizes do not grow is a more delicate issue, discussed in Section 5.5.)
Replacing with
, (10) becomes
The resulting approximation to
is given in (6).
5. PBIC and PBIC* for the linear model
5.1. The expressions
Consider the normal linear model framework in (12) and assume the orthogonalisation discussed there has been carried out. This does not change PBIC, but is more convenient because we can ignore the common orthogonal parameter
(see Bayarri et al., 2012; Berger et al., 1998; Jeffreys, 1961, for justification), and focus only on the other parameters
, with the associated model
(16)
with
given by (13).
Following the PBIC algorithm, note that . Change variables to
, where
is an orthogonal matrix such that
, with
for
. Then, for each
, define
using (14) with
, and let
, where
. Finally, recalling that
, PBIC and PBIC* are given by
(17)
(18)
where
is the usual residual sum of squares corresponding to (12) and
Note that C is the same constant in any model under consideration, and hence it can be ignored in comparing models or determining Bayes factors.
In what follows, we describe some important linear model examples. There are more, including correlated observations and autoregressive models, in Berger et al. (2014).
5.2. Simple linear regression
Let , so that
Suppose we are considering two models
and
. Computation under
yields
, so that
. Also, from (14),
(19)
Finally,
,
, and
complete the terms needed to define PBIC and PBIC* under
. Under
, we only need
; thus, with
,
is the obvious modification of this.
5.3. Testing equality of two means with unequal variances
Consider comparing two normal means via the test versus
, where the associated known variances,
and
are not equal. The linear model is thus
Defining
and
places this in the linear model comparison framework, where we are comparing
versus
with the covariate matrix
Under
, computation yields
so that
Also, from (14),
and
.
A special case is the standard test of equality of means when . Then
While this may look unusual, looking at the extremes indicates why this is reasonable. Indeed, as say
, note that
. In this scenario, we perfectly learn
, so the test of mean equality is really just a test that
equals this known mean, based on
observations. Attempting to utilise BIC with an ad hoc choice of n, such as
, would clearly be a disaster here.
5.4. Findley's counterexample to BIC
For the simple linear model in (2), computation yields that, under
,
It follows that
Since
and
(because the mle is consistent here),
Under
,
and, under
,
. Thus PBIC is consistent as
. Essentially the same argument shows that PBIC* is consistent.
5.5. Consistency of PBIC and PBIC* as p grows in the group means problem
Bayes model selection rules for fixed priors and fixed p are virtually always consistent as the sample size grows. This type of consistency transfers over to rules such as PBIC and PBIC* because the priors from which they arise converge to fixed priors as
with p fixed.
There is nothing within Bayesian theory, however, that guarantees consistency of Bayes rules when the dimension p also grows. Indeed, it turns out that consistency is then a very delicate property, that can easily be violated by even standard Bayes rules. The group means problem provides a simple illustration.
Example 5.1
Consider the group means problem with known and effective sample size
fixed, and reduce to the sufficient statistics
for
. Consider comparison of the null model
with the full model
. Suppose the
are independently assigned
priors. Then it is easy to show that consistency obtains under
as
if and only if
satisfies
, assuming the limits exist. (This example was brought to our attention by J. K. Ghosh.)
After reflecting upon this, it might seem surprising that any prior could achieve consistency as p grows. However, Berger et al. (2003) computed Laplace approximations to the marginal density for this problem that produce consistent Bayes factors when p grows with n. They used a multivariate Cauchy prior, which does not result in a closed-form Bayes factor, as arises with PBIC and PBIC*. The next theorem describes the consistency situation for PBIC and PBIC*.
Theorem 5.2
For the group means problem with fixed r, PBIC and PBIC* are consistent under as
. Under
and assuming that
exists, PBIC and PBIC* are
(20)
Proof.
We utilise (17) and (18) as the definitions of PBIC and PBIC*, but will ignore C since it is common to all models. Note that the
,
,
,
under
and
under
. Thus PBIC and PBIC* become, with subscripts denoting the model,
It is straightforward to show that
so that
and
satisfy
Under
,
, so that
establishing consistency under
.
To show inconsistency under , note that
, with non-centrality parameter
. Thus
if
, establishing the inconsistency result.
To investigate consistency of PBIC and PBIC* under , note that
(21)
Also, because of concavity,
. Thus,
Using this inequality and the fact that
, it follows that
Hence, by the law of large numbers,
Let
denote the right hand side above (without the
term). If
, then
goes to
as
, and we have consistency.
Differentiating with respect to shows that
is decreasing in
, so that, if we can find a value of
for which
, then any larger value of
will also work. As a candidate, consider
. Then
Differentiating this with respect to r shows that it is decreasing in r so that all we need to show is that
works for r=1. Indeed,
if c>1.67. Since
, the condition for consistency of PBIC in the theorem is established. And because of (21), the same condition ensures that PBIC* is consistent.
Note that, if r is moderately large, PBIC and PBIC* are consistent under , unless
is extremely close to 0, i.e. unless the non-zero means are extremely close to 0; it is not surprising that it is difficult to distinguish between
and
in this situation.
There is a gap in the theorem between the consistency and inconsistency conditions under . The gap is quite large for small r, but shrinks as r grows. A more refined analysis would reduce the gap, but the theorem does convey the basic messages about consistency.
More generally, could be a group means model containing some zero and some non-zero means. If
is nested in
and the number of additional non-zero means in
goes to ∞, then the theorem still applies, since the common non-zero means will be integrated out at the beginning and will not affect the analysis.
Acknowledgements
This research was begun under the auspices of the 2004–2005 SAMSI program on Latent Variables in the Social Sciences.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- Bayarri, M. J., Berger, J. O., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40(3), 1550–1577. doi: 10.1214/12-AOS1013
- Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer-Verlag.
- Berger, J. O., Bayarri, M. J., & Pericchi, L. R. (2014). The effective sample size. Econometric Reviews, 33, 197–217. doi: 10.1080/07474938.2013.807157
- Berger, J. O., Ghosh, J. K., & Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference, 112, 241–258. doi: 10.1016/S0378-3758(02)00336-1
- Berger, J. O., Pericchi, L. R., & Varshavsky, J. A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya: The Indian Journal of Statistics. Series A, 60, 307–321.
- Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82, 112–122.
- Bollen, K. A., Ray, S., Zavisca, J., & Harden, J. J. (2012). A comparison of Bayes factor approximation methods including two new methods. Sociological Methods and Research, 41, 294–324. doi: 10.1177/0049124112452393
- Chakrabarti, A., & Ghosh, J. K. (2006). A generalization of BIC for the general exponential family. Journal of Statistical Planning and Inference, 136(9), 2847–2872. doi: 10.1016/j.jspi.2005.01.005
- Drton, M., & Plummer, M. (2017). A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2), 323–380. doi: 10.1111/rssb.12187
- Dudley, R. M., & Haughton, D. (1997). Information criteria for multiple data sets and restricted parameters. Statistica Sinica, 7, 265–284.
- Findley, D. F. (1991). Counterexamples to parsimony and BIC. Annals of the Institute of Statistical Mathematics, 43, 505–514. doi: 10.1007/BF00053369
- Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Advances in neural information processing systems (pp. 604–612).
- Haughton, D. (1988). On the choice of a model to fit data from an exponential family. The Annals of Statistics, 16(1), 342–355. doi: 10.1214/aos/1176350709
- Haughton, D. (1991). Consistency of a class of information criteria for model selection in non-linear regression. Communications in Statistics: Theory and Methods, 20, 1619–1629. doi: 10.1080/03610929108830587
- Haughton, D. (1993). Consistency of a class of information criteria for model selection in nonlinear regression. Theory of Probability and its Applications, 37, 47–53. doi: 10.1137/1137009
- Haughton, D., Oud, J., & Jansen, R. (1997). Information and other criteria in structural equation model selection. Communications in Statistics, Part B – Simulation and Computation, 26(4), 1477–1516. doi: 10.1080/03610919708813451
- Jeffreys, H. (1961). Theory of probability. London: Oxford University Press.
- Kass, R. E., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. doi: 10.1080/01621459.1995.10476572
- Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society, Series B, 54, 129–144.
- Kass, R. E., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928–934. doi: 10.1080/01621459.1995.10476592
- Pauler, D. (1998). The Schwarz criterion and related methods for normal linear models. Biometrika, 85, 13–27. doi: 10.1093/biomet/85.1.13
- Raftery, A. E. (1999). Bayes factors and BIC – comment on ‘A critique of the Bayesian information criterion for model selection’. Sociological Methods and Research, 27, 411–427. doi: 10.1177/0049124199027003005
- Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. doi: 10.1214/aos/1176344136
- Shen, G., & Ghosh, J. K. (2011). Developing a new BIC for detecting change-points. Journal of Statistical Planning and Inference, 141(4), 1436–1447. doi: 10.1016/j.jspi.2010.10.017
- Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society, Series B, 41, 276–278.
- Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, 42(1), 385–388. doi: 10.1214/aoms/1177693528
- Tierney, L., Kass, R. E., & Kadane, J. B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407), 710–716. doi: 10.1080/01621459.1989.10478824
- Zak-Szatkowska, M., & Bogdan, M. (2011). Modified versions of Bayesian information criterion for sparse generalized linear models. Computational Statistics and Data Analysis, 55, 2908–2924. doi: 10.1016/j.csda.2011.04.016
Appendix
To see that the prior in (9) is almost the same as
, the Cauchy(0,b) prior (we drop the i subscripts in this appendix), consider the extremes.
Theorem A.1
For
Proof.
Note that
Hence
It is straightforward to show that
is decreasing in
with a maximum of
and minimum of d. Thus
, which (to 2 decimal places) is the result above.
To prove the result as , separately integrate over
and
in (9). For
, note that
, so that
Hence the integral over
is
Noting that is decreasing in λ, it is immediate that the integral over
is bounded above by
It follows that
It is straightforward to show that , yielding (to two decimal places) the conclusion.