
Discussion of prior-based Bayesian information criterion (PBIC) by M.J. Bayarri, James O. Berger, Woncheol Jang, Surajit Ray, Luis R. Pericchi, and Ingmar Visser

Ryan A. Peterson & Joseph E. Cavanaugh
Pages 32-34 | Received 07 Apr 2019, Accepted 22 Apr 2019, Published online: 02 May 2019

We congratulate the authors on their innovative and illuminating contribution. Their paper should not only lead to more refined and defensible applications of the Bayesian information criterion (BIC) through their proposed variants, but should also facilitate a deeper understanding of BIC and its theoretical underpinnings.

The development of the prior-based BIC variants, PBIC and PBIC*, results from a reconsideration of the Laplace approximation employed in the large-sample justification for BIC. The authors' more nuanced application of the Laplace approximation leads to the inclusion of terms based on (1) the log of the determinant of the observed information for those parameters that are common to all of the candidate models, (2) standardised estimates of the transformed parameters for those parameters that vary among the candidate models, and (3) an ‘effective sample size’ for each transformed parameter. The terms based on (2) and (3) replace the conventional penalty term of BIC.

An additional refinement to BIC could be incorporated based on terms governed by the prior probabilities assigned to each of the candidate models. To introduce such a correction, we consider the initial stages of the development that leads to BIC.

Assume that the observed data $x$ are to be described using a model $M_k$ selected from a set of candidates $M_1, M_2, \ldots, M_L$. Suppose that each $M_k$ is uniquely parameterised by a vector $\theta_k$ ($k \in \{1, 2, \ldots, L\}$). Let $l(\theta_k \mid x)$ denote the likelihood for $x$ based on $M_k$.

Let $p(M_k)$ denote a discrete prior over the models $M_1, M_2, \ldots, M_L$, specified so that $p(M_k) > 0$ for all $k \in \{1, 2, \ldots, L\}$, and $\sum_{k=1}^{L} p(M_k) = 1$. Let $\pi(\theta_k \mid M_k)$ denote a prior on $\theta_k$ given the model $M_k$.

Through the application of Bayes' theorem, for the joint posterior of $M_k$ and $\theta_k$, we have $h((M_k, \theta_k) \mid x) \propto p(M_k)\, \pi(\theta_k \mid M_k)\, l(\theta_k \mid x)$.

Here, the constant of proportionality, say $K(x)$, depends on the data $x$, yet not on the constructs for model $M_k$.

A Bayesian model selection rule might aim to choose the model $M_k$ that is a posteriori most probable. For the posterior probability of $M_k$, we then have
$$P(M_k \mid x) = K(x)\, p(M_k) \int l(\theta_k \mid x)\, \pi(\theta_k \mid M_k)\, d\theta_k.$$
If we consider minimising $-2 \log P(M_k \mid x)$ as opposed to maximising $P(M_k \mid x)$, we obtain
$$-2 \log P(M_k \mid x) = -2 \log K(x) - 2 \log p(M_k) - 2 \log \int l(\theta_k \mid x)\, \pi(\theta_k \mid M_k)\, d\theta_k.$$
Since the term involving $K(x)$ does not vary with the structure of the model $M_k$, for the purpose of model selection, this term can be discarded. We thereby obtain
$$-2 \log p(M_k) - 2 \log \int l(\theta_k \mid x)\, \pi(\theta_k \mid M_k)\, d\theta_k. \tag{1}$$
The authors' variants of BIC result through a refined approximation of the integral $\int l(\theta_k \mid x)\, \pi(\theta_k \mid M_k)\, d\theta_k$.
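As a minimal numerical sketch of criterion (1), the following snippet scores two toy candidate models (a fixed-mean normal model and a normal model with an unknown mean under a normal prior) by evaluating the integral via quadrature. The models, priors, and data are our own illustrative assumptions, not constructs from the paper under discussion.

```python
# Sketch of criterion (1): score each candidate model M_k by
#   -2 log p(M_k) - 2 log INTEGRAL l(theta_k | x) pi(theta_k | M_k) d(theta_k).
# Toy setting (an assumption of ours): M1 fixes the mean at 0; M2 has an
# unknown mean theta with a N(0, 1) prior; the data are N(theta, 1).

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = np.array([0.8, 1.3, 0.4, 1.1, 0.9])  # toy sample

def log_marginal_m2(x, tau=1.0):
    """log of the integral of the N(theta, 1) likelihood against a N(0, tau^2) prior."""
    integrand = lambda t: np.exp(norm.logpdf(x, loc=t, scale=1.0).sum()) * norm.pdf(t, 0.0, tau)
    val, _ = quad(integrand, -10.0, 10.0)
    return np.log(val)

# M1 has no free parameter, so its 'integral' is just the likelihood at mean 0.
log_marg = {
    "M1": norm.logpdf(x, loc=0.0, scale=1.0).sum(),
    "M2": log_marginal_m2(x),
}

prior = {"M1": 0.5, "M2": 0.5}  # p(M_k); alter to differentially favour models

for k, lm in log_marg.items():
    score = -2 * np.log(prior[k]) - 2 * lm  # criterion (1); smaller is better
    print(f"{k}: {score:.3f}")
```

Changing the entries of `prior` shows directly how the $-2\log p(M_k)$ terms shift the comparison between models.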

In comparing candidate models based on differences in (1), the terms $-2 \log p(M_k)$ are (i) immaterial under a uniform prior distribution $p(M_k)$, and (ii) negligible in large-sample settings where the prior probabilities are not markedly different. In the asymptotic justification of BIC, these terms are discarded. However, the terms involving $p(M_k)$ could play a role in smaller-sample settings where candidate models are differentially favoured (e.g. Neath & Cavanaugh, 1997).

Additionally, a uniform prior on the candidate models can lead to inconsistency in high-dimensional sparse settings. To further explore this potential problem, consider a regression setting based on $P$ potential covariates. Let $s$ refer to the number of true 'active' regression parameters in the generating model, and let the saturation level $\omega$ refer to the proportion of all $P$ parameters that are in the generating model, so that $\omega = s/P$. The sparsity level (i.e. the proportion of regression parameters that are truly inactive) is simply $(1 - \omega)$.

Assuming a uniform prior on the collection of candidate models induces a marginal distribution on the saturation level $\omega$ that becomes increasingly concentrated about $\omega = 0.5$ as $P$ increases. For example, consider performing best-subsets selection with $P = 10$ covariates. One might perceive that a defensible approach for determining the final model would be to choose the fitted model corresponding to the lowest BIC. However, since BIC implicitly imposes a uniform prior on the candidate models, and since there are more models with 5 covariates than with 1 or 2 covariates, BIC favours models of size 5 over models of size 1 or 2. In fact, the prior distribution for model size is centred near 5; see Figure 1. As $P$ grows, this prior becomes increasingly concentrated around $P/2$. Consequently, the prior distribution on the saturation level becomes progressively more dense around $\omega = 0.5$.

Figure 1. The prior instituted by BIC on the marginal distribution of model size (and, consequently, the saturation level) for $P = 10$ (top) and $P = 100$ (bottom).
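The concentration of this implicit prior is easy to verify directly: under a uniform prior on all $2^P$ subsets, model size $m$ has probability $\binom{P}{m}/2^P$. The short sketch below (our own illustration) computes the mode of this distribution and the mass falling within 10% of $P/2$ for the two values of $P$ shown in Figure 1.

```python
# Model-size prior implicitly assumed by BIC: under a uniform prior on all
# 2^P subsets, P(model size = m) = C(P, m) / 2^P, a Binomial(P, 1/2) pmf
# that concentrates near P/2 as P grows (cf. Figure 1).

from math import comb

for P in (10, 100):
    sizes = range(P + 1)
    pmf = [comb(P, m) / 2**P for m in sizes]
    mode = max(sizes, key=lambda m: pmf[m])
    mass_near_half = sum(pmf[m] for m in sizes if abs(m - P / 2) <= 0.1 * P)
    print(f"P={P}: mode at m={mode}, "
          f"P(|m - P/2| <= 0.1*P) = {mass_near_half:.3f}")
```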

Chen and Chen (2008) proffer a solution to this problem: the extended Bayesian information criterion (EBIC). EBIC corrects for the prior imbalance in model size by incorporating an additional term in the formulation of BIC that penalises a model in accordance with the number of candidate models of that size. Let $m_k$ denote the dimension of the regression parameter vector for $M_k$. In the context of best-subsets selection, EBIC is defined as
$$\mathrm{EBIC} = -2 \log l(\hat{\theta}_k \mid x) + m_k \log n + 2\gamma \log \binom{P}{m_k} = \mathrm{BIC} + 2\gamma \log \binom{P}{m_k}.$$

The additional penalty term for EBIC can be conveniently conceptualised in the context of the ubiquitous statistical metaphor of balls and urns. In any variable selection problem, each covariate can be viewed as a ball in an urn consisting of $P$ balls. Each model $M_k$ is defined by a random draw, without replacement, of $m_k$ balls from that urn. There are $\binom{P}{m_k}$ ways of selecting the $m_k$ balls, which is why this combinatorial term arises in the development of the criterion.

The additional penalty term of EBIC involves a tuning parameter $\gamma$, which is fixed at a value between 0 and 1, inclusive. Different values of $\gamma$ lead to important special cases of the criterion, which are depicted in Figure 2. If $\gamma = 0$, EBIC becomes the original BIC. Setting $\gamma = 1$ yields a uniform prior on the marginal distribution of model size (and, consequently, on $\omega$); however, this choice leads to a criterion that can be quite conservative in practice. The specification of $\gamma$ is associated with different consistency properties: broadly speaking, BIC is inconsistent when $P > n$, and EBIC corrects for this. Note that since $\gamma \in [0, 1]$, the penalty for EBIC is always greater than or equal to that of BIC; thus, EBIC is always at least as conservative as BIC. A more extensive discussion of the specification of $\gamma$ and the related consistency implications can be found in Chen and Chen (2008).
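As an illustration of the criterion, the sketch below performs best-subsets selection with EBIC at $\gamma = 1$ on simulated Gaussian data. The data-generating mechanism, the use of the profiled Gaussian log-likelihood $n\log(\mathrm{RSS}/n)$ (with constants common to all models dropped), and the counting of $m_k$ as the number of included covariates are our own assumptions for the example.

```python
# Best-subsets selection with EBIC on simulated data (an illustrative setup).
# For a Gaussian linear model, -2 log l(theta_hat | x) reduces to n*log(RSS/n)
# after dropping constants shared by every candidate model.

import numpy as np
from math import comb, log
from itertools import combinations

def ebic(rss, n, m, P, gamma):
    """EBIC = n*log(RSS/n) + m*log(n) + 2*gamma*log(C(P, m)); smaller is better."""
    return n * log(rss / n) + m * log(n) + 2 * gamma * log(comb(P, m))

rng = np.random.default_rng(0)
n, P = 50, 6
X = rng.normal(size=(n, P))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)  # two active covariates

best = None
for m in range(1, P + 1):
    for subset in combinations(range(P), m):
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        score = ebic(rss, n, m, P, gamma=1.0)
        if best is None or score < best[0]:
            best = (score, subset)

print("selected subset:", best[1], "EBIC:", round(best[0], 2))
```

Setting `gamma=0.0` in the call recovers ordinary BIC, so the same loop can be used to contrast the two criteria on a given data set.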

In the best-subsets setting, a similar motivation modifies the prior distributions for all of the models to induce a more formal preference for sparse models (Bogdan, Ghosh, & Doerge, 2004; Bogdan, Ghosh, & Zak-Szatkowska, 2008). The resulting criterion is referred to as the modified Bayesian information criterion (mBIC). Unlike the symmetric priors of BIC and EBIC, the formulation of mBIC utilises a right-skewed prior probability mass function on model size, where the degree of skewness is governed by the saturation level $\omega$. For mBIC, instead of specifying a $\gamma$ parameter, one must specify the 'expected' or 'central' saturation level, which we will denote as $w$. For a central saturation level $w$, and for a model $M_k$ with $m_k$ parameters, $p(M_k) = w^{m_k}(1 - w)^{P - m_k}$. Thus, mBIC views each coefficient as a random draw from an underlying population of effects where $wP$ are active and $(1 - w)P$ are inactive, but we do not know which are which a priori.

Figure 2. Marginal distributions for model size resulting from the $\gamma$ parameter for EBIC.
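To see how the mBIC prior favours sparsity, note that under $p(M_k) = w^{m_k}(1 - w)^{P - m_k}$, each added covariate increases the $-2\log p(M_k)$ term of criterion (1) by $2\log\{(1 - w)/w\}$, which is positive whenever $w < 1/2$. The following brief sketch (with illustrative values of $w$ and $P$ of our choosing) computes this per-covariate penalty.

```python
# The right-skewed mBIC model prior p(M_k) = w^{m_k} (1 - w)^{P - m_k} and
# the -2 log p(M_k) term it contributes to criterion (1); the values of w
# and P below are illustrative assumptions.

from math import log

def mbic_prior_penalty(m, P, w):
    """-2 log p(M_k) under the mBIC prior with central saturation level w."""
    return -2 * (m * log(w) + (P - m) * log(1 - w))

P = 100
for w in (0.05, 0.20):
    # Adding one covariate raises the penalty by 2*log((1-w)/w), which is
    # positive for w < 1/2, so sparser models are favoured a priori.
    print(f"w={w}: per-covariate penalty = {2 * log((1 - w) / w):.3f}")
    print(f"        -2 log p(M_k) at m=5: {mbic_prior_penalty(5, P, w):.2f}")
```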

Of course, of the two terms in (1), the term based on the integral $\int l(\theta_k \mid x)\, \pi(\theta_k \mid M_k)\, d\theta_k$ is of primary importance; the authors have justifiably focussed on refining the approximation of this integral in the development of their BIC variants. However, the inclusion of the additional terms $-2 \log p(M_k)$ in PBIC and PBIC* could potentially be beneficial in instances where it is justifiable to employ priors $p(M_k)$ that differentially favour certain models in the candidate collection.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Ryan A. Peterson

Dr Ryan A. Peterson recently received his Ph.D. from the Department of Biostatistics in the College of Public Health at the University of Iowa. In the summer of 2019, he will be joining the faculty of the Department of Biostatistics and Informatics in the Colorado School of Public Health at the University of Colorado Anschutz Medical Campus. His methodological research interests include variable selection, model selection, machine learning, high-dimensional data analysis, and computational statistics.

Joseph E. Cavanaugh

Dr Joseph E. Cavanaugh is a Professor of Biostatistics and Head of the Department of Biostatistics in the College of Public Health at the University of Iowa. He holds a secondary appointment in the Department of Statistics and Actuarial Science and is an affiliate professor in the Applied Mathematical and Computational Sciences interdisciplinary doctoral programme. His methodological research interests include variable selection, model selection, time series analysis, modelling diagnostics, and computational statistics.

References

  • Bogdan, M., Ghosh, J. K., & Doerge, R. W. (2004). Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics, 167, 989–999. doi: 10.1534/genetics.103.021683
  • Bogdan, M., Ghosh, J. K., & Zak-Szatkowska, M. (2008). Selecting explanatory variables with the modified version of the Bayesian information criterion. Quality and Reliability Engineering International, 24, 627–641. doi: 10.1002/qre.936
  • Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95, 759–771. doi: 10.1093/biomet/asn034
  • Neath, A. A., & Cavanaugh, J. E. (1997). Regression and time series model selection using variants of the Schwarz information criterion. Communications in Statistics – Theory and Methods, 26, 559–580. doi: 10.1080/03610929708831934
