
A discussion of ‘prior-based Bayesian information criterion’

Jiahua Chen & Zeny Feng
Pages 14-16 | Received 11 Feb 2019, Accepted 13 Feb 2019, Published online: 27 Feb 2019

We would like to thank the authors (Bayarri et al., Citation2018) for their interesting and thought-provoking paper, and we wish to discuss some issues related to sample size in general and to the number of covariates in the context of the linear regression model when the Bayesian information criterion (BIC) is used for model selection. Schwarz (Citation1978) was the first to develop tools for estimating the dimension of the parameter among distributions in an exponential family and, consequently, to introduce the BIC as an approximation to the Bayesian posterior probability of a given model. The BIC has been used in a broad range of contexts and widely adopted for model selection, despite the existence of situations where it may not be appropriate. Returning to its roots, as this discussion paper does, is essential when the model and the data structure deviate markedly from the original context.

The original BIC criterion targets models arising from distributions belonging to an exponential family, which permits a neat and simple analytical form after Laplace approximation. The neatness of this form is a blessing but, unfortunately, can be a curse as well. When the data are deprived of the independent and identically distributed (iid) structure, a blind application of the BIC will not survive close scrutiny. As discussed in Bayarri et al. (Citation2018), the sample size in the BIC becomes problematic. The prior-based BIC (PBIC) proposed in Bayarri et al. (Citation2018) is essential for overcoming these issues. This paper draws timely attention to many unsettling issues related to the use of the BIC in non-standard situations.

1. The classical and mutated BICs

Suppose we have a statistical model, referring as usual to a specific family of distributions, denoted as $\mathcal{M} = \{f(x; \theta): \theta \in \Theta \subset \mathbb{R}^p\}$. The density function $f(x; \theta)$ under consideration is usually chosen to have nice mathematical properties such as regularity. The dimension of the parameter $\theta$ remains the same within a model, although this assumption is not made explicit in the above presentation.

When the above model $\mathcal{M}$ is chosen for a population and a random sample is provided, the task of statistical analysis is to infer the $\theta$ value of the population. In the context of Bayesian analysis, the $\theta$ value is regarded as uncertain, and the level of uncertainty is specified by a prior density, say $\pi(\theta)$. The combination of the prior on $\theta$ and the data sampled from the population leads to the posterior distribution, which is the basis of statistical decisions.

When there are many competing models, say $M_1, M_2, \ldots, M_J$, a prior probability should be decided for each of these $M_j$'s. Let $\alpha_j$ denote the prior probability of $M_j$, $j = 1, \ldots, J$. For notational simplicity, we use $f_j(x; \theta_j)$ for the density function in model $M_j$, $p_j$ for the dimension of $\Theta_j$, the parameter space of $M_j$, and we also use some obvious conventions such as $\mathcal{M}$, $p$ and $\theta$ as generic versions.

Let $x_n$ be a sample of size n from a distribution which is a member of model $\mathcal{M}$. By the Bayes formula, the posterior probability that this $\mathcal{M}$ is $M_j$ is proportional to
(1) $\mathrm{post}(M_j) = \alpha_j \int f_j(x_n; \theta_j)\, \pi_j(\theta_j)\, d\theta_j$.
Equation (1) precisely corresponds to $S(Y, n, j)$ of Schwarz (Citation1978, p. 462). The development in Schwarz (Citation1978) is restricted to the exponential family and relies on the assumption that the density is a function of $x_n$ through $Y = Y(x_n)$. Other than the factor $\alpha_j$, our (1) duplicates the function $m(x)$ of Bayarri et al. (Citation2018). The subindex j in our expression highlights the fact that the form of $\theta_j$ depends on model $M_j$.
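To make the computation in (1) concrete, here is a minimal sketch in Python of evaluating posterior probabilities of two toy Gaussian models by numerical quadrature; the models, prior and data are our own hypothetical choices, not those of the paper.

```python
# A minimal sketch (our own toy illustration): posterior model probabilities
# as in (1) for two candidate models of n iid observations,
#   M_1: N(0, 1) with no free parameter;
#   M_2: N(theta, 1) with prior pi(theta) = N(0, 1).
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=20)        # hypothetical data

def log_marginal_m2(x):
    # log of the integral of f(x; theta) * pi(theta); the peak value is
    # factored out so the quadrature works on a well-scaled integrand
    def log_integrand(theta):
        return stats.norm.logpdf(x, theta, 1).sum() + stats.norm.logpdf(theta, 0, 1)
    peak = log_integrand(x.mean())
    val, _ = integrate.quad(lambda t: np.exp(log_integrand(t) - peak), -6, 6)
    return peak + np.log(val)

alpha = np.array([0.5, 0.5])                       # prior model probabilities
log_m = np.array([stats.norm.logpdf(x, 0, 1).sum(), log_marginal_m2(x)])
post = alpha * np.exp(log_m - log_m.max())         # proportional to (1)
print(post / post.sum())                           # normalised posterior probabilities
```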

If an accurate computation of $\mathrm{post}(M_j)$ is cheap, then we would select the model $M_j$ that maximises the posterior probability. In most cases, using some computationally feasible approximation is more realistic, but this may lead to complications.

Suppose $x_n$ consists of n independent and identically distributed observations. Let $\hat\theta_j$ be the maximum likelihood estimator (MLE) of $\theta_j$ under model $M_j$, and let $\ell_n(\cdot)$ be a generic log-likelihood function suitable for all of $M_1, M_2, \ldots, M_J$. Then, under reasonable conditions, Laplace approximation leads to the authentic BIC:
(2) $\mathrm{aBIC}(M_j) = -2\log\big\{\alpha_j \int f_j(x_n; \theta_j)\, \pi_j(\theta_j)\, d\theta_j\big\} = -2\ell_n(\hat\theta_j) + p_j \log n + c_j + o_p(1)$.
Note that $c_j$ crucially depends on $M_j$ through at least $\alpha_j$, $\pi_j$ and $f_j$, where we understand that $f_j$ and $M_j$ are two names for the same notion. When n is very large, the $c_j$ and $o_p(1)$ terms can be wrapped up into $O_p(1)$, and $\mathrm{aBIC}(M_j)$ arrives at the classical BIC:
(3) $\mathrm{BIC}(M_j) = -2\ell_n(\hat\theta_j) + p_j \log n$.
However, unless n is very large, the size of $c_j$ is not negligible. In other words, the aBIC in (2) and the BIC in (3) can be very different, so that the BIC would no longer be a good approximation to the aBIC. Should $c_j$ be taken into consideration? Bayarri et al. (Citation2018) give a positive answer to this question.
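The contrast between (2) and (3) can be seen numerically. The sketch below uses a normal-mean model with a conjugate normal prior (our own choice, with prior variance $\tau^2 = 25$), for which $-2\log m(x)$ is available in closed form; the gap between it and the classical BIC settles at an $O(1)$ constant, the $c_j$ that the BIC discards, rather than vanishing as n grows.

```python
# A minimal sketch (ours, not from the paper): exact -2*log m(x) versus the
# classical BIC in (3) for x_i iid N(theta, 1) with prior theta ~ N(0, tau2).
import numpy as np

def minus2_log_marginal(x, tau2=25.0):
    # closed-form log marginal likelihood under the conjugate normal prior
    n, S, C = len(x), x.sum(), 0.5 * (x ** 2).sum()
    A = n + 1.0 / tau2                               # posterior precision
    log_m = (-(n / 2) * np.log(2 * np.pi) - 0.5 * np.log(2 * np.pi * tau2)
             + 0.5 * np.log(2 * np.pi / A) + S ** 2 / (2 * A) - C)
    return -2 * log_m

def classical_bic(x):
    # BIC in (3) with p_j = 1 free parameter (the mean)
    n, xbar = len(x), x.mean()
    loglik = -(n / 2) * np.log(2 * np.pi) - 0.5 * ((x - xbar) ** 2).sum()
    return -2 * loglik + np.log(n)

rng = np.random.default_rng(1)
for n in [50, 500, 5000, 50000]:
    x = rng.normal(0.2, 1.0, size=n)
    print(n, minus2_log_marginal(x) - classical_bic(x))  # gap stabilises at c_j
```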

With this in mind, let us look into the details of the sample size n in the classical BIC and in the prior-based BIC of the authors. Recall that the BIC in Schwarz (Citation1978) is derived when there are n iid observations from an exponential family; these observations may be of some dimension, not necessarily the dimension of the parameter. The dimension p in the BIC refers to the dimension of $Y(x)$, where $Y(x)$ is a vector of statistics in the exponential family model, not the dimension of x. Once we leave the comfort zone of iid data and the exponential family, direct application of the BIC in (3) is questionable, though it is now a common practice. Consider an extreme case where we have n iid observations from a distribution in an exponential family, but each observation is duplicated exactly twice in the data set. The apparent sample size is therefore 2n, but the (correct) likelihood is not affected by the duplication. Applying the classical BIC mechanically leads to the wrongful BIC: $\mathrm{BIC}_w = -2\sum_{i=1}^n \log f(x_i; \hat\theta_j) + p_j \log(2n)$. Yet its difference from the rightful BIC is merely a constant $p_j \log 2$, which may well be regarded as part of $c_j$ in the aBIC. We conclude from this analysis that if omitting terms of $O_p(1)$ in the BIC is acceptable, the precise definition of the effective sample size is not so crucial.
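A quick numerical check of the duplication argument (with hypothetical data of our own) confirms that the wrongful and rightful BICs differ by exactly $p_j \log 2$:

```python
# Duplicating every observation doubles the apparent sample size, but the
# correct likelihood is unchanged, so the wrongful BIC with log(2n) exceeds
# the rightful BIC with log(n) by exactly p_j * log(2), an O(1) constant.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=100)
p_j = 1                                        # one free mean parameter

def neg2_loglik(x):
    xbar = x.mean()
    return len(x) * np.log(2 * np.pi) + ((x - xbar) ** 2).sum()

bic_right = neg2_loglik(x) + p_j * np.log(len(x))
bic_wrong = neg2_loglik(x) + p_j * np.log(2 * len(x))  # apparent size 2n
print(bic_wrong - bic_right, p_j * np.log(2))          # identical values
```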

Suppose that $\theta$ is a vector. When $\theta$ is in a small neighbourhood of the truth, and therefore also in a small neighbourhood of $\hat\theta$ when the MLE is consistent, we have $\ell_n(\theta) \approx \ell_n(\hat\theta) - \frac{1}{2}(\theta - \hat\theta)^\tau I_n (\theta - \hat\theta)$, where $I_n = I_n(\hat\theta)$ is the Fisher information matrix at $\hat\theta$. The faithful Laplace approximation would lead to
(4) $\mathrm{aBIC}(M_j) = -2\ell_n(\hat\theta_j) + \log\det\{I_n(\hat\theta_j)\} + c_j + o_p(1)$,
assuming $\log\det\{I_n(\theta_j)\} \to \infty$ as $n \to \infty$. By this, we realise that $c_j$ remains dependent on $\alpha_j$, $p_j$ and $\pi_j$, but the dependence of the aBIC on $f_j$ has been accommodated by the Fisher information. In common applications, we may choose to omit the $O_p(1)$ constants related to $\alpha_j$ and $\pi_j$ in the BIC. After this, we seem to have defined the effective sample size via $\det\{I_n(\hat\theta_j)\}$.

Consider Example 1.3 of Bayarri et al. (Citation2018), where n/2 observations are iid from $N(\theta, 1)$ and another n/2 observations are iid from $N(\theta, 1000)$. The Fisher information for $\theta$ is given by $I(\theta) = (1 + 1/1000)(n/2) \approx n/2$. Our understanding is therefore in good agreement with Bayarri et al. (Citation2018). In Example 1.4 of Bayarri et al. (Citation2018), the Fisher information for $\theta$ is given by $\sum_{i=1}^n i^{-1} \approx \log n$. Hence, using the suggested approximation (4), we would get $\mathrm{aBIC}(M_j) = -2\ell_n(\hat\theta_j) + \log\log(n)$. Our suggestion on sample size also proves reasonable when applied to Example 1.2 of Bayarri et al. (Citation2018), and is therefore in good agreement with the prior-based BIC of the authors.
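Both Fisher information computations quoted above are easy to verify numerically; the check below is ours, not the authors'.

```python
# Numerical check of the effective sample sizes in Examples 1.3 and 1.4
# of Bayarri et al. (2018) as quoted in the text.
import numpy as np

n = 10_000
# Example 1.3: n/2 observations from N(theta, 1) and n/2 from N(theta, 1000)
info_13 = (1 + 1 / 1000) * (n / 2)
print(info_13, n / 2)                      # 5005.0 vs 5000: essentially n/2

# Example 1.4: Fisher information sum_{i=1}^n 1/i, which grows like log(n)
info_14 = (1.0 / np.arange(1, n + 1)).sum()
print(info_14, np.log(n))                  # about 9.79 vs 9.21: log(n) scale
```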

2. The role of parameter dimension p

Our view on the role of the dimension p of the parameter in the BIC differs from that of Bayarri et al. (Citation2018). Our starting point is that the part of $c_j$ omitted as an $O_p(1)$ term in the aBIC to arrive at the BIC is related to p. When p is very large, the resulting approximation may lead to a nonsensical model selection criterion. We use as an example the extended BIC (EBIC) proposed by Chen and Chen (Citation2008), which is suitable for small-n-large-p problems. Consider the classical linear regression model where n independent observations are obtained and the dimension of the explanatory variable is q; we use q instead of p to avoid potential confusion. In the era of big data, the number of explanatory variables q can be much larger than n. Let $\mathcal{M}_j$ be the collection of models in which the expectation of the response is a linear combination of exactly j explanatory variables. One generally regards each specific set of j explanatory variables as making up a model in its own right. Let $M_{jk}$, $k = 1, 2, \ldots, \binom{q}{j}$, be these models. From this angle, the cardinality of $\mathcal{M}_j$ is $\binom{q}{j}$. When the aBIC is used, a prior probability $\alpha_{jk}$ is required for every $M_{jk}$, and together they lead to a total prior probability for $\mathcal{M}_j$.

Suppose one puts $\alpha_{jk} \propto 1$, which is clearly the default choice in the BIC. Then the total prior probability of the model class $\mathcal{M}_j$ satisfies $\alpha_j = \sum_{k=1}^{\binom{q}{j}} \alpha_{jk} \propto \binom{q}{j}$. When $q = 1{,}000{,}000$, we have $\alpha_2 = \{(q-1)/2\}\,\alpha_1 \approx 500{,}000\,\alpha_1$. In small-n-large-p problems, this implies that a linear model with two explanatory variables is a priori about 500,000 times as likely to be selected as a model with one explanatory variable if the BIC is applied without any modification. This is clearly counterintuitive, and it leads to inconsistent model selection when n is of lower order than q.
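The arithmetic behind this imbalance is a one-liner; the check below is ours.

```python
# With alpha_{jk} constant, the implied prior mass of the class M_j is
# proportional to C(q, j), so the two-variable class dominates:
# C(q, 2) / C(q, 1) = (q - 1) / 2.
from math import comb

q = 1_000_000
print(comb(q, 2) / comb(q, 1))   # 499999.5, i.e. about 500,000
```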

To fix this problem, Chen and Chen (Citation2008) suggest putting $\alpha_{jk} \propto \binom{q}{j}^{-\gamma}$ for some $\gamma \in [0, 1]$. In applications, one would place an upper bound J, not depending on n or q, on the number of explanatory variables allowed. Applying the Laplace approximation, the extended BIC is obtained: $\mathrm{EBIC}(M_j) = -2\ell_n(\hat\theta_j) + j\log(n) + 2j\gamma\log(q)$. Although the choice $\gamma = 1$ is the most natural, their simulation results suggest that the choice $\gamma = 0.5$ gives a better trade-off between model complexity and parsimony. When q is very large, the EBIC demands stronger evidence before accommodating a model with an additional explanatory variable.
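As a concrete illustration, here is a minimal sketch (with made-up data and our own helper function, not code from Chen and Chen) of computing the EBIC for a Gaussian linear model with j selected covariates out of q candidates.

```python
# A minimal sketch (ours) of the EBIC formula quoted in the text, applied to
# a Gaussian linear model fitted by least squares on selected columns.
import numpy as np

def ebic(y, X_sel, q, gamma=0.5):
    # EBIC(M_j) = -2*loglik + j*log(n) + 2*gamma*j*log(q)
    n, j = X_sel.shape
    beta, *_ = np.linalg.lstsq(X_sel, y, rcond=None)
    sigma2 = ((y - X_sel @ beta) ** 2).mean()        # MLE of error variance
    neg2_loglik = n * np.log(2 * np.pi * sigma2) + n
    return neg2_loglik + j * np.log(n) + 2 * gamma * j * np.log(q)

rng = np.random.default_rng(3)
n, q = 200, 5000
X = rng.normal(size=(n, q))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)
print(ebic(y, X[:, [0, 3]], q))      # the true pair of covariates
print(ebic(y, X[:, [0, 3, 7]], q))   # a spurious extra covariate: typically larger EBIC
```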

The development of the EBIC largely overlooks other Bayesian aspects of the BIC. Refinements along the lines of the PBIC can be fruitful.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada [566065 and 587391].

Notes on contributors

Jiahua Chen

Professor Jiahua Chen is Canada Research Chair, Tier I at the University of British Columbia.

Zeny Feng

Professor Zeny Feng is an associate professor at the University of Guelph.

References

  • Bayarri, M., Berger, J. O., Jang, W., Ray, S., Pericchi, L. R., & Visser, I. (2018). Prior-based Bayesian information criterion. Statistical Theory and Related Fields.
  • Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. doi: 10.1093/biomet/asn034
  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. doi: 10.1214/aos/1176344136
