We would like to thank the authors (Bayarri et al., Citation2018) for their interesting and thought-provoking paper, and we wish to discuss some issues related to the sample size in general and to the number of covariates in the context of the linear regression model when the Bayesian information criterion (BIC) is used for model selection. Schwarz (Citation1978) was the first to develop tools for estimating the dimension of parameters among distributions in the exponential family and, consequently, to introduce the BIC to serve as an approximation to the Bayesian posterior probability of a given model. The BIC has been used in a broad context and has been widely adopted for model selection, despite the fact that there are situations where the BIC might not be appropriate. Returning to its roots, as this discussion paper does, is essential when the model and the data structure markedly deviate from the original context.
The original BIC targets models arising from distributions belonging to an exponential family, which permits a neat and simple analytical form after the Laplace approximation. The neatness of this form is a blessing but, unfortunately, can be a curse as well. When the data lack the independent and identically distributed (iid) structure, a blind application of the BIC will not survive close scrutiny. As discussed in Bayarri et al. (Citation2018), the sample size in the BIC becomes problematic. The prior-based BIC (PBIC) proposed in Bayarri et al. (Citation2018) is essential to overcome these issues. This paper draws our attention in a timely manner to many unsettled issues related to the use of the BIC in non-standard situations.
1. The classical and mutated BICs
Suppose we have a statistical model, referring to a specific family of distributions as usual, denoted as $M = \{ f(x; \theta) : \theta \in \Theta \}$. The density function $f(x; \theta)$ under consideration is usually chosen to have nice mathematical properties such as being regular. The dimension of the parameter $\theta$ remains the same within a model, though this assumption is not obvious from the above presentation.
When the above model $M$ is chosen for a population and a random sample is provided, statistical analysis aims to infer the $\theta$ value of the population. In the context of Bayesian analysis, the $\theta$ value is regarded as uncertain, and the level of uncertainty is specified by a prior density, say $\pi(\theta)$. The combination of the prior on $\theta$ and the data sampled from the population leads to the posterior distribution, which is the basis of statistical decisions.
When there are many competing models, say $M_1, \ldots, M_K$, a prior probability should be assigned to each of these $M_j$'s. Let $P(M_j)$ denote the prior probability of $M_j$. For notational simplicity, we use $f_j(x; \theta_j)$ for the density function in model $M_j$ and $p_j$ for the dimension of $\Theta_j$, the parameter space of $M_j$; we also use some obvious conventions such as $M$, $p$ and $\theta$ as generic versions.
Let $x = (x_1, \ldots, x_n)$ be a sample of size $n$ from a distribution which is a member of model $M$. By the Bayes formula, the posterior probability that this $M$ is $M_j$ is proportional to

(1) $\mathrm{post}(M_j) \propto P(M_j) \int \prod_{i=1}^{n} f_j(x_i; \theta_j)\, \pi_j(\theta_j)\, d\theta_j,$

where $\pi_j(\theta_j)$ is the prior density of $\theta_j$ under $M_j$. Equation (1) precisely corresponds to the expression on p. 462 of Schwarz (Citation1978). The development in Schwarz (Citation1978) is restricted to the exponential family and works under the assumption that the density is a function of $x$ through the sufficient statistic $y(x)$. Other than the factor $P(M_j)$, our (1) duplicates the marginal density $m_j(x)$ of Bayarri et al. (Citation2018). The subindex $j$ in our expression highlights the fact that the form of $m_j(x)$ depends on model $M_j$.
If an accurate computation of $\mathrm{post}(M_j)$ is cheap, then we would select the model that maximises the posterior probability. In most cases, using some computationally feasible approximation is more realistic, but this may lead to complications.
Suppose $x$ consists of $n$ independent and identically distributed observations. Let $\hat{\theta}_j$ be the maximum likelihood estimator (MLE) of $\theta_j$ under model $M_j$ and $\ell_n(\theta)$ be a generic log-likelihood function suitable for all $M_j$. Then under reasonable conditions, the Laplace approximation leads to the authentic BIC:

(2) $\mathrm{aBIC}(M_j) = 2 \ell_n(\hat{\theta}_j) - p_j \log n + 2 \log P(M_j) + c_j,$

where $c_j = p_j \log(2\pi) + 2 \log \pi_j(\hat{\theta}_j) - \log |I_j(\hat{\theta}_j)|$ and $I_j(\theta)$ is the per-observation Fisher information matrix under $M_j$. Note that $c_j$ crucially depends on $M_j$ through at least $p_j$, $\pi_j$ and $f_j$, as we understand that $\ell_n(\theta_j)$ and $\log f_j(x; \theta_j)$ are two names for the same notion. When $n$ is very large, the $2 \log P(M_j)$ and $c_j$ can be wrapped up into an $O(1)$ term and the $\mathrm{aBIC}(M_j)$ arrives at the classical BIC:

(3) $\mathrm{BIC}(M_j) = 2 \ell_n(\hat{\theta}_j) - p_j \log n.$
However, unless $n$ is very large, the size of the omitted constant is not negligible. In other words, the aBIC in (2) and the BIC in (3) can be very different, so that the BIC would no longer be a good approximation to the aBIC. Should this constant be taken into consideration? Bayarri et al. (Citation2018) give a positive answer to this question.
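To see numerically that the omitted term is $O(1)$ rather than vanishing, the following sketch (our own hypothetical conjugate example, not taken from either paper) compares twice the exact log marginal likelihood with the classical BIC for iid $N(\theta, 1)$ data under a $N(0, \tau^2)$ prior:

```python
import math
import random

def log_normal_pdf(x, mean, var):
    # log density of N(mean, var) at x
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def exact_log_marginal(xs, tau2):
    # closed-form log marginal likelihood for x_i ~ N(theta, 1) iid
    # with the conjugate prior theta ~ N(0, tau2)
    n = len(xs)
    xbar = sum(xs) / n
    s = sum((x - xbar) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * s
            + 0.5 * math.log(2 * math.pi / n)
            + log_normal_pdf(xbar, 0.0, tau2 + 1.0 / n))

def classical_bic(xs):
    # classical BIC in the convention of (3): 2*loglik(MLE) - p*log(n), p = 1
    n = len(xs)
    xbar = sum(xs) / n
    loglik = sum(log_normal_pdf(x, xbar, 1.0) for x in xs)
    return 2 * loglik - math.log(n)

random.seed(1)
gaps = []
for n in (50, 500, 5000):
    xs = [random.gauss(0.3, 1.0) for _ in range(n)]
    gaps.append(2 * exact_log_marginal(xs, tau2=4.0) - classical_bic(xs))
    print(n, round(gaps[-1], 3))
# the gap settles near a non-zero constant: the omitted term is O(1), not o(1)
```

As $n$ grows, the gap stabilises at a non-zero value determined by the prior and the Fisher information, rather than shrinking to zero.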
With this in mind, let us look into the details of the sample size $n$ in the classical BIC and in the prior-based BIC of the authors for a given data set. Recall that the BIC in Schwarz (Citation1978) is derived when there are $n$ iid observations from an exponential family; these observations may be of any dimension, which is not necessarily the dimension of the parameters. The dimension of parameters $p$ in the BIC refers to the dimension of $y(x)$, a vector of sufficient statistics, not that of $x$ in the exponential family model. Once we leave the comfort zone of iid data and the exponential family, direct application of the BIC in (3) is questionable, though it is now a common practice. Consider an extreme case where we have $n$ iid observations from a distribution in an exponential family, but each observation is duplicated exactly twice in the data set. The apparent sample size is therefore $2n$, but the (correct) likelihood is not affected by the duplication. Applying the classical BIC merely in formality leads to the wrongful BIC:

$\mathrm{BIC}_{\mathrm{wrong}}(M_j) = 2 \ell_n(\hat{\theta}_j) - p_j \log(2n).$

Yet its difference from the rightful BIC is merely a constant $p_j \log 2$, which may well be regarded as part of the $O(1)$ term in the aBIC. We suggest from this analysis that if omitting terms of order $O(1)$ in the BIC is acceptable, the precise definition of the effective sample size is not so crucial.
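The duplication argument can be verified directly; in this sketch the sample size, dimension and log-likelihood value are hypothetical:

```python
import math

def bic(loglik, p, n):
    # classical BIC in the paper's convention (3): 2*loglik - p*log(n)
    return 2 * loglik - p * math.log(n)

# hypothetical fitted model: the values are illustrative only
n, p, loglik = 100, 3, -150.0
rightful = bic(loglik, p, n)      # uses the n distinct observations
wrongful = bic(loglik, p, 2 * n)  # blindly uses the apparent sample size 2n
print(rightful - wrongful)        # p * log(2) ~ 2.079, constant in n
```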
Suppose that $\theta$ is a vector. When $\theta$ is in a small neighbourhood of the truth, and therefore also in a small neighbourhood of $\hat{\theta}$ when the MLE is consistent, we have

$\ell_n(\theta) \approx \ell_n(\hat{\theta}) - \tfrac{1}{2} (\theta - \hat{\theta})^{\top} I_n(\hat{\theta}) (\theta - \hat{\theta}),$

where $I_n(\theta)$ is the Fisher information matrix of the whole data at $\theta$. The faithful Laplace approximation would lead to

(4) $\mathrm{aBIC}(M_j) = 2 \ell_n(\hat{\theta}_j) - \log |I_n(\hat{\theta}_j)| + 2 \log P(M_j) + 2 \log \pi_j(\hat{\theta}_j) + p_j \log(2\pi),$

assuming that $n^{-1} I_n(\theta)$ converges to a positive definite matrix as $n \to \infty$. By this, we realise that the aBIC remains dependent on $p_j$, $\pi_j$ and $P(M_j)$, but the dependence of the aBIC on $n$ has been accommodated in the Fisher information. In common applications, we may choose to omit the $O(1)$ constants related to $\pi_j$ and $P(M_j)$ in the BIC, after which we seem to have defined the effective sample size via $|I_n(\hat{\theta})|$.
Consider Example 1.3 of Bayarri et al. (Citation2018), where $n/2$ observations are iid from $N(\theta, \sigma_1^2)$ and another $n/2$ observations are iid from $N(\theta, \sigma_2^2)$. The Fisher information for $\theta$ is given by $I_n(\theta) = (n/2)(\sigma_1^{-2} + \sigma_2^{-2})$. Our understanding is therefore in good agreement with Bayarri et al. (Citation2018). In Example 1.4 of Bayarri et al. (Citation2018), the Fisher information for $\theta$ can be computed in the same fashion; hence, using the above suggested approximate aBIC, we would get the same effective sample size. Our suggestion on the sample size is also found reasonable when applied to Example 1.2 of Bayarri et al. (Citation2018), and is therefore in good agreement with the prior-based BIC of the authors.
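As a quick numerical check of the Example 1.3 reasoning, the sketch below uses hypothetical variances (1 for the precise half and $1000^2$ for the noisy half; these particular values are our own choice, not taken from the paper):

```python
def fisher_info_two_groups(n, var1, var2):
    # Fisher information I_n(theta) for a common mean theta when
    # n/2 observations have variance var1 and n/2 have variance var2
    return (n / 2) * (1.0 / var1 + 1.0 / var2)

# hypothetical setting: a precise half and a very noisy half
n = 1000
info = fisher_info_two_groups(n, 1.0, 1000.0 ** 2)
# for a scalar theta with unit reference variance, I_n itself plays the
# role of the effective sample size defined via |I_n(theta_hat)|
print(info)  # ~ 500, i.e. n/2: the noisy half contributes almost nothing
```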
2. The role of parameter dimension p
Our view on the role of the dimension of the parameter $p$ in the BIC differs from that of Bayarri et al. (Citation2018). Our starting point is that part of what is omitted as an $O(1)$ term in the aBIC to arrive at the BIC is related to $p$. When $p$ is very large, the resulting approximation may lead to a nonsensical model selection criterion. We use as an example the extended BIC (EBIC), proposed by Chen and Chen (Citation2008), which is suitable for small-$n$-large-$p$ problems. Consider the classical linear regression model where $n$ independent observations are obtained and the dimension of the explanatory variable is $q$; we use $q$ instead of $p$ to avoid potential confusion. In the era of big data, the number of explanatory variables $q$ can be much larger than $n$. Let $\mathcal{A}_j$ be the collection of models in which the expectation of the response is a linear combination of exactly $j$ explanatory variables. One generally regards each specific set of $j$ explanatory variables as making up a model in its own right. Let $M_{j1}, \ldots, M_{j\tau_j}$ be these models. From this angle, the cardinality of $\mathcal{A}_j$ is $\tau_j = \binom{q}{j}$. When the aBIC is used, a prior probability $P(M_{jk})$ is required for every $M_{jk}$, and these lead to a total prior probability $P(\mathcal{A}_j) = \sum_{k=1}^{\tau_j} P(M_{jk})$ for $\mathcal{A}_j$.

Suppose one puts an equal prior probability on every individual model, as is clearly the default choice in the BIC. We then have $P(\mathcal{A}_j) \propto \tau_j = \binom{q}{j}$ for the model set $\mathcal{A}_j$. When $q = 1{,}000{,}000$, we have $P(\mathcal{A}_2)/P(\mathcal{A}_1) = (q-1)/2 \approx 500{,}000$. In small-$n$-large-$p$ problems, this implies that a linear model with two explanatory variables is about 500,000 times more likely to be selected than a model with one explanatory variable if the BIC is applied without any modification. This is apparently controversial, and it leads to inconsistent model selection when $n$ is of a lower order than $q$.
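The prior mass ratio under the equal per-model prior can be verified directly:

```python
from math import comb

q = 1_000_000
# total prior mass on the classes A_1 and A_2 under an equal per-model
# prior is proportional to their cardinalities C(q, 1) and C(q, 2)
ratio = comb(q, 2) / comb(q, 1)
print(ratio)  # (q - 1) / 2 = 499999.5
```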
To fix this problem, Chen and Chen (Citation2008) suggest putting $P(M_{jk}) \propto \tau_j^{-\gamma}$ for some $\gamma \in [0, 1]$. In applications, one would put an upper bound $J$, not depending on $n$ or $q$, on the number of explanatory variables allowed. Applying the Laplace approximation, the extended BIC is obtained:

$\mathrm{EBIC}_{\gamma}(M_{jk}) = 2 \ell_n(\hat{\theta}_{jk}) - j \log n - 2\gamma \log \tau_j, \quad M_{jk} \in \mathcal{A}_j.$

Although the choice of $\gamma = 1$, which assigns an equal total prior probability to each class $\mathcal{A}_j$, is most natural, their simulation results suggest that intermediate values of $\gamma$ give a better trade-off between model complexity and parsimony. When $q$ is very large, the EBIC demands stronger evidence in order to accommodate a model with another explanatory variable.
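A minimal sketch of how the extra EBIC penalty changes a comparison between nested models; the log-likelihood values are hypothetical and the sign convention follows (3) (larger is better):

```python
import math
from math import comb

def ebic(loglik, j, n, q, gamma):
    # EBIC in the maximisation convention of (3); the extra penalty
    # 2*gamma*log(C(q, j)) reflects the number of models in the class A_j
    return 2 * loglik - j * math.log(n) - 2 * gamma * math.log(comb(q, j))

# hypothetical fits: the second variable buys 4 extra log-likelihood units
n, q = 200, 10_000
small = ebic(-300.0, 1, n, q, gamma=1.0)
large = ebic(-296.0, 2, n, q, gamma=1.0)
print(small > large)  # True: EBIC rejects the extra variable ...
print(ebic(-296.0, 2, n, q, gamma=0.0) > ebic(-300.0, 1, n, q, gamma=0.0))
# ... True: while the plain BIC (gamma = 0) would accept it
```

With $\gamma = 0$ the criterion reduces to the classical BIC, so the contrast isolates the effect of the $\log \tau_j$ penalty.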
The development of the EBIC largely overlooks other Bayesian aspects of the BIC. Refinements along the lines of the PBIC can be fruitful.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Jiahua Chen
Professor Jiahua Chen is Canada Research Chair, Tier I at the University of British Columbia.
Zeny Feng
Professor Zeny Feng is an associate professor at the University of Guelph.
References
- Bayarri, M., Berger, J. O., Jang, W., Ray, S., Pericchi, L. R., & Visser, I. (2018). Prior-based Bayesian information criterion. Statistical Theory and Related Fields.
- Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. doi: 10.1093/biomet/asn034
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. doi: 10.1214/aos/1176344136