We would like to thank the authors (Bayarri et al., Citation2018) for their interesting and thought-provoking paper, and we wish to discuss some issues related to the sample size in general and to the number of covariates in the context of the linear regression model when the Bayesian information criterion (BIC) is used for model selection. Schwarz (Citation1978) was the first to develop tools for estimating the dimension of parameters among distributions in the exponential family and, consequently, to introduce the BIC to serve as an approximation to the Bayesian posterior probability of a given model. The BIC has been used in a broad context and has been widely adopted for model selection, despite the fact that there are situations where the BIC might not be appropriate. Returning to its roots, as this discussion paper does, is essential when the model and the data structure markedly deviate from the original context.
The original BIC targets models arising from distributions belonging to an exponential family, which permits a neat and simple analytical form after the Laplace approximation. The neatness of this form is a blessing but, unfortunately, can be a curse as well. When the data lack the independent and identically distributed (iid) structure, a blind application of the BIC will not survive close scrutiny. As discussed in Bayarri et al. (Citation2018), the sample size in the BIC becomes problematic. The prior-based BIC (PBIC) proposed in Bayarri et al. (Citation2018) is essential to overcome these issues. This paper draws our attention in a timely manner to many unsettled issues related to the use of the BIC in non-standard situations.
1. The classical and mutated BICs
Suppose we have a statistical model, referring to a specific family of distributions as usual, denoted as $M = \{ f(x; \theta) : \theta \in \Theta \}$. The density function $f(x; \theta)$ under consideration is usually chosen to have nice mathematical properties such as being regular. The dimension of the parameter $\theta$ remains the same within a model, though this assumption is not obvious from the above presentation.
When the above model $M$ is chosen for a population and a random sample is provided, statistical analysis aims to infer the $\theta$ value of the population. In the context of Bayesian analysis, the $\theta$ value is regarded as uncertain, and the level of uncertainty is specified by a prior density, say $\pi(\theta)$. The combination of the prior on $\theta$ and the data sampled from the population leads to the posterior distribution, which is the basis of statistical decisions.
When there are many competing models, say $M_1, \ldots, M_K$, a prior probability should be assigned to each of these $M_j$'s. Let $P(M_j)$ denote the prior probability of $M_j$. For notational simplicity, we use $f_j(x; \theta_j)$ for the density function in model $M_j$ and $p_j$ for the dimension of $\Theta_j$, the parameter space of $M_j$; we also use some obvious conventions such as $M$, $p$ and $\theta$ as generic versions.
Let $x = (x_1, \ldots, x_n)$ be a sample of size $n$ from a distribution which is a member of model $M$. By the Bayes formula, the posterior probability that this $M$ is $M_j$ is proportional to

(1) $\mathrm{post}(M_j) \propto P(M_j) \int \prod_{i=1}^{n} f_j(x_i; \theta_j)\, \pi_j(\theta_j)\, d\theta_j,$

where $\pi_j(\theta_j)$ is the prior density of $\theta_j$ under $M_j$. Equation (1) precisely corresponds to the expression on p. 462 of Schwarz (Citation1978). The development in Schwarz (Citation1978) is restricted to the exponential family and works under the assumption that the density is a function of $x$ through the sufficient statistic $y(x)$. Other than the factor $P(M_j)$, our (1) duplicates the marginal density $m_j(x)$ of Bayarri et al. (Citation2018). The subindex $j$ in our expression highlights the fact that the form of $m_j(x)$ depends on model $M_j$.
If an accurate computation of $\mathrm{post}(M_j)$ is cheap, then we would select the model that maximises the posterior probability. In most cases, using some computationally feasible approximation is more realistic, but this may lead to complications.
Suppose $x$ consists of $n$ independent and identically distributed observations. Let $\hat{\theta}_j$ be the maximum likelihood estimator (MLE) of $\theta_j$ under model $M_j$ and $\ell_n(\theta)$ be a generic log-likelihood function suitable for all $M_j$. Then under reasonable conditions, the Laplace approximation leads to the authentic BIC:

(2) $\mathrm{aBIC}(M_j) = 2 \ell_n(\hat{\theta}_j) - p_j \log n + 2 \log P(M_j) + c_j,$

where $c_j = p_j \log(2\pi) + 2 \log \pi_j(\hat{\theta}_j) - \log |I_j(\hat{\theta}_j)|$ and $I_j(\theta)$ is the per-observation Fisher information matrix under $M_j$. Note that $c_j$ crucially depends on $M_j$ through at least $p_j$, $\pi_j$ and $f_j$, as we understand that $\ell_n(\theta_j)$ and $\log f_j(x; \theta_j)$ are two names for the same notion. When $n$ is very large, the $2 \log P(M_j)$ and $c_j$ can be wrapped up into an $O(1)$ term and the $\mathrm{aBIC}(M_j)$ arrives at the classical BIC:

(3) $\mathrm{BIC}(M_j) = 2 \ell_n(\hat{\theta}_j) - p_j \log n.$
However, unless $n$ is very large, the size of the omitted constant is not negligible. In other words, the aBIC in (2) and the BIC in (3) can be very different, so that the BIC would no longer be a good approximation to the aBIC. Should this constant be taken into consideration? Bayarri et al. (Citation2018) give a positive answer to this question.
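To see numerically that the omitted term is $O(1)$ rather than vanishing, the following sketch (our own hypothetical conjugate example, not taken from either paper) compares twice the exact log marginal likelihood with the classical BIC for iid $N(\theta, 1)$ data under a $N(0, \tau^2)$ prior:

```python
import math
import random

def log_normal_pdf(x, mean, var):
    # log density of N(mean, var) at x
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def exact_log_marginal(xs, tau2):
    # closed-form log marginal likelihood for x_i ~ N(theta, 1) iid
    # with the conjugate prior theta ~ N(0, tau2)
    n = len(xs)
    xbar = sum(xs) / n
    s = sum((x - xbar) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * s
            + 0.5 * math.log(2 * math.pi / n)
            + log_normal_pdf(xbar, 0.0, tau2 + 1.0 / n))

def classical_bic(xs):
    # classical BIC in the convention of (3): 2*loglik(MLE) - p*log(n), p = 1
    n = len(xs)
    xbar = sum(xs) / n
    loglik = sum(log_normal_pdf(x, xbar, 1.0) for x in xs)
    return 2 * loglik - math.log(n)

random.seed(1)
gaps = []
for n in (50, 500, 5000):
    xs = [random.gauss(0.3, 1.0) for _ in range(n)]
    gaps.append(2 * exact_log_marginal(xs, tau2=4.0) - classical_bic(xs))
    print(n, round(gaps[-1], 3))
# the gap settles near a non-zero constant: the omitted term is O(1), not o(1)
```

As $n$ grows, the gap stabilises at a non-zero value determined by the prior and the Fisher information, rather than shrinking to zero.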
With this in mind, let us look into the details of the sample size $n$ in the classical BIC and in the prior-based BIC of the authors for a given data set. Recall that the BIC in Schwarz (Citation1978) is derived when there are $n$ iid observations from an exponential family; these observations may be of any dimension, which is not necessarily the dimension of the parameters. The dimension of parameters $p$ in the BIC refers to the dimension of $y(x)$, a vector of sufficient statistics, not that of $x$ in the exponential family model. Once we leave the comfort zone of iid data and the exponential family, direct application of the BIC in (3) is questionable, though it is now a common practice. Consider an extreme case where we have $n$ iid observations from a distribution in an exponential family, but each observation is duplicated exactly twice in the data set. The apparent sample size is therefore $2n$, but the (correct) likelihood is not affected by the duplication. Applying the classical BIC merely in formality leads to the wrongful BIC:

$\mathrm{BIC}_{\mathrm{wrong}}(M_j) = 2 \ell_n(\hat{\theta}_j) - p_j \log(2n).$

Yet its difference from the rightful BIC is merely a constant $p_j \log 2$, which may well be regarded as part of the $O(1)$ term in the aBIC. We suggest from this analysis that if omitting terms of order $O(1)$ in the BIC is acceptable, the precise definition of the effective sample size is not so crucial.
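The duplication argument can be verified directly; in this sketch the sample size, dimension and log-likelihood value are hypothetical:

```python
import math

def bic(loglik, p, n):
    # classical BIC in the paper's convention (3): 2*loglik - p*log(n)
    return 2 * loglik - p * math.log(n)

# hypothetical fitted model: the values are illustrative only
n, p, loglik = 100, 3, -150.0
rightful = bic(loglik, p, n)      # uses the n distinct observations
wrongful = bic(loglik, p, 2 * n)  # blindly uses the apparent sample size 2n
print(rightful - wrongful)        # p * log(2) ~ 2.079, constant in n
```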
Suppose that $\theta$ is a vector. When $\theta$ is in a small neighbourhood of the truth, and therefore also in a small neighbourhood of $\hat{\theta}$ when the MLE is consistent, we have

$\ell_n(\theta) \approx \ell_n(\hat{\theta}) - \tfrac{1}{2} (\theta - \hat{\theta})^{\top} I_n(\hat{\theta}) (\theta - \hat{\theta}),$

where $I_n(\theta)$ is the Fisher information matrix of the whole data at $\theta$. The faithful Laplace approximation would lead to

(4) $\mathrm{aBIC}(M_j) = 2 \ell_n(\hat{\theta}_j) - \log |I_n(\hat{\theta}_j)| + 2 \log P(M_j) + 2 \log \pi_j(\hat{\theta}_j) + p_j \log(2\pi),$

assuming that $n^{-1} I_n(\theta)$ converges to a positive definite matrix as $n \to \infty$. By this, we realise that the aBIC remains dependent on $p_j$, $\pi_j$ and $P(M_j)$, but the dependence of the aBIC on $n$ has been accommodated in the Fisher information. In common applications, we may choose to omit the $O(1)$ constants related to $\pi_j$ and $P(M_j)$ in the BIC, after which we seem to have defined the effective sample size via $|I_n(\hat{\theta})|$.
Consider Example 1.3 of Bayarri et al. (Citation2018), where $n/2$ observations are iid from $N(\theta, \sigma_1^2)$ and another $n/2$ observations are iid from $N(\theta, \sigma_2^2)$. The Fisher information for $\theta$ is given by $I_n(\theta) = (n/2)(\sigma_1^{-2} + \sigma_2^{-2})$. Our understanding is therefore in good agreement with Bayarri et al. (Citation2018). In Example 1.4 of Bayarri et al. (Citation2018), the Fisher information for $\theta$ can be computed in the same fashion; hence, using the above suggested approximate aBIC, we would get the same effective sample size. Our suggestion on the sample size is also found reasonable when applied to Example 1.2 of Bayarri et al. (Citation2018), and is therefore in good agreement with the prior-based BIC of the authors.
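As a quick numerical check of the Example 1.3 reasoning, the sketch below uses hypothetical variances (1 for the precise half and $1000^2$ for the noisy half; these particular values are our own choice, not taken from the paper):

```python
def fisher_info_two_groups(n, var1, var2):
    # Fisher information I_n(theta) for a common mean theta when
    # n/2 observations have variance var1 and n/2 have variance var2
    return (n / 2) * (1.0 / var1 + 1.0 / var2)

# hypothetical setting: a precise half and a very noisy half
n = 1000
info = fisher_info_two_groups(n, 1.0, 1000.0 ** 2)
# for a scalar theta with unit reference variance, I_n itself plays the
# role of the effective sample size defined via |I_n(theta_hat)|
print(info)  # ~ 500, i.e. n/2: the noisy half contributes almost nothing
```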
2. The role of parameter dimension p
Our view on the role of the dimension of the parameter $p$ in the BIC differs from that of Bayarri et al. (Citation2018). Our starting point is that part of what is omitted as an $O(1)$ term in the aBIC to arrive at the BIC is related to $p$. When $p$ is very large, the resulting approximation may lead to a nonsensical model selection criterion. We use as an example the extended BIC (EBIC), proposed by Chen and Chen (Citation2008), which is suitable for small-$n$-large-$p$ problems. Consider the classical linear regression model where $n$ independent observations are obtained and the dimension of the explanatory variable is $q$; we use $q$ instead of $p$ to avoid potential confusion. In the era of big data, the number of explanatory variables $q$ can be much larger than $n$. Let $\mathcal{A}_j$ be the collection of models in which the expectation of the response is a linear combination of exactly $j$ explanatory variables. One generally regards each specific set of $j$ explanatory variables as making up a model in its own right. Let $M_{j1}, \ldots, M_{j\tau_j}$ be these models. From this angle, the cardinality of $\mathcal{A}_j$ is $\tau_j = \binom{q}{j}$. When the aBIC is used, a prior probability $P(M_{jk})$ is required for every $M_{jk}$, and these lead to a total prior probability $P(\mathcal{A}_j) = \sum_{k=1}^{\tau_j} P(M_{jk})$ for $\mathcal{A}_j$.

Suppose one puts an equal prior probability on every individual model, as is clearly the default choice in the BIC. We then have $P(\mathcal{A}_j) \propto \tau_j = \binom{q}{j}$ for the model set $\mathcal{A}_j$. When $q = 1{,}000{,}000$, we have $P(\mathcal{A}_2)/P(\mathcal{A}_1) = (q-1)/2 \approx 500{,}000$. In small-$n$-large-$p$ problems, this implies that a linear model with two explanatory variables is about 500,000 times more likely to be selected than a model with one explanatory variable if the BIC is applied without any modification. This is apparently controversial, and it leads to inconsistent model selection when $n$ is of a lower order than $q$.
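The prior mass ratio under the equal per-model prior can be verified directly:

```python
from math import comb

q = 1_000_000
# total prior mass on the classes A_1 and A_2 under an equal per-model
# prior is proportional to their cardinalities C(q, 1) and C(q, 2)
ratio = comb(q, 2) / comb(q, 1)
print(ratio)  # (q - 1) / 2 = 499999.5
```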
To fix this problem, Chen and Chen (Citation2008) suggest putting $P(M_{jk}) \propto \tau_j^{-\gamma}$ for some $\gamma \in [0, 1]$. In applications, one would put an upper bound $J$, not depending on $n$ or $q$, on the number of explanatory variables allowed. Applying the Laplace approximation, the extended BIC is obtained:

$\mathrm{EBIC}_{\gamma}(M_{jk}) = 2 \ell_n(\hat{\theta}_{jk}) - j \log n - 2\gamma \log \tau_j, \quad M_{jk} \in \mathcal{A}_j.$

Although the choice of $\gamma = 1$, which assigns an equal total prior probability to each class $\mathcal{A}_j$, is most natural, their simulation results suggest that intermediate values of $\gamma$ give a better trade-off between model complexity and parsimony. When $q$ is very large, the EBIC demands stronger evidence in order to accommodate a model with another explanatory variable.
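A minimal sketch of how the extra EBIC penalty changes a comparison between nested models; the log-likelihood values are hypothetical and the sign convention follows (3) (larger is better):

```python
import math
from math import comb

def ebic(loglik, j, n, q, gamma):
    # EBIC in the maximisation convention of (3); the extra penalty
    # 2*gamma*log(C(q, j)) reflects the number of models in the class A_j
    return 2 * loglik - j * math.log(n) - 2 * gamma * math.log(comb(q, j))

# hypothetical fits: the second variable buys 4 extra log-likelihood units
n, q = 200, 10_000
small = ebic(-300.0, 1, n, q, gamma=1.0)
large = ebic(-296.0, 2, n, q, gamma=1.0)
print(small > large)  # True: EBIC rejects the extra variable ...
print(ebic(-296.0, 2, n, q, gamma=0.0) > ebic(-300.0, 1, n, q, gamma=0.0))
# ... True: while the plain BIC (gamma = 0) would accept it
```

With $\gamma = 0$ the criterion reduces to the classical BIC, so the contrast isolates the effect of the $\log \tau_j$ penalty.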
The development of the EBIC largely overlooks other Bayesian aspects of the BIC. Refinements along the lines of the PBIC can be fruitful.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Jiahua Chen
Professor Jiahua Chen is Canada Research Chair, Tier I at the University of British Columbia.
Zeny Feng
Professor Zeny Feng is an associate professor at the University of Guelph.
References
- Bayarri, M., Berger, J. O., Jang, W., Ray, S., Pericchi, L. R., & Visser, I. (2018). Prior-based Bayesian information criterion. Statistical Theory and Related Fields.
- Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. doi: 10.1093/biomet/asn034
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. doi: 10.1214/aos/1176344136