
Discussion of ‘Prior-based Bayesian Information Criterion (PBIC)’

Pages 26-29 | Received 16 Apr 2019, Accepted 22 Apr 2019, Published online: 02 May 2019

1. Summary

The authors have the basis for a reformulation of the BIC as we think of it now. This problem is both hard and important. In particular, to address it, the authors have put six incisive ideas in sequence. The first is the separation of parameters that are common across models versus those that aren't. The second is the use of an orthogonal (why not orthonormal?) transformation of the Fisher information matrix to get diagonal entries $d_i$ that summarise the parameter-by-parameter efficiency of estimation. The third is using Laplace's method only on the likelihood, i.e. Taylor expanding the log-likelihood and using the MLE rather than centring a Taylor expansion at the posterior mode. (From an estimation standpoint the difference between the MLE and the posterior mode is $O(1/n)$ and can be neglected.) The fourth is the particular prior selection that the third step enables. Since the prior is not approximated by, say, $\pi(\hat\theta)$, the prior can be chosen to have an impact, and the only way the prior won't wash out is if its tails are heavy enough. Fifth is defining an effective sample size $n_i^e$ that differs from parameter to parameter. Finally, sixth, is imposing a relationship between the diagonal elements $d_i$ and the 'unit information' $b_i$ by way of the $n_i^e$. (All notation and terminology here is the same as in the paper, unless otherwise noted.)

Taken together, the result is a PBIC that arises as an approximation to $-2\log m(x^n)$, where $X^n=(X_1,\dots,X_n)=(x_1,\dots,x_n)=x^n$ is the data. This matches the $O(1)$ asymptotics of the usual BIC.

The main improvement in perspective on the BIC that this paper provides is the observation that different efficiencies for estimating different parameters are important to include in model selection. Intuitively, if a parameter is easier to estimate in one model (larger Fisher information) than the corresponding parameter in another model, then ceteris paribus the first model should be preferred. (The use of ceteris paribus covers a lot of ground, but helps make the point about efficiency.) The neglect of comparative efficiencies of parameters is an important gap to fill in the literature on the BIC and model selection more generally.

The focus on the Fisher information $I(\cdot)$ – see Sec. 3.2 in particular – supports this view; however, one must wonder whether there is more to be gained from either the off-diagonal elements of $I$ or from the orthogonal (orthonormal) matrix $O$. The constraint $b_i = n_i^e d_i$ is also a little puzzling. It makes sense because $b_i$ is interpretable as something like the Fisher information relative to parameter $i$. (In this sense it's not clear why it's called the 'unit information'.) The prior selection is very perceptive – and works – but there does not seem to be any unique, general conceptual property that it possesses. Even though it gives an effective result, the prior selection seems a little artificial. The authors may of course counter-argue that one of the reasons to use a prior is precisely that it represents information one has outside the data.

Setting aside such nit-picking, let us turn to the substance of the contribution.

2. Other forms for the BIC?

For comparison, let us try to modify the BIC in three other ways. The first is a refinement of the BIC to identify the constant c in Result 1.1. The second is to look more closely at the contrast between the PBIC the authors propose and a more conventional approach. The third is a discussion of an alternative that starts with an effective sample size rather than bringing it in via the prior.

First observe that the conventional expression for the BIC is actually only accurate to $O_P(1)$, not $o_P(1)$. However, the constant term can be identified. Let $x^n$ be IID $P_\theta$. Staring at Result 1.1 and using a standard Laplace's method analysis of $m(x^n)$ gives that
\[
\Bigl|\log\frac{p(X^n\mid\hat\theta)\,\pi(\hat\theta)}{m(X^n)} - \frac{p}{2}\log\frac{n}{2\pi} - \frac{1}{2}\log\det\hat I(\hat\theta)\Bigr| \to 0 \tag{1}
\]
in $P_\theta$-probability; see Clarke and Barron (1988). So, a more refined version of the BIC expression, which makes the $O_P(1)$ constant explicit, is
\[
\mathrm{BIC}_{\mathrm{better}} = -2\ell(\hat\theta) + p\log n = -2\log m(x^n) + p\log 2\pi - \log\det\hat I(\hat\theta) + 2\log\pi(\hat\theta) + o_P(1). \tag{2}
\]
Using (2) may largely address Problem 1 as identified by the authors. Minimising $\mathrm{BIC}_{\mathrm{better}}$ over candidate models is loosely like maximising $m(x^n)$ subject to a penalty term in $p$ and $I$, i.e. loosely like finding the model that achieves the maximal penalised maximum likelihood if the mixture density were taken as the likelihood. Expression (2) can be rearranged to give an expression for $m(x^n)$. Indeed, one can plausibly argue that maximising $m(x^n)$ over models (and priors) under some restrictions should be a useful statistic for model selection.
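To make the identity in (2) concrete, here is a small numerical sketch of ours (not from the paper) in a conjugate setting where $m(x^n)$ is available in closed form: $X_i\sim N(\theta,1)$ IID with prior $\theta\sim N(0,\tau^2)$. The sample size $n=200$, the prior scale $\tau=2$, and the simulated data are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, tau = 200, 2.0                                # illustrative sample size and prior sd
x = rng.normal(loc=0.5, scale=1.0, size=n)
xbar, S = x.mean(), ((x - x.mean())**2).sum()

# Exact log marginal likelihood for X_i ~ N(theta, 1) with theta ~ N(0, tau^2)
log_m = (-0.5*n*np.log(2*np.pi) - 0.5*S
         + 0.5*np.log(2*np.pi/n)
         + norm.logpdf(xbar, loc=0.0, scale=np.sqrt(tau**2 + 1.0/n)))

loglik_hat = -0.5*n*np.log(2*np.pi) - 0.5*S      # log-likelihood at the MLE, theta_hat = xbar
p = 1
bic_usual = -2*loglik_hat + p*np.log(n)

# Right-hand side of (2): the O_P(1) constant written out explicitly
fisher_hat = 1.0                                 # per-observation Fisher information for this model
log_prior_at_mle = norm.logpdf(xbar, loc=0.0, scale=tau)
rhs = -2*log_m + p*np.log(2*np.pi) - np.log(fisher_hat) + 2*log_prior_at_mle

print(f"-2 log m(x^n)          : {-2*log_m:10.3f}")
print(f"usual BIC              : {bic_usual:10.3f}")   # differs from -2 log m by an O_P(1) amount
print(f"right-hand side of (2) : {rhs:10.3f}")         # agrees with the usual BIC up to o_P(1)
```

In this run the two sides of (2) agree to roughly three decimal places, while the usual BIC and $-2\log m(x^n)$ differ by a non-vanishing constant.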

This is intuitively reasonable … until you want to take the intuition of the authors into account, viz. that different $\theta_j$'s in $\theta=(\theta_1,\dots,\theta_p)$ require different sample sizes to estimate equally well, or correspond to different effective sample sizes. One expects this effect to be greater as more and more models are under consideration. It is therefore natural to focus on the parameters that distinguish the models from each other rather than the common parameters. So, for ease of exposition we assume that $\theta=\theta_{(1)}$, i.e. that $\theta_{(2)}$ does not appear. (In simple examples like linear regression $\theta_{(2)}$ often corresponds to the intercept and can be removed by centring the data.)

So, second, let us look at Laplace's method applied to $m(x^n)$. Being informal about a second-order Taylor expansion and using standard notation gives
\[
m(x^n)=\int p(x^n\mid\theta)\,\pi(\theta)\,d\theta = p(x^n\mid\hat\theta)\times\int e^{-(n/2)(\theta-\hat\theta)^{T}\hat I(\tilde\theta)(\theta-\hat\theta)}\,\pi(\theta)\,d\theta.
\]
(The domain of integration is $\mathbb{R}^p$, but this can be cut down to a ball $B(\hat\theta,\epsilon)$ by allowing error terms of order $O(e^{-n\gamma})$ for suitable $\gamma>0$. Then the Taylor expansion can be used. Finally, one can go back to the original domain of integration again by adding an exponentially small error term.) Standard conditions (see e.g. Clarke & Barron, 1988) give that $\tilde\theta$ can be replaced by $\theta_T$ and the empirical Fisher information $\hat I$ can be replaced by the actual Fisher information. Thus:
\[
m(x^n)\approx p(x^n\mid\hat\theta)\int e^{-(n/2)(\theta-\hat\theta)^{T} I(\theta_T)(\theta-\hat\theta)}\,\pi(\theta)\,d\theta.
\]
The integrand is a normal density that can be integrated in closed form, apart from the $\pi$. By another approximation (that seems to be asymptotically tight up to an $O_P(\epsilon n^{p/2})$ factor, where $\epsilon$ can be arbitrarily small) we get:
\[
m(x^n)\approx p(x^n\mid\hat\theta)\int e^{-(n/2)(\theta-\theta_T)^{T} I(\theta_T)(\theta-\theta_T)}\,\pi(\theta)\,d\theta. \tag{3}
\]
So far, this is standard. It becomes more interesting when the technique of the authors is invoked. Essentially, they diagonalise $I(\theta_T)$. For this, the $p$ eigenvalues must be strictly positive, but that is not usually a difficult assumption to satisfy. Write $D = O^{T} I(\theta_T)\,O$, where $O$ is an orthonormal matrix, i.e., a rotation, and $D=\mathrm{diag}(d_1,\dots,d_p)$. (The authors use an orthogonal matrix, but an orthonormal matrix seems to give cleaner results.) Now, consider the transformation $\xi = O^{T}(\theta-\theta_T)$, so that $d\xi = d\theta$ by the orthonormality of $O$. Note that the transformation has been simplified since the argument of $O$ is $\theta_T$. Now, the integral on the right-hand side of expression (3) is
\[
\int e^{-(n/2)\xi^{T} O^{T} I(\theta_T)\,O\,\xi}\,\pi\bigl(O(\theta_T)\xi+\theta_T\bigr)\,d\xi = \int e^{-(n/2)\xi^{T} D\,\xi}\,\pi\bigl(O(\theta_T)\xi+\theta_T\bigr)\,d\xi. \tag{4}
\]
At this point the authors, rather than using Laplace's method on the integral, choose $\pi$ as a product of individual $\pi_i$'s, one for each $\xi_i$. Each factor in that product has hyperparameters $\lambda_i$, $d_i$, and $b_i$, and the resulting $p$-dimensional integral in (4) has a closed form as given at the end of Sec. 2.
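As a small illustration of the diagonalisation step (our own sketch, with an arbitrary positive-definite matrix standing in for $I(\theta_T)$), the rotation $O$ and the entries $d_i$ can be obtained from an eigendecomposition, and the Gaussian factor in (4), ignoring the prior, then integrates coordinate by coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 500                                   # illustrative dimension and sample size

# An arbitrary symmetric positive-definite matrix standing in for I(theta_T)
A = rng.normal(size=(p, p))
I_theta = A @ A.T + p * np.eye(p)

# Orthonormal diagonalisation: D = O^T I(theta_T) O with D = diag(d_1, ..., d_p)
d, O = np.linalg.eigh(I_theta)                  # columns of O are orthonormal eigenvectors
D = np.diag(d)
assert np.allclose(O.T @ I_theta @ O, D)        # the rotation diagonalises I(theta_T)
assert np.allclose(O.T @ O, np.eye(p))          # orthonormality, so d(xi) = d(theta)

# Dropping the prior factor, the Gaussian integral over xi factors coordinate by
# coordinate: each factor is the integral of exp(-(n/2) d_i xi_i^2), i.e. sqrt(2 pi/(n d_i)).
per_coordinate = np.sqrt(2*np.pi/(n*d))
closed_form = per_coordinate.prod()
full_gaussian = (2*np.pi)**(p/2) * np.linalg.det(n*I_theta)**(-0.5)
assert np.allclose(closed_form, full_gaussian)  # same normalising constant either way

print("d_i (parameter-wise efficiencies):", np.round(d, 3))
print("Gaussian integral, product form  :", closed_form)
```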

An alternative is the more conventional approach of recognising that as $n\to\infty$ the integrand converges to unit mass at $\xi=0$. Using this gives that $m(x^n)$ is approximately
\[
p(x^n\mid\hat\theta)\,w(\theta_T)\,\frac{(2\pi)^{p/2}}{\det(nD)^{1/2}}\int \frac{e^{-(1/2)\xi^{T}((nD)^{-1})^{-1}\xi}}{(2\pi)^{p/2}\det(nD)^{-1/2}}\,d\xi
= p(x^n\mid\hat\theta)\,w(\theta_T)\,(2\pi)^{p/2}\,\frac{1}{n^{p/2}\bigl((\prod_{i=1}^{p} d_i)^{1/p}\bigr)^{p/2}}
= p(x^n\mid\hat\theta)\,w(\theta_T)\,(2\pi)^{p/2}\,\frac{1}{(ns)^{p/2}},
\]
where $s$ is the geometric mean of the $d_i$'s. The geometric mean is the side length of a $p$-dimensional cuboid with volume equal to $\prod_{i=1}^{p} d_i$. Thus, $s$ plays the role of a sort of average Fisher information for the collection of $\xi_i$'s. This sequence of approximations gives
\[
\log m(x^n)=\ell(\hat\theta)+\log w(\theta_T)+\frac{p}{2}\log(2\pi)-\frac{p}{2}\log(ns).
\]
This leads to a form of the BIC as
\[
\mathrm{BIC}_{S} = -2\ell(\hat\theta)+p\log n = -2\log m(x^n)+2\log w(\theta_T)+p\log(2\pi)-p\log s+o(1). \tag{5}
\]
Comparing (5) and (2), the only difference is that the Fisher information is summarised by $s$, a sort of average efficiency that in effect puts all parameters on the same scale. Roughly, $p\log s$ and $\log\det I(\theta)$ correspond to the term $\sum_{i=1}^{p}\log(1+n_i^e)$ in the PBIC. The extra term in the PBIC, $-2\sum_{i=1}^{p}\log\bigl((1-e^{-v_i})/(2v_i)\bigr)$, seems to correspond to the log prior density term.
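To see the summary role of $s$ numerically (again with an arbitrary stand-in for $I(\theta_T)$ chosen by us), note that $p\log s$ equals $\log\det I(\theta_T)$ exactly, so the single number $s$ carries the same penalty information as the full determinant:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 4, 1000                                   # illustrative dimension and sample size
A = rng.normal(size=(p, p))
I_theta = A @ A.T + p*np.eye(p)                  # arbitrary stand-in for I(theta_T)
d = np.linalg.eigvalsh(I_theta)                  # the d_i from the orthonormal diagonalisation

s = np.exp(np.log(d).mean())                     # geometric mean of the d_i's
assert np.allclose(p*np.log(s), np.log(np.linalg.det(I_theta)))  # p log s = log det I(theta_T)

# The (p/2) log(ns) term in log m(x^n) above, and the -p log s correction in (5),
# therefore use one scalar in place of the whole matrix of efficiencies.
print("geometric mean s      :", round(float(s), 4))
print("p log(n s)            :", round(float(p*np.log(n*s)), 4))
print("log det(n I(theta_T)) :", round(float(np.log(np.linalg.det(n*I_theta))), 4))
```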

As a third way to look at the BIC, observe that neither (5) nor (2) has any clear analogue to $n_i^e$, apart from the treatment of the Fisher information and its interpretation as an efficiency. So, two natural questions are what the effective sample sizes mean and what they are doing. In the PBIC they are introduced as hyperparameters and are restricted to linear models. For instance, in Example 3.3, effective sample sizes are average precisions divided by the maximal precision, even though it is unclear why this expression has a claim to be an effective sample size.

On the other hand, in Sec. 3.2 a general definition of $n_j^e$ in terms of entries of $I(\theta)$ is given for each $j=1,\dots,p$. This is a valid generalisation of sample size because the $n_j^e$'s reduce to $n$. Indeed, in the IID case with large $n$, $I_{jj}(\hat\theta)\approx n I_{jj}(\theta_T)$ and $w_{ij}\approx 1/n$. So,
\[
\sum_{i=1}^{n} w_{ij}\, I_{ijj}(\theta_T)\approx \frac{1}{n}\sum_{i=1}^{n} I_{ijj}(\theta_T)\approx \frac{1}{n}\bigl(n I_{jj}(\theta_T)\bigr)=I_{jj}(\theta_T).
\]
This gives $n_j^e\approx n I_{jj}(\theta_T)/I_{jj}(\theta_T)=n$. In this generalisation, each $n_j^e$ is closely related to the Fisher information and hence to the relative efficiency of estimating different parameters. Indeed, $n_j^e$ is, roughly, the total Fisher information for $\theta_j$ (over the sample) as a fraction of the convex combination of Fisher informations for the $\theta_j$'s over the data.
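A minimal numerical check of this reduction, under our reading of the paragraph above (the ratio of the total Fisher information for $\theta_j$ to the weighted combination $\sum_i w_{ij} I_{ijj}$, with weights close to $1/n$; the paper's Sec. 3.2 should be consulted for the exact definition):

```python
import numpy as np

n = 250
I_ijj = np.full(n, 1.7)            # per-observation Fisher information I_ijj(theta_T), constant under IID
w_ij = np.full(n, 1.0/n)           # weights w_ij, approximately 1/n in the IID case

total_info = I_ijj.sum()           # approximately n * I_jj(theta_T)
unit_level = (w_ij * I_ijj).sum()  # approximately I_jj(theta_T)
print(total_info / unit_level)     # the effective sample size recovers n = 250
```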

Now, it may make sense to use the definition of $n_j^e$ in Sec. 3.2 to generalise the BIC directly, i.e. find the $n_j^e$'s first, since they depend only on the Fisher informations and on $x^n$, and use them to propose a new BIC. For instance, consider
\[
\mathrm{BIC}_{\mathrm{TESS}} = -2\ell(\hat\theta)+\sum_{i=1}^{p}\log n_i^e. \tag{6}
\]
In (6), the concept of effective sample size is used to account for the different efficiencies of estimating different parameters, making it valid to compare them. Note that (6) levels the playing field for the $f_i(\theta)$'s in the log-likelihood so that they do not need to be modified. Thus, effective sample sizes have a meaning something like the sample size required to estimate one parameter (to a prescribed accuracy) made comparable to the sample size required to estimate another parameter (to the same accuracy), a parallel to the appearance of the geometric mean in (5).
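As a sketch of how (6) would be used, with hypothetical numbers chosen only to contrast the penalties:

```python
import numpy as np

def bic_tess(loglik_hat, n_eff):
    """Display (6): -2 times the log-likelihood at the MLE plus sum_i log(n_i^e)."""
    return -2.0*loglik_hat + np.log(np.asarray(n_eff, dtype=float)).sum()

# Hypothetical values: p = 3 parameters, nominal sample size n = 400, but the
# parameters are estimated with different efficiencies, so their n_i^e differ.
loglik_hat = -512.3
n, n_eff = 400, [400.0, 180.0, 60.0]

print("usual BIC penalty, p log n :", 3*np.log(n))          # treats every parameter alike
print("penalty in (6)             :", np.log(n_eff).sum())  # smaller n_i^e contribute smaller log terms
print("BIC_TESS                   :", bic_tess(loglik_hat, n_eff))
```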

At this point, one can go back to (3) and (4) and seek ways to justify using $n_i^e$ in place of $n$. Because (4) is nearly a product of univariate integrals, it may be possible to regard the elements on the diagonal of $D$ as a form of the Fisher information that permits replacement of $n$ with $n_i^e$. Similarly, the geometric mean used in (5) may be related (by, say, a logarithm) to the ratios of sums of Fisher informations used to define $n_i^e$ in Sec. 3.2, thereby relating (5) and (6). Finally, (6) is not obviously related to $m(x^n)$, but one can hope that a suitably reformulated Laplace's method applied to (3) and (4) may lead to a compatible expression for it.

One interesting query the authors are well placed to answer is whether the results of Sec. 5.5 hold if the PBIC is replaced by (6). After all, there should be reasonable conditions under which all the $n_i^e$'s from Sec. 3.2 increase fast enough with $n$, e.g. for all $n$, $0<\eta<\min_j I_{jj}\le\max_j I_{jj}<B<\infty$.

3. Where to from here?

The authors have a very promising general definition in Sec. 3.2. Establishing a relationship between $n_j^e$ and the effective sample size formulae proposed for linear models would be useful, but more fundamentally, the question is whether the $n_j^e$ from Sec. 3.2 makes sense in such simpler contexts. If it does, then the fact that it differs from 'TESS' may not be very important. We strongly agree with the authors who write, à propos of $n_j^e$, that it should 'be viewed primarily as a starting point for future investigations of effective sample size'. (They actually limit this point to nonlinear models, but for the sake of a satisfying overall theory it should apply to linear models as well.)

Another tack is to be overtly information-theoretic by defining an effective sample size in terms of codelength. One form of the relative entropy, see Clarke and Barron (1988), is implicit in (2). However, one can use an analogous formulation to convert a putative sample of size $n$ to an effective sample. Use a nonparametric estimator to form $h(x;x^n)$, an estimate of the density of $X$. Then, choose a 'distortion rate' $r$ and find $z^m$ for the smallest value of $m$ that satisfies $D\bigl(h(\cdot;x^n)\,\|\,h(\cdot;z^m)\bigr)\le r$, where $D(\cdot\|\cdot)$ is the relative entropy. This is the effective sample and sample size, since it recreates the empirical density with a tolerable level of distortion. The larger $r$ is, the more distortion is allowed and the smaller $m$ will be. Information-theoretically, this is the same as approximating a Shannon code based on $h(\cdot;x^n)$ by a Shannon code based on $h(\cdot;z^m)$ in terms of small redundancy, in, say, bits. This definition of effective sample size requires choosing $r$, but $D$ is in bits, so it would make sense for $r$ to be some function of bits per symbol, e.g. $\epsilon(\sigma n)^{c}$, where $\mathrm{Var}(X)=\sigma^2$, for some $c\in(0,1]$, with $\epsilon=1/2$ as a default.
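One simple way to operationalise this (a sketch of ours, not the authors' proposal) is to take $h(\cdot;x^n)$ to be a Gaussian kernel density estimate, take $z^m$ to be a random subsample of size $m$, and approximate the relative entropy on a grid:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)                     # illustrative data; h(.; x^n) is a KDE of the full sample

grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 2000)
eps = 1e-12                                # guards the log against zero density values
h_full = gaussian_kde(x)(grid) + eps

def kl_bits(p, q, grid):
    """Relative entropy D(p || q) in bits, approximated by trapezoidal integration on the grid."""
    return np.trapz(p*np.log2(p/q), grid)

r = 0.05                                   # distortion rate in bits (an arbitrary choice here)
for m in range(10, n + 1, 10):             # search for the smallest m meeting the tolerance
    z = rng.choice(x, size=m, replace=False)   # z^m taken as a random subsample, for illustration
    h_sub = gaussian_kde(z)(grid) + eps
    if kl_bits(h_full, h_sub, grid) <= r:
        print("effective sample size m =", m)
        break
else:
    print("no m <= n met the tolerance r =", r)
```

Larger values of $r$ admit more distortion and so terminate the search at smaller $m$, as described above.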

Another way to look at this procedure for finding an effective sample size is via data compression. In this context, the rate distortion function is a well-studied quantity; see Cover and Thomas (2006), Chap. 10. The problem is that it is not obvious how to obtain an effective sample size from the rate distortion function or, in the parlance of information theory, from a set of lower-dimensional canonical representatives that achieve the rate distortion function lower bound. On the other hand, this can be done in practice, and further study may yield good solutions.

Finally, the rate distortion function is the result of an operation performed on a Shannon mutual information that, for parametric families, usually has an expression in terms of the Fisher information. Likewise, it is well known that certain relative entropies can be expressed in terms of the Fisher information. So, the definitions of effective sample size from an information theory perspective (via rate distortion) and from Sec. 3.2 (via efficiency) may ultimately coincide.

Disclosure statement

No potential conflict of interest was reported by the author.

References

  • Clarke, B., & Barron, A. (1988). Information-theoretic asymptotics of Bayes methods (Technical Report #26). Stat. Dept., Univ. Illinois.
  • Cover, T., & Thomas, J. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: John Wiley and Sons.
