1. Summary
The authors have the basis for a reformulation of the BIC as we think of it now. This problem is both hard and important. In particular, to address it, the authors have put six incisive ideas in sequence. The first is the separation of parameters that are common across models versus those that are not. The second is the use of an orthogonal (why not orthonormal?) transformation of the Fisher information matrix to get diagonal entries that summarise the parameter-by-parameter efficiency of estimation. The third is using Laplace's method only on the likelihood, i.e. Taylor expanding the log-likelihood and using the MLE rather than centring a Taylor expansion at the posterior mode. (From an estimation standpoint the difference between the MLE and the posterior mode is $O_P(n^{-1})$ and can be neglected.) The fourth is the particular prior selection that the third step enables. Since the prior is not approximated by, say, its value at the MLE, the prior can be chosen to have an impact, and the only way the prior won't wash out is if its tails are heavy enough. Fifth is defining an effective sample size that differs from parameter to parameter. Finally, sixth, is imposing a relationship between the diagonal elements of the transformed Fisher information and the ‘unit information’. (All notation and terminology here is the same as in the paper, unless otherwise noted.)
Taken together, the result is a PBIC that arises as an approximation to $-2\log m(x)$, where $x$ is the data. This matches the asymptotics of the usual BIC.
The main improvement in perspective on the BIC that this paper provides is the observation that different efficiencies for estimating different parameters are important to include in model selection. Intuitively, if a parameter is easier to estimate in one model (larger Fisher information) than the corresponding parameter in another model, then ceteris paribus the first model should be preferred. (The use of ceteris paribus covers a lot of ground, but helps make the point about efficiency.) Neglecting comparative efficiencies of parameters is an important gap to fill in the literature on the BIC and model selection more generally.
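This efficiency intuition can be illustrated with a small simulation. For IID $N(\theta, \sigma^2)$ data with σ known, the Fisher information for θ is $n/\sigma^2$, and the variance of the MLE (the sample mean) tracks its inverse; the sample size, variances, and seed below are illustrative choices, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 4000

def mle_variance(sigma):
    # Empirical variance of the MLE (the sample mean) over many replications.
    draws = rng.normal(0.0, sigma, size=(reps, n))
    return draws.mean(axis=1).var()

# Total Fisher information for theta in N(theta, sigma^2), sigma known,
# is n / sigma^2: larger information means more efficient estimation.
v_high_info = mle_variance(1.0)   # information n / 1
v_low_info = mle_variance(3.0)    # information n / 9
```

In both cases the empirical MLE variance is close to the inverse Fisher information, and the higher-information model estimates its parameter more tightly.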
The focus on the Fisher information – see Sec. 3.2 in particular – supports this view; however, one must wonder if there is more to be gained either from the off-diagonal elements of I or from the orthogonal (orthonormal) matrix O. The constraint tying the diagonal elements to the unit information is also a little puzzling. It makes sense because the quantity involved is interpretable as something like the Fisher information relative to parameter i. (In this sense it's not clear why it's called the ‘unit information’.) The prior selection is very perceptive – and works – but there does not seem to be any unique, general conceptual property that it possesses. Even though it gives an effective result, the prior selection seems a little artificial. The authors may of course counter-argue that one of the reasons to use a prior is precisely that it represents information one has outside the data.
Setting aside such nit-picking, let us turn to the substance of the contribution.
2. Other forms for the BIC?
For comparison, let us try to modify the BIC in three other ways. The first is a refinement of the BIC to identify the constant c in Result 1.1. The second is to look more closely at the contrast between the PBIC the authors propose and a more conventional approach. The third is a discussion of an alternative that starts with an effective sample size rather than bringing it in via the prior.
First observe that the conventional expression for the BIC is actually only accurate to $O_P(1)$, not $o_P(1)$. However, the constant term can be identified. Let $X_1, \ldots, X_n$ be IID. Staring at Result 1.1 and using a standard Laplace's method analysis of $m(x^{n})$ gives that
(1) $\log m(x^{n}) = \log p(x^{n}\mid\hat{\theta}) - \frac{p}{2}\log\frac{n}{2\pi} + \log\pi(\hat{\theta}) - \frac{1}{2}\log\det I(\hat{\theta}) + o_P(1)$
in probability; see Clarke and Barron (1988). So, a more refined version of the BIC expression, which approximates the posterior mode, is
(2) $-2\log p(x^{n}\mid\hat{\theta}) + p\log\frac{n}{2\pi} - 2\log\pi(\hat{\theta}) + \log\det I(\hat{\theta})$
Using (2) may largely address Problem 1 as identified by the authors. Minimising (2) over candidate models is loosely like maximising the log marginal likelihood subject to a penalty term in p and I, i.e. loosely like finding the model that achieves the maximal penalised maximum likelihood if the mixture density were taken as the likelihood. Expression (2) can be re-arranged to give an expression for $\log m(x^{n})$. Indeed, one can plausibly argue that maximising $m(x^{n})$ over models (and priors) under some restrictions should be a useful statistic for model selection.
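The refined expression can be checked numerically in a conjugate normal model, where the exact log marginal likelihood is available in closed form. The data-generating values, prior variance, and seed below are illustrative; in this model the per-observation Fisher information is 1, so the log-determinant term vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau2 = 500, 2.0                       # sample size, prior variance (illustrative)
x = rng.normal(0.3, 1.0, size=n)         # N(theta, 1) data, sigma known
xbar = x.mean()
p = 1                                    # one unknown parameter

# Exact log marginal likelihood for the conjugate normal-normal pair.
A = n + 1.0 / tau2
log_m = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(2 * np.pi * tau2)
         - 0.5 * (x ** 2).sum() + 0.5 * np.log(2 * np.pi / A)
         + (n * xbar) ** 2 / (2 * A))

loglik_hat = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((x - xbar) ** 2).sum()

# Usual BIC-style approximation of log m: loglik - (p/2) log n.
bic_approx = loglik_hat - 0.5 * p * np.log(n)

# Refined version including the O(1) constant: add (p/2) log(2 pi), the log
# prior at the MLE, and -(1/2) log det of the per-observation Fisher info
# (which is zero here since that information equals 1).
log_prior_hat = -0.5 * np.log(2 * np.pi * tau2) - xbar ** 2 / (2 * tau2)
refined = bic_approx + 0.5 * p * np.log(2 * np.pi) + log_prior_hat

err_bic = abs(log_m - bic_approx)
err_refined = abs(log_m - refined)
```

The refined approximation recovers the exact log marginal to well within the $O_P(1)$ gap left by the usual BIC expression.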
This is intuitively reasonable … until you want to take the intuition of the authors into account, viz. that different parameters in a model require different sample sizes to estimate equally well or correspond to different effective sample sizes. One expects this effect to be greater as more and more models are under consideration. It is therefore natural to focus on the parameters that distinguish the models from each other rather than the common parameters. So, for ease of exposition we assume that there are no common parameters, i.e. that the common-parameter block does not appear. (In simple examples like linear regression the common parameter often corresponds to the intercept and can be removed by centring the data.)
So, second, let us look at Laplace's method applied to $m(x^{n})$. Being informal about a second-order Taylor expansion and using standard notation gives
$m(x^{n}) = \int p(x^{n}\mid\theta)\,\pi(\theta)\,d\theta \approx p(x^{n}\mid\hat{\theta})\int \exp\Big\{-\frac{n}{2}(\theta-\hat{\theta})^{T}\hat{I}(\hat{\theta})(\theta-\hat{\theta})\Big\}\,\pi(\theta)\,d\theta.$
(The domain of integration is $\mathbb{R}^{p}$, but this can be cut down to a ball around $\hat{\theta}$ by allowing error terms of order $n^{-\alpha}$ for suitable $\alpha > 0$. Then, the Taylor expansion can be used. Finally, one can go back to the original domain of integration again by adding an exponentially small error term.) Standard conditions (see e.g. Clarke & Barron, 1988) give that the empirical Fisher information, $\hat{I}(\hat{\theta})$, can be replaced by the actual Fisher information $I(\hat{\theta})$. The integrand is then, apart from the π, proportional to a normal density that can be integrated in closed form. By another approximation (that seems to be asymptotically tight up to a factor which can be made arbitrarily close to one) we get:
(3) $m(x^{n}) \approx p(x^{n}\mid\hat{\theta}) \int \exp\Big\{-\frac{n}{2}(\theta-\hat{\theta})^{T} I(\hat{\theta})(\theta-\hat{\theta})\Big\}\,\pi(\theta)\,d\theta$
So far, this is standard. It becomes more interesting when the technique of the authors is invoked. Essentially, they diagonalise $I(\hat{\theta})$. For this, the p eigenvalues must be strictly positive, but that is not usually a difficult assumption to satisfy. Write $I(\hat{\theta}) = O D O^{T}$ where O is an orthonormal matrix, i.e., a rotation, and $D = \mathrm{diag}(d_{1}, \ldots, d_{p})$. (The authors use an orthogonal matrix, but an orthonormal matrix seems to give cleaner results.) Now, consider the transformation $u = O^{T}(\theta - \hat{\theta})$ so that $(\theta-\hat{\theta})^{T} I(\hat{\theta})(\theta-\hat{\theta}) = \sum_{i=1}^{p} d_{i} u_{i}^{2}$ by the orthonormality of O. Note that the transformation has been simplified since the argument of O is $\theta - \hat{\theta}$. Now, the integral in the right-hand side of expression (3) is
(4) $\int \exp\Big\{-\frac{n}{2}\sum_{i=1}^{p} d_{i} u_{i}^{2}\Big\}\,\pi(\hat{\theta} + O u)\,du$
At this point the authors, rather than using Laplace's method on the integral, choose π as a product of individual priors, one for each $u_{i}$. Each factor in that product has its own hyperparameters, and the resulting p-dimensional integral in (4) has a closed form as given at the end of Sec. 2.
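The diagonalisation step can be sketched numerically: for a symmetric positive definite matrix (an illustrative stand-in for the Fisher information), an eigendecomposition gives the orthonormal O and diagonal D, and the quadratic form becomes a weighted sum of squares in the rotated coordinates.

```python
import numpy as np

# Illustrative symmetric positive definite "Fisher information" matrix.
I_fisher = np.array([[4.0, 1.0, 0.5],
                     [1.0, 3.0, 0.2],
                     [0.5, 0.2, 2.0]])

eigvals, O = np.linalg.eigh(I_fisher)    # I = O diag(eigvals) O^T, O orthonormal
D = np.diag(eigvals)

theta_diff = np.array([0.3, -0.1, 0.2])  # stand-in for theta - theta_hat
u = O.T @ theta_diff                     # rotated coordinates
quad_original = theta_diff @ I_fisher @ theta_diff
quad_diagonal = (eigvals * u ** 2).sum() # sum_i d_i * u_i^2
```

The two quadratic forms agree exactly, and the strict positivity of the eigenvalues is visible directly.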
An alternative is the more conventional approach of recognising that, as $n$ increases, the integrand converges to unit mass at $u = 0$. Using this gives that the integral in (4) is approximately $\pi(\hat{\theta})\,(2\pi/(ns))^{p/2}$, where s is the geometric mean of the $d_{i}$'s. The geometric mean is the side length of a p-dimensional cuboid with volume equal to $d_{1}\cdots d_{p} = \det I(\hat{\theta})$. Thus, s plays the role of a sort of average Fisher information for the collection of $u_{i}$'s. This sequence of approximations gives
$m(x^{n}) \approx p(x^{n}\mid\hat{\theta})\,\pi(\hat{\theta})\,\Big(\frac{2\pi}{ns}\Big)^{p/2}.$
This leads to a form of the BIC as
(5) $-2\log p(x^{n}\mid\hat{\theta}) + p\log\frac{ns}{2\pi} - 2\log\pi(\hat{\theta})$
Comparing (5) and (2), the only difference is that the Fisher information is summarised by s, a sort of average efficiency that in effect puts all parameters on the same scale. Roughly, the sample-size and prior terms in (5) correspond to the corresponding penalty terms in the PBIC, and the extra term in the PBIC seems to correspond to the log prior density term.
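The role of s can be checked directly: the geometric mean of the $d_i$'s satisfies $s^{p} = d_1 \cdots d_p = \det I$, so summarising the $d_i$'s by s leaves the determinant factor in the Laplace approximation unchanged. The matrix below is illustrative.

```python
import numpy as np

# Illustrative positive definite Fisher information matrix.
I_fisher = np.array([[4.0, 1.0],
                     [1.0, 3.0]])

d = np.linalg.eigvalsh(I_fisher)   # the d_i's (eigenvalues)
p = len(d)
s = d.prod() ** (1.0 / p)          # geometric mean of the d_i's
```

Since $s^p$ equals the determinant, replacing each $d_i$ by s is exact for the determinant term even though it averages away the per-parameter differences.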
As a third way to look at the BIC, observe that neither (5) nor (2) has any clear analogue to an effective sample size apart from the treatment of Fisher information and its interpretation as an efficiency. So, two natural questions are what the effective sample sizes mean and what they are doing. In the PBIC they are introduced as hyperparameters and are restricted to linear models. For instance, in Example 3.3, effective sample sizes are average precisions divided by the maximal precision, even though it is unclear why this expression has a claim to be an effective sample size.
On the other hand, in Sec. 3.2 a general definition of the effective sample size in terms of entries of the Fisher information matrix is given for each parameter. This is a valid generalisation of sample size because the effective sample sizes reduce to n. Indeed, in the IID case with large n, each observation contributes the same Fisher information for $\theta_{i}$, so the total information is n times the per-observation information and the ratio defining the effective sample size is n. In this generalisation, each effective sample size is closely related to the Fisher information and hence to the relative efficiency of estimating different parameters. Indeed, it is, roughly, the total Fisher information for $\theta_{i}$ (over the sample) as a fraction of the convex combination of the Fisher informations for the individual observations.
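A minimal sketch of why a ratio-of-informations definition reduces to n in the IID case; the per-observation informations below are illustrative, and this is not the paper's exact Sec. 3.2 formula.

```python
import numpy as np

# In the IID case every observation carries the same Fisher information for
# each parameter, so total information divided by the average per-observation
# information is exactly n for every parameter.
n = 50
per_obs_info = np.tile(np.array([2.0, 0.5, 1.3]), (n, 1))  # n identical rows
n_eff = per_obs_info.sum(axis=0) / per_obs_info.mean(axis=0)
```

Each entry of `n_eff` equals n regardless of how the information differs across parameters, which is exactly the sanity check the generalisation must pass.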
Now, it may make sense to use the definition of effective sample size in Sec. 3.2 to generalise the BIC directly, i.e. find the effective sample sizes first, since they depend only on the Fisher informations, and use them to propose a new BIC. For instance, consider
(6)
In (6), the concept of effective sample size is used to account for the different efficiencies of estimating different parameters, making it valid to compare them. Note that (6) levels the playing field for the parameters in the log-likelihood so that they do not need to be modified. Thus, effective sample sizes have a meaning something like the sample size required to make the estimation of one parameter (to a prescribed accuracy) close to the sample size required to estimate another parameter (to the same accuracy), a parallel to the appearance of the geometric mean in (5).
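One natural reading of such a criterion, an assumption on our part rather than necessarily the authors' exact formula (6), replaces the usual penalty $p\log n$ by a per-parameter sum of log effective sample sizes.

```python
import numpy as np

def bic_effective(loglik_hat, n_eff):
    # Hypothetical BIC variant: -2 * loglik + sum_i log(n_i), where the n_i
    # are per-parameter effective sample sizes.
    n_eff = np.asarray(n_eff, dtype=float)
    return -2.0 * loglik_hat + np.log(n_eff).sum()

# When every effective sample size equals n, the usual penalty p*log(n)
# is recovered; the numbers here are illustrative.
val = bic_effective(-120.0, [100.0, 100.0, 100.0])
usual_bic = -2.0 * (-120.0) + 3 * np.log(100.0)
```

A parameter estimated with a smaller effective sample size then incurs a smaller penalty, which is one way of levelling the playing field across parameters.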
At this point, one can go back to (3) and (4) and seek ways to justify using an effective sample size in place of n. Because (4) is nearly a product of univariate integrals, it may be possible to regard the elements on the diagonal of D as a form of the Fisher information that permits replacement of n with a parameter-specific effective sample size. Similarly, the geometric mean used in (5) may be related (by, say, a logarithm) to the ratios of sums of Fisher informations used to define the effective sample sizes in Sec. 3.2, thereby relating (5) and (6). Finally, (6) is not obviously related to $m(x^{n})$, but one can hope that a suitably reformulated Laplace's method on (3) and (4) may lead to a compatible expression for it.
One interesting query the authors are well-placed to answer is whether the results of Sec. 5.5 hold if the PBIC is replaced by (6). After all, there should be reasonable conditions under which all the effective sample sizes from Sec. 3.2 increase fast enough with n, e.g. proportionally to n.
3. Where to from here?
The authors have a very promising general definition in Sec. 3.2. Establishing a relationship between it and the effective sample size formulae proposed for linear models would be useful, but more fundamentally, the question is whether the definition from Sec. 3.2 makes sense in such simpler contexts. If it does, then the fact that it differs from ‘TESS’ may not be very important. We strongly agree with the authors who write, à propos of this definition, that it should ‘be viewed primarily as a starting point for future investigations of effective sample size’. (They actually limit this point to nonlinear models, but for the sake of a satisfying overall theory it should apply to linear models as well.)
Another tack is to be overtly information-theoretic by defining an effective sample size in terms of codelength. One form of the relative entropy, see Clarke and Barron (1988), is implicit in (2). However, one can use an analogous formulation to convert a putative sample of size n to an effective sample. Use a nonparametric estimator to form $\hat{p}_{n}$, an estimate of the density of X. Then, choose a ‘distortion rate’ r and find $\hat{p}_{m}$ for the smallest value of m that satisfies $D(\hat{p}_{n}\,\|\,\hat{p}_{m}) \leq r$, where $D(\cdot\,\|\,\cdot)$ is the relative entropy. This is the effective sample and sample size since it recreates the empirical density with a tolerable level of distortion. The larger r is, the more distortion is allowed and the smaller m will be. Information-theoretically, this is the same as approximating a Shannon code based on $\hat{p}_{n}$ by a Shannon code based on $\hat{p}_{m}$ in terms of small redundancy in, say, bits. This definition for effective sample size requires choosing r, but D is in bits, so it would make sense for r to be some function of bits per symbol, e.g. a fixed fraction of a bit per symbol as a default.
Another way to look at this procedure for finding an effective sample size is via data compression. In this context, the rate distortion function is a well-studied quantity; see Cover and Thomas (2006), Chap. 10. The problem is that it's not obvious how to obtain an effective sample size from the rate distortion function, or, in the parlance of information theory, a set of lower-dimensional canonical representatives that achieve the rate distortion function lower bound. On the other hand, this can be done in practice and further study may yield good solutions.
Finally, the rate distortion function is the result of an operation performed on a Shannon mutual information that, for parametric families, usually has an expression in terms of the Fisher information. Likewise, it is well known that certain relative entropies can be expressed in terms of Fisher information. So, the definitions of effective sample size from an information theory perspective (via rate distortion) and from Sec. 3.2 (via efficiency) may ultimately coincide.
Disclosure statement
No potential conflict of interest was reported by the author.
References
- Clarke, B., & Barron, A. (1988). Information-theoretic asymptotics of Bayes methods (Technical Report #26). Statistics Department, University of Illinois.
- Cover, T., & Thomas, J. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: John Wiley and Sons.