Statistical Theory and Related Fields Volume 3, 2019 - Issue 1

Discussion Paper and Discussions

A discussion of ‘prior-based Bayesian information criterion (PBIC)’

Jun Shao, Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA. Correspondence: [email protected]

Sheng Zhang, Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

Pages 19-21 | Received 10 Feb 2019, Accepted 12 Feb 2019, Published online: 06 Mar 2019

https://doi.org/10.1080/24754269.2019.1583086

Professors Bayarri, Berger, Jang, Ray, Pericchi, and Visser deserve special congratulations for their great work on Bayesian model and variable selection and for the pioneering idea of the prior-based Bayesian information criterion (PBIC). This work opens the door for contemporary advances in the difficult problem of model and variable selection.

There are three commonly used types of Bayesian approach. The first is based on information criteria, such as the well-known BIC; the newly proposed PBIC belongs to this category. The second type includes indicator model selection (see, e.g., Brown, Vannucci, & Fearn, 1998; Dellaportas, Forster, & Ntzoufras, 1997; George & McCulloch, 1993; Kuo & Mallick, 1998; Yuan & Lin, 2005), the stochastic search method (e.g., O'Hara & Sillanpää, 2009), and the model space method of Green (1995). The third type, which is considered in this discussion, places priors on the regression coefficients that promote shrinkage of the coefficients towards 0. This last type of approach is intrinsically connected with frequentist methods in two ways: first, such priors play the same role as the sparsity assumption on the coefficients in the frequentist approach; second, the Bayesian solution is, in some sense, equivalent to the corresponding frequentist counterpart with a certain penalty parameter. Typical papers of this type include Griffin and Brown (2009), Park and Casella (2008), and Kyung, Gilly, Ghosh, and Casella (2010).

The shrinkage prior approach does not, in general, provide sparse estimates of the regression coefficients, which can both complicate interpretation and inflate the statistical error of the analysis. Even without a well-defined variable selection approach, a Bayesian analysis based on a subset of covariates whose size is considerably smaller than the original dimensionality, referred to as sparse Bayesian analysis, could produce better results than a Bayesian analysis based on all covariates. Several attempts have been made to obtain sparse Bayesian estimates based on shrinkage priors. For instance, Hoti and Sillanpää (2006) proposed a method based on thresholding; however, the method relies on certain approximations, and the choice of the threshold is ad hoc. Another example is the sparse Bayesian learning of Tipping (2001), but it involves complicated nonconvex optimisation and assumes that the variance of the error term is known.

Under the framework of shrinkage priors, we consider a Bayesian variable selection method via a benchmark variable. The benchmark variable serves as a standard to measure the importance of each variable based on the posterior distribution of the corresponding coefficient.

As a first attempt, we focus on linear regression models with normally distributed errors. Let $y$ be an $n$-dimensional response vector and, without loss of generality, let $x_{1}, \dots, x_{p}$ be $p$ centralised $n$-dimensional vectors of predictors or covariates. Conditional on $X = (x_{1}, \dots, x_{p})$, $y$ is assumed to be multivariate normally distributed as $N (β_{0} 1 + X β, σ^{2} I)$, where $β$ is a $p$-dimensional column vector whose $j$th component is $β_{j}$, $β_{0}, β_{1}, \dots, β_{p}$ are $p + 1$ unknown parameters, $σ$ is an unknown positive parameter, $1$ is the $n$-dimensional vector with all components equal to 1, and $I$ is the identity matrix of order $n$.

Consider the following prior density on $β$ conditional on $σ^{2}$:

$$p (β | σ^{2}) = \prod_{i = 1}^{p} \frac{λ^{1 / δ}}{2^{1 / δ} Γ (1 / δ) σ} \exp \left( - 2 λ {\left| \frac{β_{i}}{2 σ} \right|}^{δ} \right), \tag{1}$$

where $λ > 0$ and $1 \leq δ \leq 2$ are hyper-parameters. When $δ = 1$, this is the Laplace prior considered by Park and Casella (2008) for their Bayesian Lasso. When $δ = 2$, the prior in (1) is a multivariate normal density and produces a posterior mode of $β$ equivalent to the ridge regression estimate. As the Laplace prior is ‘sharper’ than the Gaussian one, it is expected to yield sparser predictive models that are potentially easier to interpret, which is especially desirable for high-dimensional data with a large number of noisy variables. However, posterior inference with the Laplace prior involves relatively intensive computation.
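The two special cases of the prior can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names `log_prior`, `log_laplace` and `log_normal` and the test values are illustrative assumptions. It evaluates the logarithm of the prior (1) and compares it with a Laplace density of rate $λ / σ$ (for $δ = 1$) and a normal density of variance $σ^{2} / λ$ (for $δ = 2$).

```python
import math
import numpy as np

def log_prior(beta, sigma, lam, delta):
    """Log of the prior (1) for a coefficient vector beta, given sigma."""
    beta = np.asarray(beta, dtype=float)
    # Log of the per-coefficient constant lam^(1/delta) / (2^(1/delta) Gamma(1/delta) sigma).
    log_const = (math.log(lam) / delta
                 - math.log(2.0) / delta
                 - math.lgamma(1.0 / delta)
                 - math.log(sigma))
    kernel = -2.0 * lam * np.abs(beta / (2.0 * sigma)) ** delta
    return beta.size * log_const + kernel.sum()

def log_laplace(beta, sigma, lam):
    """Log density of independent Laplace variables with rate lam / sigma."""
    beta = np.asarray(beta, dtype=float)
    rate = lam / sigma
    return np.sum(np.log(rate / 2.0) - rate * np.abs(beta))

def log_normal(beta, sigma, lam):
    """Log density of independent N(0, sigma^2 / lam) variables."""
    beta = np.asarray(beta, dtype=float)
    var = sigma ** 2 / lam
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - beta ** 2 / (2.0 * var))

# Arbitrary illustrative values (assumptions, not from the paper).
beta = np.array([-1.3, 0.0, 0.4, 2.5])
sigma, lam = 1.7, 0.8

laplace_gap = abs(log_prior(beta, sigma, lam, 1.0) - log_laplace(beta, sigma, lam))
normal_gap = abs(log_prior(beta, sigma, lam, 2.0) - log_normal(beta, sigma, lam))
print(laplace_gap, normal_gap)
```

Both gaps should be zero up to floating-point error, which confirms that $λ / σ$ acts as the Laplace rate and $σ^{2} / λ$ as the normal prior variance.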

For $β_{0}$ and $σ^{2}$, which are not involved in variable selection, we use noninformative priors, so that the overall prior for all parameters is

$$p (β_{0}, β, σ^{2}) \propto \frac{1}{σ^{2}} p (β | σ^{2}).$$

If the posterior distribution of $β_{i}$ is nearly the same as that of the coefficient of a noise variable, centred at 0, then it is natural to eliminate $x_{i}$ as an unimportant covariate. The question, however, is how to quantify whether a posterior distribution is close to that of a noise variable.

To illustrate our idea, let us first consider an artificial case in which a covariate $z$ exists and is known to have no effect on $y$. For example, $y$ is distributed as $N (z β_{z} + 1 β_{0} + X β, σ^{2} I)$ with $β_{z} = 0$. Although we know that $z$ is redundant, we still put a prior on $β_{z}$ such that $β_{z}$ and the $β_{i}$'s are independent and identically distributed conditional on $σ^{2}$. Under this setting, $x_{i}$ can be treated as an unimportant variable if the posterior of $β_{i}$ is similar to the posterior of $β_{z}$. In other words, the variable $z$ serves as a benchmark for measuring the importance of the $x_{i}$'s.

A benchmark variable should have a posterior distribution centred at 0 and should not affect the Bayesian analysis concerning $β$ . The question is, how do we find a benchmark variable when we do not have a redundant variable at hand?
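To see why an arbitrary extra column can disturb the analysis of $β$, consider the normal prior ($δ = 2$). In that case, given $σ^{2}$, the posterior mean of the coefficients is the ridge-type solution $(W^{'} W + λ I)^{-1} W^{'} \tilde{y}$, where $W$ is the centralised design matrix and $\tilde{y}$ is the centralised response. The sketch below uses synthetic data and hypothetical names (`conditional_posterior_mean`, `z_noise`, `z_orth`), none of which come from the paper. It compares the $β$ part of this mean with and without an added column.

```python
import numpy as np

def conditional_posterior_mean(W, y_tilde, lam):
    """Posterior mean of the coefficients of W given sigma^2, normal prior (delta = 2)."""
    k = W.shape[1]
    return np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ y_tilde)

def centre(v):
    """Subtract the average of the components."""
    return v - v.mean()

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 1.0
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                       # centralised covariates
y = 1.0 + X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)
y_tilde = centre(y)

# A pure-noise column, and the same column with its projection onto X removed.
z_noise = rng.normal(size=n)
z_orth = z_noise - X @ np.linalg.solve(X.T @ X, X.T @ z_noise)

beta_alone = conditional_posterior_mean(X, y_tilde, lam)
beta_with_noise = conditional_posterior_mean(
    np.column_stack([X, centre(z_noise)]), y_tilde, lam)[:p]
beta_with_orth = conditional_posterior_mean(
    np.column_stack([X, centre(z_orth)]), y_tilde, lam)[:p]

print("X only:          ", beta_alone)
print("X + noise z:     ", beta_with_noise)
print("X + orthogonal z:", beta_with_orth)
```

Only the column orthogonal to $X$ leaves the $β$ part unchanged; a generic noise column shifts it slightly, even though its true coefficient is 0.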

We now show that there is a universal solution. Since $X$ is centralised column-wise, the density of $y$ given $(X, z, β_{0}, β, β_{z}, σ^{2})$ is

$$\begin{aligned} p (y | X, z, β_{0}, β, β_{z}, σ^{2}) &\propto \frac{1}{σ^{n}} \exp \left( - \frac{{\|y - z β_{z} - 1 β_{0} - X β\|}^{2}}{2 σ^{2}} \right) \\ &= \frac{1}{σ^{n}} \exp \left( - \frac{{\|\tilde{y} - X β\|}^{2} + {\|z - \bar{z} 1\|}^{2} β_{z}^{2} - 2 β_{z} z^{'} (\tilde{y} - X β) + n (β_{0} - \bar{y} + β_{z} \bar{z})^{2}}{2 σ^{2}} \right), \end{aligned}$$

where $\bar{y}$ is the average of the components of $y$, $\bar{z}$ is the average of the components of $z$, $\tilde{y} = y - \bar{y} 1$, ${\|a\|}^{2} = a^{'} a$, and $a^{'}$ is the transpose of $a$. Under the previously described prior, the joint conditional posterior distribution $p (β_{0}, β, β_{z} | X, z, y, σ^{2})$ can be obtained. Since the intercept $β_{0}$ is not of interest, we integrate it out of $p (β_{0}, β, β_{z} | X, z, y, σ^{2})$ and obtain the conditional posterior density of $(β, β_{z})$ given $(X, z, y, σ^{2})$:

$$\begin{aligned} p (β, β_{z} | X, z, y, σ^{2}) \propto{} & \frac{1}{σ^{n + 1}} \left[ p (β | σ^{2}) \exp \left( - \frac{{\|\tilde{y} - X β\|}^{2} + 2 β_{z} z^{'} X β}{2 σ^{2}} \right) \right] \\ & \times \left[ p (β_{z} | σ^{2}) \exp \left( - \frac{{\|z - \bar{z} 1\|}^{2} β_{z}^{2} - 2 z^{'} \tilde{y} β_{z}}{2 σ^{2}} \right) \right]. \end{aligned} \tag{2}$$

Note that marginalisation over $β_{0}$ is equivalent to centralising the response $y$. After the integration, the posterior inferences can be regarded as drawn from the centralised response $\tilde{y}$ instead of the original $y$. The reason we introduce $β_{0}$ in the model and then integrate it out, instead of eliminating it at the very beginning and directly building a linear regression model $\tilde{y} = z β_{z} + X β + ϵ$, is mainly mathematical rigour, as $\tilde{y}$ is not of full rank and has a degenerate distribution.
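The expansion of the squared norm in the likelihood can be verified numerically. The following sketch uses arbitrary synthetic values (assumptions for illustration, not from the paper) and checks that ${\|y - z β_{z} - 1 β_{0} - X β\|}^{2}$ equals the four-term sum in the exponent when the columns of $X$ are centralised.

```python
import numpy as np

# Arbitrary synthetic inputs (assumptions for illustration only).
rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)            # centralise the columns of X
y = rng.normal(size=n)
z = rng.normal(size=n)
beta = rng.normal(size=p)
beta0, beta_z = 0.7, -1.2
ones = np.ones(n)

y_bar, z_bar = y.mean(), z.mean()
y_tilde = y - y_bar * ones        # centralised response
resid = y_tilde - X @ beta        # y_tilde - X beta

# Left-hand side: the squared norm as it first appears in the likelihood.
lhs = np.sum((y - z * beta_z - ones * beta0 - X @ beta) ** 2)

# Right-hand side: the four terms of the expanded exponent.
rhs = (resid @ resid
       + np.sum((z - z_bar * ones) ** 2) * beta_z ** 2
       - 2.0 * beta_z * (z @ resid)
       + n * (beta0 - y_bar + beta_z * z_bar) ** 2)

print(lhs, rhs)
```

Only the last term involves $β_{0}$, and it is a Gaussian kernel in $β_{0}$; integrating it out contributes a factor proportional to $σ$, which together with the $1 / σ^{2}$ prior turns $1 / σ^{n}$ into the $1 / σ^{n + 1}$ of (2).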

The conditional posterior density in (2) implies that, given $(y, X, z, σ^{2})$, $β$ and $β_{z}$ are independent if and only if $z^{'} X = 0$. In other words, $z$ does not affect the posterior of $β$ if and only if $z$ is orthogonal to all the $x_{i}$'s, $i = 1, \dots, p$. Meanwhile, the posterior of $β_{z}$ is centred at 0 if and only if $z^{'} \tilde{y} = 0$. Is there a $z$ orthogonal to $(X, \tilde{y})$? Clearly, $z = 1$, the column vector of ones, is a direct solution, since both $X$ and $\tilde{y}$ are centralised, and it can be used as a benchmark to assess the importance of the $x_{i}$'s. When $z = 1$, we also have $z - \bar{z} 1 = 0$, so the posterior density of $β_{z}$ remains the same as its prior, and the posterior density of $(β, β_{z}, σ^{2})$ simplifies to

$$p (β, β_{z}, σ^{2} | X, y) \propto \frac{1}{σ^{n + 1}} p (β_{z} | σ^{2}) p (β | σ^{2}) \exp \left( - \frac{{\|\tilde{y} - X β\|}^{2}}{2 σ^{2}} \right). \tag{3}$$

The benchmark serves as a measure of the importance of each covariate and therefore provides guidance on variable selection. How to carry out variable selection using the posterior (3), or how to extend the idea to more general settings, requires more research. In the rest of this discussion, we consider a real data example.
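Before turning to the data, here is a minimal sketch of how posterior samples of the covariate coefficients and of the benchmark coefficient could be drawn. It is not the authors' computation: it assumes the normal prior ($δ = 2$), for which the full conditionals are standard (normal for $β$ and $β_{z}$, inverse gamma for $σ^{2}$), a fixed $λ$, and synthetic data. The function name `gibbs_benchmark` and all numerical settings are hypothetical.

```python
import numpy as np

def gibbs_benchmark(X, y, lam=1.0, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for the posterior with benchmark z = 1, normal prior (delta = 2).

    Hypothetical helper, not from the paper. Returns posterior draws of
    beta (one column per covariate), beta_z (benchmark) and sigma^2.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centralised covariates
    y_tilde = y - y.mean()             # centralised response

    # Given sigma^2: beta ~ N(A^{-1} X' y_tilde, sigma^2 A^{-1}), A = X'X + lam I.
    A_inv = np.linalg.inv(Xc.T @ Xc + lam * np.eye(p))
    A_inv = 0.5 * (A_inv + A_inv.T)    # symmetrise against round-off
    beta_mean = A_inv @ (Xc.T @ y_tilde)
    chol = np.linalg.cholesky(A_inv)

    sigma2 = 1.0
    keep = n_iter - burn_in
    beta_draws = np.empty((keep, p))
    bz_draws = np.empty(keep)
    s2_draws = np.empty(keep)
    for it in range(n_iter):
        sd = np.sqrt(sigma2)
        beta = beta_mean + sd * (chol @ rng.normal(size=p))
        # With z = 1 the benchmark coefficient keeps its prior N(0, sigma^2 / lam).
        beta_z = rng.normal(scale=sd / np.sqrt(lam))
        # sigma^2 | beta, beta_z is inverse gamma with shape (n + p) / 2.
        resid = y_tilde - Xc @ beta
        rate = 0.5 * (resid @ resid + lam * (beta @ beta) + lam * beta_z ** 2)
        sigma2 = rate / rng.gamma(shape=0.5 * (n + p))
        if it >= burn_in:
            k = it - burn_in
            beta_draws[k], bz_draws[k], s2_draws[k] = beta, beta_z, sigma2
    return beta_draws, bz_draws, s2_draws

# Synthetic illustration (assumed data, not the prostate cancer data).
rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, 0.0, 0.0, -1.5])
y = 0.5 + X @ true_beta + rng.normal(size=n)

beta_draws, bz_draws, s2_draws = gibbs_benchmark(X, y)
print("benchmark quartiles:", np.percentile(bz_draws, [25, 50, 75]))
for j in range(p):
    print(f"beta_{j + 1} quartiles:", np.percentile(beta_draws[:, j], [25, 50, 75]))
```

The printed quartiles are the ingredients of side-by-side box plots of the benchmark and covariate posteriors. The Laplace prior ($δ = 1$) would need a different sampler, for example the scale-mixture representation used by Park and Casella (2008).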

The prostate cancer data originally came from a study conducted by Stamey et al. (1989) and were later analysed by Tibshirani (1996) and Zou and Hastie (2005). The goal of the study was to explore the relation between the level of prostate-specific antigen and several clinical measures in men before their hospitalisation for radical prostatectomy. The data frame contains 97 observations and 9 variables. The response is the logarithm of prostate-specific antigen (lpsa), while the 8 covariates are the logarithm of cancer volume (lcavol), the logarithm of prostate weight (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), the logarithm of capsular penetration (lcp), Gleason score (gleason) and the percentage of Gleason scores 4 or 5 (pgg45).

Figure 1 visualises the posteriors with the Laplace prior ($δ = 1$). Results with the normal prior ($δ = 2$) are similar and omitted. In Figure 1, the leftmost box plot is based on the posterior samples of the coefficient of the benchmark $z = 1$. It is distributed almost symmetrically around 0, as expected. The other box plots represent the posterior distributions of the coefficients associated with the 8 covariates. The three posteriors plotted at the far right of Figure 1 are clearly different from the posterior of the benchmark and, hence, we conclude that the corresponding three covariates, svi, lweight, and lcavol, are useful for the response. On the other hand, the posteriors of the three covariates next to the benchmark in Figure 1 are not different from the benchmark posterior and, hence, the covariates pgg45, lcp, and gleason are not useful. The posteriors of lbph and age are only marginally different from that of the benchmark, and we still consider them not to be useful covariates.

Figure 1. Posterior plots on the prostate cancer data.

Figure 1 also includes the Lasso and Bayesian Lasso estimates of each coefficient, marked as circles and squares, respectively. The Lasso estimates are zero for pgg45, lcp, and age, and nonzero for the other 5 covariates. Thus, the Lasso approach agrees with our approach for the covariates pgg45, lcp, age, svi, lweight, and lcavol, but does not agree on gleason and lbph. Since the magnitudes of the Lasso estimates for gleason and lbph are small, adding a thresholding step to the Lasso would lead to the same conclusion as ours. Meanwhile, the Bayesian Lasso gives nonzero estimates for all the coefficients, as it does not select variables to promote model sparsity.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

Brown, P. J., Vannucci, M., & Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society, Series B, 60, 627–641. doi: 10.1111/1467-9868.00144
Dellaportas, P., Forster, J. J., & Ntzoufras, I. (1997). On Bayesian model and variable selection using MCMC (Technical Report). Athens: Department of Statistics, Athens University of Economics and Business.
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88, 881–889.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732. doi: 10.1093/biomet/82.4.711
Griffin, J. E., & Brown, P. J. (2009). Inference with Normal-Gamma prior distributions in regression problems (Technical Report). Institute of Mathematics, Statistics and Actuarial Science, University of Kent.
Hoti, F., & Sillanpää, M. J. (2006). Bayesian mapping of genotype x expression interactions in quantitative and qualitative traits. Heredity, 97, 4–18. doi: 10.1038/sj.hdy.6800817
Kuo, L., & Mallick, B. (1998). Variable selection for regression models. Sankhya Series B, 60, 65–81.
Kyung, M., Gilly, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors, and Bayesian Lassos. Bayesian Analysis, 5, 369–412. doi: 10.1214/10-BA607
O'Hara, R. B., & Sillanpää, M. J. (2009). Review of Bayesian variable selection methods: What, how and which. Bayesian Analysis, 4, 85–118. doi: 10.1214/09-BA403
Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103, 681–686. doi: 10.1198/016214508000000337
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., & Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. Journal of Urology, 141, 1076–1083. doi: 10.1016/S0022-5347(17)41175-X
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Yuan, M., & Lin, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association, 100, 1215–1225. doi: 10.1198/016214505000000367
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320. doi: 10.1111/j.1467-9868.2005.00503.x
