Review

Variational Inference: A Review for Statisticians

David M. Blei, Alp Kucukelbir & Jon D. McAuliffe
Pages 859-877 | Received 01 Jan 2016, Published online: 13 Jul 2017
 

ABSTRACT

One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find a member of that family which is close to the target density. Closeness is measured by Kullback–Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this article is to catalyze statistical research on this class of algorithms. Supplementary materials for this article are available online.
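To make the optimization described in the abstract concrete, the standard formulation can be written as follows. The notation here (latent variables z, observations x, variational family Q) follows common convention and is a sketch of the general setup rather than a quotation from the article:

q^{*}(z) = \operatorname*{arg\,min}_{q(z) \in \mathcal{Q}} \, \mathrm{KL}\big( q(z) \,\|\, p(z \mid x) \big).

Because this KL divergence contains the intractable evidence \log p(x), the optimization is carried out in practice by maximizing the evidence lower bound (ELBO),

\mathrm{ELBO}(q) = \mathbb{E}_{q}\big[ \log p(z, x) \big] - \mathbb{E}_{q}\big[ \log q(z) \big] = \log p(x) - \mathrm{KL}\big( q(z) \,\|\, p(z \mid x) \big),

so that maximizing the ELBO over the family Q is equivalent to minimizing the KL divergence to the posterior.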

Supplementary Materials

The online supplementary materials contain the appendices for the article.

Notes

1 We focus here on KL(q||p)-based optimization, also called Kullback–Leibler variational inference (Barber 2012). Wainwright and Jordan (2008) emphasized that any procedure that uses optimization to approximate a density can be termed “variational inference.” This includes methods like expectation propagation (Minka 2001), belief propagation (Yedidia, Freeman, and Weiss 2001), or even the Laplace approximation. We briefly discuss alternative divergence measures in Section 5.

2 The KL divergence is an information-theoretical measure of proximity between two densities. It is asymmetric—that is, KL(q ∥ p) ≠ KL(p ∥ q)—and nonnegative. It is minimized when q( · ) = p( · ).
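For reference, the divergence in question can be written in standard notation (this display is ours, not the article's) as

\mathrm{KL}\big( q \,\|\, p \big) = \mathbb{E}_{q}\!\left[ \log \frac{q(Z)}{p(Z)} \right] = \int q(z) \, \log \frac{q(z)}{p(z)} \, dz \;\ge\; 0,

with equality exactly when q = p almost everywhere; swapping the arguments gives a different quantity, which is the asymmetry noted above.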

3 Two notes: (a) Variational EM is the EM algorithm with a variational E-step, that is, a computation of an approximate conditional. (b) The coordinate ascent algorithm of Section 2.4 resembles the EM algorithm. The “E step” computes approximate conditionals of local latent variables; the “M step” computes a conditional of the global latent variables.
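To illustrate the local/global update structure described in (b), here is a minimal coordinate ascent sketch for a simplified Bayesian mixture of Gaussians (unit-variance components, uniform mixture weights, Gaussian prior on the component means). The model choices, variable names, and toy data below are our own illustrative assumptions, not code from the article or its supplement.

```python
import numpy as np


def cavi_gmm(x, K, sigma2=10.0, n_iters=100, seed=0):
    """Coordinate ascent VI for a simplified Bayesian mixture of Gaussians.

    Model (an illustrative simplification): mu_k ~ N(0, sigma2),
    c_i ~ Categorical(1/K, ..., 1/K), x_i | c_i, mu ~ N(mu_{c_i}, 1).
    Mean-field family: q(mu_k) = N(m_k, s2_k), q(c_i) = Categorical(phi_i).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    m = rng.normal(size=K)                    # means of q(mu_k)
    s2 = np.ones(K)                           # variances of q(mu_k)
    phi = rng.dirichlet(np.ones(K), size=n)   # responsibilities q(c_i)

    for _ in range(n_iters):
        # Local ("E step") update: q(c_i) given the current q(mu_k).
        log_phi = np.outer(x, m) - 0.5 * (s2 + m**2)     # shape (n, K)
        log_phi -= log_phi.max(axis=1, keepdims=True)    # stabilize exp
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        # Global ("M step") update: q(mu_k) given the current q(c_i).
        Nk = phi.sum(axis=0)
        s2 = 1.0 / (1.0 / sigma2 + Nk)
        m = s2 * (phi * x[:, None]).sum(axis=0)

    return m, s2, phi


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_means = np.array([-3.0, 0.0, 3.0])
    assignments = rng.integers(0, 3, size=600)
    x = rng.normal(true_means[assignments], 1.0)
    m, s2, phi = cavi_gmm(x, K=3)
    print("approximate posterior means of the components:", np.sort(m))
```

The inner loop alternates the two steps the note describes: the responsibilities phi play the role of the local ("E step") factors, while the Gaussian factors (m, s2) over the component means play the role of the global ("M step") factors.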

4 Many readers will know that we can significantly speed up the Gibbs sampler by marginalizing out some of the latent variables; this is called collapsed Gibbs sampling. We can speed up variational inference with similar reasoning; this is called collapsed variational inference. It has been developed for the same class of models described here (Sung, Ghahramani, and Bang 2008; Hensman, Rattray, and Lawrence 2012). These ideas are outside the scope of our review.

5 This is not a definitive comparison between variational inference and MCMC. Other samplers, such as a collapsed Gibbs sampler, may perform better than Hamiltonian Monte Carlo sampling.
