ABSTRACT
One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find a member of that family which is close to the target density. Closeness is measured by Kullback–Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this article is to catalyze statistical research on this class of algorithms. Supplementary materials for this article are available online.
Supplementary Materials
The online supplementary materials contain the appendices for the article.
Notes
1 We focus here on KL(q∥p)-based optimization, also called Kullback–Leibler variational inference (Barber Citation2012). Wainwright and Jordan (Citation2008) emphasized that any procedure that uses optimization to approximate a density can be termed “variational inference.” This includes methods like expectation propagation (Minka Citation2001), belief propagation (Yedidia, Freeman, and Weiss Citation2001), or even the Laplace approximation. We briefly discuss alternative divergence measures in Section 5.
2 The KL divergence is an information-theoretic measure of proximity between two densities. It is asymmetric—that is, KL(q∥p) ≠ KL(p∥q)—and nonnegative. It is minimized when q(·) = p(·).
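These two properties are easy to check numerically. The following is a minimal sketch (not from the article) that computes the KL divergence between two toy discrete densities q and p; the particular densities are arbitrary illustrative choices:

```python
import math

def kl(q, p):
    """KL divergence KL(q || p) between two discrete densities given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [0.5, 0.5]   # toy density q
p = [0.9, 0.1]   # toy density p

print(kl(q, p))  # ~0.511
print(kl(p, q))  # ~0.368 -- the two directions differ: KL is asymmetric
print(kl(q, q))  # 0.0    -- nonnegative, and zero exactly when q(.) = p(.)
```

The asymmetry is why the choice of direction matters: variational inference minimizes KL(q∥p), which behaves differently from KL(p∥q) (the quantity targeted by expectation propagation).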
3 Two notes: (a) Variational EM is the EM algorithm with a variational E-step, that is, a computation of an approximate conditional. (b) The coordinate ascent algorithm of Section 2.4 resembles the EM algorithm. The “E step” computes approximate conditionals of local latent variables; the “M step” computes a conditional of the global latent variables.
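The EM-like alternation in (b) can be sketched on a toy problem. The snippet below (a hypothetical illustration, not from the article) runs coordinate-ascent updates for a mean-field approximation q(x)q(y) to a bivariate Gaussian target with correlation rho and unit marginal variances; in this conjugate case each coordinate update for a factor mean has a closed form, namely the conditional mean of that coordinate evaluated at the other factor's current mean:

```python
# Toy target: bivariate Gaussian with means mu, unit variances, correlation rho.
mu = (1.0, -2.0)
rho = 0.8

# Mean-field factor means, initialized away from the truth.
m_x, m_y = 0.0, 0.0

for _ in range(100):
    # Update q(x) holding q(y) fixed: optimal factor mean given current m_y.
    m_x = mu[0] + rho * (m_y - mu[1])
    # Update q(y) holding q(x) fixed: optimal factor mean given current m_x.
    m_y = mu[1] + rho * (m_x - mu[0])

# The factor means converge to the true marginal means (1.0, -2.0).
```

Each sweep shrinks the error in the factor means by a factor of rho squared, so the iteration converges geometrically; this alternation of closed-form conditional updates is what gives the coordinate ascent algorithm its EM-like flavor.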
4 Many readers will know that we can significantly speed up the Gibbs sampler by marginalizing out some of the latent variables; this is called collapsed Gibbs sampling. We can speed up variational inference with similar reasoning; this is called collapsed variational inference. It has been developed for the same class of models described here (Sung, Ghahramani, and Bang Citation2008; Hensman, Rattray, and Lawrence Citation2012). These ideas are outside the scope of our review.
5 This is not a definitive comparison between variational inference and MCMC. Other samplers, such as a collapsed Gibbs sampler, may perform better than Hamiltonian Monte Carlo sampling.