Abstract
Latent Dirichlet allocation (LDA) is a widely used Bayesian hierarchical model in machine learning for modeling high-dimensional sparse count data, for example, text documents. As a Bayesian model, it places a prior on a set of latent variables. The prior is indexed by hyperparameters, which strongly influence inference under the model. The ideal estimate of the hyperparameters is the empirical Bayes estimate, which is, by definition, the maximizer of the marginal likelihood of the data with all the latent variables integrated out. This estimate cannot be obtained analytically. In practice, the hyperparameters are chosen either in an ad hoc manner or through variants of the EM algorithm whose theoretical basis is weak. We propose an MCMC-based fully Bayesian method for obtaining the empirical Bayes estimate of the hyperparameters. We compare our method with other existing approaches on both synthetic and real data. The comparative experiments demonstrate that the LDA model with hyperparameters specified by our method outperforms models with hyperparameters estimated by other methods. Supplementary materials for this article are available online.
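To fix ideas, the empirical Bayes estimate described above can be sketched in standard LDA notation (the symbols below are assumptions, not notation fixed by the abstract: w denotes the observed words, z the latent topic assignments, \theta the document-topic proportions, \beta the topic-word distributions, and (\alpha, \eta) the hyperparameters):

```latex
% Hedged sketch in assumed notation; the abstract does not fix these symbols.
(\hat\alpha, \hat\eta)
  = \operatorname*{arg\,max}_{(\alpha,\eta)} \; m(w \mid \alpha, \eta),
\qquad
m(w \mid \alpha, \eta)
  = \sum_{z} \iint p(w, z, \theta, \beta \mid \alpha, \eta)\, d\theta \, d\beta ,
```

where the sum and integrals carry out the integration over all latent variables, so that m(w | \alpha, \eta) is the marginal likelihood whose maximizer has no closed form.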
Supplementary Materials
R code and data The supplemental files for this article include the R code and data needed to reproduce all the empirical studies in the article. The Readme file contained in the zip archive describes all the other files in the archive. (shs-lda-code.zip, zip archive)
Appendix The supplemental files include a document containing the following: (i) a review of Hamiltonian Monte Carlo, (ii) a section demonstrating the feasibility of our method on large corpora, (iii) proofs of Theorems 1–3, and (iv) some minor theoretical details. (shs-lda-supp.pdf)
Acknowledgments
We are grateful to the two referees for their helpful constructive criticism.