Bayesian and Latent Variable Models

Scalable Hyperparameter Selection for Latent Dirichlet Allocation

Pages 875-895 | Received 20 Dec 2018, Accepted 02 Dec 2019, Published online: 15 May 2020
 

Abstract

Latent Dirichlet allocation (LDA) is a Bayesian hierarchical model widely used in machine learning for modeling high-dimensional sparse count data, for example, text documents. As a Bayesian model, it incorporates a prior on a set of latent variables. The prior is indexed by some hyperparameters, which strongly affect inference regarding the model. The ideal estimate of the hyperparameters is the empirical Bayes estimate which is, by definition, the maximizer of the marginal likelihood of the data with all the latent variables integrated out. This estimate cannot be obtained analytically. In practice, the hyperparameters are chosen either in an ad hoc manner or through variants of the EM algorithm for which the theoretical basis is weak. We propose an MCMC-based fully Bayesian method for obtaining the empirical Bayes estimate of the hyperparameter. We compare our method with existing approaches on both synthetic and real data. The comparative experiments demonstrate that the LDA model with hyperparameters specified by our method outperforms models with hyperparameters estimated by other methods. Supplementary materials for this article are available online.
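
To make the definition of the empirical Bayes estimate concrete, the following is a minimal sketch on a much simpler model than LDA: a symmetric Dirichlet-multinomial, where the marginal likelihood has a closed form and can be maximized over a grid. This is not the MCMC-based method of the article; in LDA the marginal likelihood is intractable, which is the difficulty the article addresses. The function names `log_marginal` and `empirical_bayes_alpha`, the toy counts, and the grid are all assumptions made for illustration.

```python
import math


def log_marginal(counts, alpha):
    """Log marginal likelihood of count vectors under a symmetric Dirichlet(alpha) prior.

    Each row of `counts` is treated as an independent Dirichlet-multinomial
    observation. The multinomial coefficient is omitted because it does not
    depend on alpha and so does not affect the maximizer.
    """
    total = 0.0
    for row in counts:
        k = len(row)   # number of categories
        n = sum(row)   # number of observations in this row
        total += math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
        total += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in row)
    return total


def empirical_bayes_alpha(counts, grid):
    """Return the grid value of alpha that maximizes the marginal likelihood."""
    return max(grid, key=lambda a: log_marginal(counts, a))


# Hypothetical toy data: three sparse count vectors over three categories.
counts = [[9, 0, 1], [0, 10, 0], [1, 0, 9]]
# Hypothetical log-spaced grid of candidate hyperparameter values.
grid = [0.01 * 2 ** i for i in range(12)]
alpha_hat = empirical_bayes_alpha(counts, grid)
print(alpha_hat)
```

The grid search stands in for whatever optimizer one prefers; it is feasible here only because the latent probability vectors integrate out analytically in this toy model.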

Supplementary Materials

R code and data: The supplemental files for this article include R code and data for reproducing all the empirical studies in the article. The Readme file contained in the zip file describes all the other files in the archive. (shs-lda-code.zip, zip archive)

Appendix: The supplemental files include a document which gives the following: (i) a review of Hamiltonian Monte Carlo, (ii) a section showing feasibility of our method on large corpora, (iii) proofs of Theorems 1–3, and (iv) some minor theoretical details. (shs-lda-supp.pdf)

Acknowledgments

We are grateful to the two referees for their helpful and constructive criticism.

Additional information

Funding

Research supported by NSF grants DIIS-17-24174 and DMS-1854476.
