Applications and Case Studies

Improving and Evaluating Topic Models and Other Models of Text

Pages 1381-1403 | Received 01 Sep 2012, Published online: 04 Jan 2017
 

ABSTRACT

An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively in terms of topics. The current practice of parameterizing the themes in terms of most frequent words limits interpretability by ignoring the differential use of words across topics. Here, we show that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We consider a supervised setting where professional editors have annotated documents to topic categories, organized into a tree, in which leaf-nodes correspond to more specific topics. Each document is annotated to multiple categories, at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents. A parallelized Hamiltonian Monte Carlo sampler allows the inference to scale to millions of documents. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. In an unsupervised setting, we then consider a simplified version of the model that shares the same regularization scheme with the previous model. We carry out a large randomized experiment on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than the currently established models.
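The contrast between frequency-only summaries and summaries that also reward exclusivity can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy vocabulary and rates, the definition of exclusivity as a word's share of its total rate across topics, and the weighted harmonic mean of within-topic ranks used to combine the two quantities. It is not the hierarchical Poisson convolution model or its regularization scheme.

```python
import numpy as np

# Hypothetical toy rates: rows are topics, columns are words.
vocab = ["the", "market", "goal", "shares", "match"]
rates = np.array([
    [50.0, 20.0, 0.5, 15.0, 0.4],   # topic 0, loosely "finance"
    [50.0, 1.0, 18.0, 0.5, 12.0],   # topic 1, loosely "sports"
])

def exclusivity(rates):
    """Share of each word's total rate (over topics) held by each topic."""
    return rates / rates.sum(axis=0, keepdims=True)

def ecdf_rank(x):
    """Within-topic empirical CDF value of each entry, in (0, 1]."""
    ranks = x.argsort(axis=1).argsort(axis=1) + 1
    return ranks / x.shape[1]

def frequency_exclusivity_score(rates, weight=0.5):
    """Weighted harmonic mean of exclusivity and frequency ranks (assumed form)."""
    f = ecdf_rank(rates)
    e = ecdf_rank(exclusivity(rates))
    return 1.0 / (weight / e + (1.0 - weight) / f)

# Top word per topic under each summary.
top_by_frequency = [vocab[i] for i in rates.argmax(axis=1)]
top_by_score = [vocab[i] for i in frequency_exclusivity_score(rates).argmax(axis=1)]

print(top_by_frequency)
print(top_by_score)
```

In this toy setup the most frequent word is the same uninformative word in both topics, while the combined score picks a different, topic-specific word for each.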

Funding

This work was partially supported by the National Science Foundation under grants CAREER IIS-1149662 and IIS-1409177, by the Army Research Office grant MURI W911NF-11-1-0036, and by the Office of Naval Research under grant YIP N00014-14-1-0485. Edoardo M. Airoldi is an Alfred P. Sloan Research Fellow, and a Shutzer Fellow at the Radcliffe Institute for Advanced Study.

Notes

1 This is where the model’s name arises: the observed feature count in each document is the convolution of (unobserved) topic-specific Poisson variates.

2 In practice this precision matrix can be found easily as the negative Hessian of the log-prior distribution.

3 Available upon request from the National Institute of Standards and Technology (NIST), http://trec.nist.gov/data/reuters/reuters.html.

4 Including rarer features did not meaningfully change the results.
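The convolution described in Note 1 can be checked numerically with a minimal sketch. The three topic rates and the number of simulated documents are hypothetical values chosen for illustration, and the rates are held fixed across documents, which the full model does not assume. The sketch relies only on the standard fact that a sum of independent Poisson variates is itself Poisson, with rate equal to the sum of the individual rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rates at which three topics generate one feature in a document.
topic_rates = np.array([0.5, 2.0, 3.5])
n_docs = 200_000

# Unobserved topic-specific Poisson counts, one column per topic.
latent_counts = rng.poisson(lam=topic_rates, size=(n_docs, topic_rates.size))

# The observed feature count is their sum, i.e. a draw from the convolution.
observed = latent_counts.sum(axis=1)

# For a Poisson variate the mean and variance both equal the rate,
# so both sample moments should be close to the summed rate.
total_rate = topic_rates.sum()
sample_mean = observed.mean()
sample_var = observed.var()

print(total_rate, sample_mean, sample_var)
```

Only the summed count is observed in a document; the per-topic columns play the role of the latent variates that the note refers to.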
