ABSTRACT
An ongoing challenge in the analysis of document collections is how to summarize content as a set of inferred themes that can be interpreted substantively as topics. The current practice of parameterizing themes by their most frequent words limits interpretability by ignoring the differential use of words across topics. Here, we show that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We consider a supervised setting where professional editors have annotated documents with topic categories, organized into a tree in which leaf nodes correspond to more specific topics. Each document is annotated with multiple categories at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents. A parallelized Hamiltonian Monte Carlo sampler allows the inference to scale to millions of documents. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. In an unsupervised setting, we then consider a simplified version of the model that shares the regularization scheme of the supervised model. We carry out a large randomized experiment on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than the currently established models.
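To make the contrast between frequency-only and frequency-plus-exclusivity summaries concrete, the following minimal sketch ranks words for one topic both ways. It is an illustration under stated assumptions, not the model described above: the usage rates are invented, exclusivity is taken to be a word's rate in the topic divided by its total rate across topics, and the two quantities are combined through a weighted harmonic mean of their empirical-CDF ranks. The function names and the `weight` parameter are hypothetical.

```python
def ecdf_ranks(values):
    """Empirical CDF rank: fraction of entries less than or equal to each value."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def frequency_exclusivity_scores(rates, topic, weight=0.5):
    """Score each word for `topic` (hypothetical helper, not the paper's estimator).

    rates[word] is a list of positive per-topic usage rates.
    weight is the share given to exclusivity in the harmonic mean (assumed 0.5).
    """
    words = list(rates)
    freq = [rates[w][topic] for w in words]
    # Assumed definition of exclusivity: share of the word's total rate
    # that is attributable to this topic.
    excl = [rates[w][topic] / sum(rates[w]) for w in words]
    f_rank = ecdf_ranks(freq)
    e_rank = ecdf_ranks(excl)
    # Weighted harmonic mean of the two ranks; ranks are at least 1/n, so no
    # division by zero occurs.
    return {
        w: 1.0 / (weight / e + (1.0 - weight) / f)
        for w, e, f in zip(words, e_rank, f_rank)
    }


# Invented rates for three topics; topic 0 is loosely "finance".
rates = {
    "the": [50.0, 48.0, 52.0],
    "stock": [9.0, 0.5, 0.5],
    "equity": [2.0, 0.1, 0.1],
    "goal": [0.2, 8.0, 0.3],
}

scores = frequency_exclusivity_scores(rates, topic=0)
top_by_frequency = max(rates, key=lambda w: rates[w][0])
top_by_score = max(scores, key=scores.get)
print(top_by_frequency, top_by_score)
```

In this toy example the frequency-only ranking puts a function word first, while the combined score favors a word that is both common in the topic and rare elsewhere. The harmonic mean is used here because it penalizes a word that scores poorly on either dimension.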
Funding
This work was partially supported by the National Science Foundation under grants CAREER IIS-1149662 and IIS-1409177, by the Army Research Office under grant MURI W911NF-11-1-0036, and by the Office of Naval Research under grant YIP N00014-14-1-0485. Edoardo M. Airoldi is an Alfred P. Sloan Research Fellow and a Shutzer Fellow at the Radcliffe Institute for Advanced Study.
Notes
1 This is the origin of the model’s name: the observed feature count in each document is a sum of (unobserved) topic-specific Poisson variates, so its distribution is the convolution of their distributions.
2 In practice this precision matrix can be found easily as the negative Hessian of the log-prior distribution.
3 Available upon request from the National Institute of Standards and Technology (NIST), http://trec.nist.gov/data/reuters/reuters.html.
4 Including rarer features did not meaningfully change the results.
5 This list is available at www.jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop.
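As an illustration of note 1, the sketch below checks numerically that convolving topic-specific Poisson distributions gives the distribution of the observed count, which is again Poisson with the summed rate. The rates and the truncation point are hypothetical values chosen for the example and are not taken from the model.

```python
import math


def poisson_pmf(k, rate):
    """Probability that a Poisson(rate) variate equals k."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def convolve_pmfs(p, q):
    """Discrete convolution: PMF of the sum of two independent counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


# Hypothetical topic-specific rates for one word in one document.
topic_rates = [0.5, 1.2, 2.0]
max_count = 30  # truncate each PMF at this count

# Start from a point mass at zero (the PMF of an empty sum) and fold in
# one topic-specific Poisson PMF at a time.
pmf_total = [1.0]
for rate in topic_rates:
    topic_pmf = [poisson_pmf(k, rate) for k in range(max_count + 1)]
    pmf_total = convolve_pmfs(pmf_total, topic_pmf)

# Poisson PMF with the summed rate, for comparison on counts 0..max_count.
direct = [poisson_pmf(k, sum(topic_rates)) for k in range(max_count + 1)]
max_abs_diff = max(abs(a - b) for a, b in zip(pmf_total, direct))
print(max_abs_diff)
```

Truncation does not affect the comparison for counts up to `max_count`, because every term that contributes to those entries involves component counts within the truncated range.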