References
- Adams, R. P., Ghahramani, Z., and Jordan, M. I. (2010), “Tree-Structured Stick Breaking for Hierarchical Data,” in Advances in Neural Information Processing Systems (NIPS) 23, pp. 19–27.
- Airoldi, E. M., Anderson, A. G., Fienberg, S. E., and Skinner, K. K. (2006), “Who Wrote Ronald Reagan’s Radio Addresses?” Bayesian Analysis, 1, 289–320.
- Airoldi, E. M., Blei, D. M., Erosheva, E. A., and Fienberg, S. E. (eds.) (2014), Handbook of Mixed Membership Models and Their Applications, Boca Raton, FL: Chapman & Hall/CRC Press.
- Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008), “Mixed-Membership Stochastic Blockmodels,” Journal of Machine Learning Research, 9, 1981–2014.
- Airoldi, E. M., Erosheva, E. A., Fienberg, S. E., Joutard, C. J., Love, T. M., and Shringarpure, S. (2010), “Reconceptualizing the Classification of PNAS Articles,” Proceedings of the National Academy of Sciences, 107, 20899–20904.
- Airoldi, E. M., Fienberg, S. E., and Skinner, K. K. (2007a), “Whose Ideas? Whose Words? Authorship of the Ronald Reagan Radio Addresses,” Political Science & Politics, 40, 501–506.
- Airoldi, E. M., Fienberg, S. E., and Xing, E. P. (2007b), “Mixed Membership Analysis of Genome-Wide Expression Studies—Attribute Data,” arXiv:0711.2520.
- Aletras, N., and Stevenson, M. (2013), “Evaluating Topic Coherence Using Distributional Semantics,” in Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), pp. 13–22.
- Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.
- Bakalov, A., McCallum, A., Wallach, H., and Mimno, D. (2012), “Topic Models for Taxonomies,” in Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 237–240.
- Blei, D. (2012), “Introduction to Probabilistic Topic Models,” Communications of the ACM, 55, 77–84.
- Blei, D., Griffiths, T., Jordan, M., and Tenenbaum, J. (2003), “Hierarchical Topic Models and the Nested Chinese Restaurant Process,” in Advances in Neural Information Processing Systems (NIPS) 16, Cambridge, MA: MIT Press, pp. 17–24.
- Blei, D., and McAuliffe, J. (2010), “Supervised Topic Models,” arXiv:1003.0783.
- Blei, D., Ng, A., and Jordan, M. (2003), “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 3, 993–1022.
- Breiman, L. (2001), “Statistical Modeling: The Two Cultures,” Statistical Science, 16, 199–231.
- Buntine, W., and Jakulin, A. (2006), “Discrete Components Analysis,” in Subspace, Latent Structure and Feature Selection, volume 3940 of Lecture Notes in Computer Science, Berlin: Springer, pp. 1–33.
- Canny, J. (2004), “GAP: A Factor Model for Discrete Data,” in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 122–129.
- Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., and Blei, D. (2009), “Reading Tea Leaves: How Humans Interpret Topic Models,” in Advances in Neural Information Processing Systems 22, pp. 288–296.
- Eisenstein, J., Ahmed, A., and Xing, E. P. (2011), “Sparse Additive Generative Models of Text,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1041–1048.
- Harman, D. (1992), “Overview of the First Text Retrieval Conference (TREC-1),” in Proceedings of the First Text Retrieval Conference (TREC-1), pp. 1–20.
- Hotelling, H. (1936), “Relations Between Two Sets of Variates,” Biometrika, 28, 321–377.
- Hu, Y., Boyd-Graber, J., and Satinoff, B. (2011), “Interactive Topic Modeling,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 248–257.
- Jia, J., Miratrix, L., Yu, B., Gawalt, B., El Ghaoui, L., Barnesmoore, L., and Clavier, S. (2014), “Concise Comparative Summaries (CCS) of Large Text Corpora With a Human Experiment,” Annals of Applied Statistics, 8, 499–529.
- Jolliffe, I. T. (1986), Principal Component Analysis, New York: Springer-Verlag.
- Kanehisa, M., and Goto, S. (2000), “KEGG: Kyoto Encyclopedia of Genes and Genomes,” Nucleic Acids Research, 28, 27–30.
- Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004), “RCV1: A New Benchmark Collection for Text Categorization Research,” Journal of Machine Learning Research, 5, 361–397.
- Liu, J. S., and Wu, Y. N. (1999), “Parameter Expansion for Data Augmentation,” Journal of the American Statistical Association, 94, 1264–1274.
- McCallum, A., Rosenfeld, R., Mitchell, T., and Ng, A. (1998), “Improving Text Classification by Shrinkage in a Hierarchy of Classes,” in Proceedings of the 15th International Conference on Machine Learning, pp. 359–367.
- McLachlan, G., and Peel, D. (2000), Finite Mixture Models, New York: Wiley.
- Meng, X., and Rubin, D. B. (1991), “Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm,” Journal of the American Statistical Association, 86, 899–909.
- Mimno, D., Li, W., and McCallum, A. (2007), “Mixtures of Hierarchical Topics With Pachinko Allocation,” in Proceedings of the 24th International Conference on Machine Learning, pp. 633–640.
- Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011), “Optimizing Semantic Coherence in Topic Models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272.
- Mosteller, F., and Wallace, D. (1964), Inference and Disputed Authorship: The Federalist, Reading, MA: Addison-Wesley.
- Mosteller, F., and Wallace, D. (1984), Applied Bayesian and Classical Inference: The Case of “The Federalist” Papers, New York: Springer-Verlag.
- Neal, R. (2011), “MCMC Using Hamiltonian Dynamics,” in Handbook of Markov Chain Monte Carlo, eds. S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, Boca Raton, FL: Chapman & Hall/CRC Press, pp. 113–162.
- Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010), “Automatic Evaluation of Topic Coherence,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108.
- Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (2000), “Text Classification From Labeled and Unlabeled Documents Using EM,” Machine Learning, 39, 103–134.
- Perotte, A., Bartlett, N., Elhadad, N., and Wood, F. (2012), “Hierarchically Supervised Latent Dirichlet Allocation,” in Advances in Neural Information Processing Systems 24, pp. 2609–2617.
- Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009), “Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 248–256.
- Rubin, T., Chambers, A., Smyth, P., and Steyvers, M. (2012), “Statistical Topic Models for Multi-Label Document Classification,” Machine Learning, 88, 157–208.
- Sandhaus, E. (2008), The New York Times Annotated Corpus, Philadelphia, PA: Linguistic Data Consortium.
- Sohn, K., and Xing, E. P. (2009), “A Hierarchical Dirichlet Process Mixture Model for Haplotype Reconstruction From Multi-Population Data,” Annals of Applied Statistics, 3, 791–821.
- Wallach, H., Mimno, D., and McCallum, A. (2009), “Rethinking LDA: Why Priors Matter,” in Advances in Neural Information Processing Systems 22, pp. 1973–1981.
- Zhu, J., Ahmed, A., and Xing, E. P. (2012), “MedLDA: Maximum Margin Supervised Topic Models,” Journal of Machine Learning Research, 13, 2237–2278.
- Zhu, J., and Xing, E. P. (2012), “Sparse Topical Coding,” arXiv:1202.3778.