Theory and Methods

Using SVD for Topic Modeling

Pages 434-449 | Received 10 Nov 2021, Accepted 07 Sep 2022, Published online: 10 Oct 2022

Abstract

The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool for dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the data matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word-frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate for long and moderately long documents, and that it improves the rates of existing methods for short documents. The key to our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers. Supplementary materials for this article are available online.
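
The abstract outlines a pipeline of pre-SVD normalization, SVD, post-SVD (SCORE-type) normalization, vertex hunting on the embedded word cloud, and reconstruction of the topic matrix. The following is a minimal sketch of what such a pipeline could look like, assuming a p-by-n word-document frequency matrix D (columns summing to one) and K topics. The function name estimate_topic_matrix, the successive-projection step used as a stand-in for the paper's vertex-hunting algorithms, and the clipping and renormalization safeguards are illustrative assumptions, not the authors' exact procedure.

```python
# A minimal sketch of an SVD-based topic-estimation pipeline in the spirit of
# the abstract. NOT the authors' exact algorithm: the successive-projection
# vertex hunting and the numerical safeguards are stand-ins.
import numpy as np

def estimate_topic_matrix(D, K):
    """D: (p, n) matrix of empirical word frequencies (columns sum to 1); K: number of topics."""
    p, n = D.shape

    # Pre-SVD normalization: rescale rows to tame word-frequency heterogeneity.
    word_freq = np.maximum(D.mean(axis=1), 1e-12)
    D_tilde = D / np.sqrt(word_freq)[:, None]

    # Keep only the K leading left singular vectors.
    U, _, _ = np.linalg.svd(D_tilde, full_matrices=False)
    U = U[:, :K]

    # Post-SVD (SCORE-type) normalization: entrywise ratios against the first
    # singular vector give a (K-1)-dimensional embedding of the p words.
    sign = 1.0 if U[:, 0].sum() >= 0 else -1.0
    u1 = np.maximum(sign * U[:, 0], 1e-12)
    R = U[:, 1:] / u1[:, None]                     # (p, K-1) embedded word cloud

    # Vertex hunting (stand-in): successive projection on the lifted cloud picks
    # K rows of R that roughly span the simplex containing the embedded words.
    Y = np.hstack([R, np.ones((p, 1))])
    vertices = []
    for _ in range(K):
        j = int(np.argmax((Y ** 2).sum(axis=1)))
        vertices.append(j)
        v = Y[j] / np.linalg.norm(Y[j])
        Y = Y - np.outer(Y @ v, v)
    V = R[vertices]                                # (K, K-1) estimated vertices

    # Barycentric coordinates of each embedded word with respect to the K vertices.
    Pi = np.linalg.solve(np.hstack([V, np.ones((K, 1))]).T,
                         np.hstack([R, np.ones((p, 1))]).T).T
    Pi = np.clip(Pi, 0.0, None)                    # truncate small negative weights

    # Undo the normalizations and rescale columns to obtain a topic matrix.
    A_hat = Pi * (u1 * np.sqrt(word_freq))[:, None]
    return A_hat / A_hat.sum(axis=0, keepdims=True)
```

Note that the routine touches the corpus only through its K leading singular vectors, which is the source of the memory and speed advantage mentioned in the abstract.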

Supplementary Materials

The supplementary material contains the pseudo-code of two vertex hunting algorithms, some real data results not included in the main paper, and the proofs of all theorems and lemmas.

Acknowledgments

The authors thank the Associate Editor and two anonymous referees for helpful comments. The authors thank Jiashun Jin and John Lafferty for reading an early draft of the article and giving many useful comments. The authors thank Pengsheng Ji for sharing the SLA data. Z. Ke thanks Art Owen for useful comments on the real data results. Z. Ke also thanks Rina Barber, Chao Gao and John Lafferty for helpful discussions in the HELIOS reading group, which inspired her to work on topic modeling.

Data Availability Statement

Data and code for reproducing the numerical results of this article can be found on GitHub (https://github.com/ZhengTracyKe/TopicSCORE).

Notes

1 Multinomial(N, v) denotes the multinomial distribution with N being the number of trials and v being the vector of event probabilities.

2 We modify π̂j* to π̂j to obtain an eligible weight vector (a possible realization is sketched below). Note that π̂j differs from π̂j* only if r̂j lies outside the estimated simplex; the fraction of such r̂j's is small.
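
As a concrete (assumed) reading of note 2: an "eligible" weight vector has nonnegative entries summing to one, so a natural modification, when r̂j falls outside the estimated simplex, is to truncate negative barycentric coordinates and renormalize. The helper below, make_eligible, is an illustrative assumption, not necessarily the exact rule used in the paper.

```python
import numpy as np

def make_eligible(pi_star, eps=0.0):
    """Turn a raw weight vector into a probability vector (illustrative rule only).

    Entries below eps are truncated to zero and the result is renormalized so
    that it is nonnegative and sums to one. If pi_star already lies in the
    simplex, it is returned (numerically) unchanged.
    """
    pi = np.clip(np.asarray(pi_star, dtype=float), eps, None)
    total = pi.sum()
    if total <= 0:                       # degenerate case: fall back to uniform weights
        return np.full_like(pi, 1.0 / pi.size)
    return pi / total
```

A Euclidean projection onto the probability simplex would be an alternative; truncate-and-renormalize is simply the cheapest choice.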

Additional information

Funding

The research of Z. Ke is partially supported by the NSF CAREER grant DMS-1943902.
