Abstract
The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix, so singular value decomposition (SVD) is a natural tool for dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the data matrix, which gives it a substantial advantage in memory use and computational cost on large-scale corpora. The core ideas behind our method are a pre-SVD normalization that tackles severe word-frequency heterogeneity, a post-SVD normalization that creates a low-dimensional word embedding exhibiting a simplex geometry, and a post-SVD procedure that constructs an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method: it attains the optimal rate for long and moderately long documents, and it improves on the rates of existing methods for short documents. The key to our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers. Supplementary materials for this article are available online.
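The SVD pipeline described above (pre-SVD normalization, truncated SVD, post-SVD normalization) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `topic_score_sketch` is hypothetical, the pre-SVD step is taken here to be a division of each word's row by the square root of its empirical frequency, and the post-SVD step is a SCORE-type entrywise ratio against the leading singular vector.

```python
import numpy as np

def topic_score_sketch(D, K):
    """Hypothetical sketch of the SVD pipeline.

    D : (p, n) nonnegative word-document frequency matrix
    K : number of topics
    Returns a (p, K-1) word embedding expected to exhibit
    a simplex geometry under the topic model.
    """
    # Pre-SVD normalization: damp severe word-frequency heterogeneity
    # by scaling each word's row by the inverse square root of its
    # overall frequency (an assumed form of the normalization).
    m = D.mean(axis=1)
    D_norm = D / np.sqrt(np.maximum(m, 1e-12))[:, None]

    # Keep only the K leading left singular vectors of the data matrix.
    U, _, _ = np.linalg.svd(D_norm, full_matrices=False)
    U = U[:, :K]

    # Post-SVD (SCORE-type) normalization: divide singular vectors
    # 2..K entrywise by the first one, yielding a (K-1)-dimensional
    # embedding of the word cloud.
    R = U[:, 1:] / U[:, [0]]
    return R
```

Vertex hunting on the embedded word cloud `R` (recovering the simplex vertices) would then yield the topic-matrix estimate; that step is omitted here.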
Supplementary Materials
The supplementary material contains the pseudo-code of two vertex-hunting algorithms, additional real-data results not included in the main article, and the proofs of all theorems and lemmas.
Acknowledgments
The authors thank the Associate Editor and two anonymous referees for helpful comments. The authors thank Jiashun Jin and John Lafferty for reading an early draft of the article and giving many useful comments. The authors thank Pengsheng Ji for sharing the SLA data. Z. Ke thanks Art Owen for useful comments on the real data results. Z. Ke also thanks Rina Barber, Chao Gao and John Lafferty for helpful discussions in the HELIOS reading group, which inspired her to work on topic modeling.
Data Availability Statement
Data and code for reproducing the numerical results of this article can be found at GitHub (https://github.com/ZhengTracyKe/TopicSCORE).
Notes
1 Multinomial(N, v) denotes the multinomial distribution with N being the number of trials and v being the vector of event probabilities.
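The distribution in this note can be illustrated with NumPy; the specific trial count and probability vector below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                              # number of trials (words in a document)
v = np.array([0.5, 0.3, 0.2])        # event (word) probabilities, summing to 1
doc = rng.multinomial(N, v)          # one draw: a vector of word counts
# doc has one count per event, and the counts sum to N
```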
2 We modify to , to get an eligible weight vector. Note that differs from only if is outside the estimated simplex. The fraction of such ’s is small.