Abstract.
Empirical studies of natural language have demonstrated that word frequencies follow power-law distributions, yet standard statistical models often fail to capture this property. The Pitman-Yor process (PYP), a Bayesian nonparametric model capable of generating power-law distributions, has been widely used in probabilistic topic models to handle data with an unbounded number of components. However, existing PYP topic models rarely account for the relationships between topics, for which hidden Markov models (HMMs) are a popular modeling choice. To address this limitation, we propose a probabilistic topic model that combines an HMM with Pitman-Yor priors. Posterior inference is performed using variational Bayes methods. We apply our method to text categorization and compare it with two related topic models: the hidden Markov topic model and the hierarchical PYP topic model.
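The power-law behavior of the PYP mentioned above can be illustrated with its sequential "Chinese restaurant" construction. The sketch below is not from the paper; the function name `sample_pyp` and the default parameter values are illustrative, assuming the standard two-parameter seating rule with discount d and concentration theta.

```python
import random

def sample_pyp(n, d=0.5, theta=1.0, seed=0):
    """Seat n customers via the two-parameter Chinese restaurant process,
    the sequential construction of the Pitman-Yor process.
    d: discount (0 <= d < 1); theta: concentration (theta > -d)."""
    rng = random.Random(seed)
    counts = []  # customers at each table (= frequency of each "word type")
    for i in range(n):
        k = len(counts)
        # New table with probability (theta + d*k) / (theta + i);
        # existing table j with probability (counts[j] - d) / (theta + i).
        if i == 0 or rng.random() < (theta + d * k) / (theta + i):
            counts.append(1)
        else:
            j = rng.choices(range(k), weights=[c - d for c in counts])[0]
            counts[j] += 1
    return sorted(counts, reverse=True)

freqs = sample_pyp(5000)
# Sorted table sizes decay roughly as a power law: a few very large
# tables and a long tail of singletons.
print(freqs[:5], "number of tables:", len(freqs))
```

With d > 0 the expected number of distinct tables grows polynomially in n (rather than logarithmically, as for the Dirichlet process, the d = 0 case), which is what lets PYP priors match the heavy-tailed word-frequency distributions of natural language.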
Acknowledgments
The authors thank the editor, the associate editor, and a referee for their constructive comments and suggestions that helped to improve the paper.
Disclosure statement
No potential conflict of interest was reported by the author(s).