Systematic framework for short text classification based on improved TWE and supervised MCFS topic merging strategy

Baoshan Sun, Mengying Ge, Peng Zhao & Chunqing Li
Pages 401-413 | Received 14 Apr 2019, Accepted 23 Apr 2020, Published online: 06 May 2020

Abstract

Text classification helps people discover valuable information hidden in text collections, and many previous studies have achieved excellent results on traditional text classification tasks. With the rise of new social media, however, large numbers of short texts appear on the Internet. Because short texts are sparse, many classification algorithms that perform well on long texts struggle to achieve satisfactory results on short texts. It is therefore important to find a method that computes effective vector representations of words and overcomes the feature sparseness problem. Accordingly, we approach the problem from two directions: improving the quality of word vector representations and enhancing the classification itself. This paper proposes a systematic framework for improving short text classification performance. In our framework, we first build a topic model with Latent Dirichlet Allocation (LDA) on a universal dataset drawn from Wikipedia and use this model to perform topic inference on short texts. We then propose an improved scheme of topical word embedding (TWE) to learn vector representations of both words and topics, in which the word in the current word-topic pair predicts its contextual words and the topic in the same pair predicts its contextual topics. In addition, the supervised Multi-Cluster Feature Selection (MCFS) algorithm is employed for topic selection, and we propose a topic merging strategy based on MCFS. After topic selection and merging, short text matrices are generated from the vector representations of words and topics and fed into a convolutional neural network (CNN). On an open short text classification dataset, we compare the proposed framework with various baselines, and the experimental results demonstrate the effectiveness of our method.
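To make the pipeline concrete, the following is a minimal sketch, not the authors' implementation: gensim's LdaModel and a Word2Vec pseudo-token trick stand in for the improved TWE training, the MCFS-based topic selection and merging step is omitted, and the toy corpus, hyper-parameters, and small Keras CNN are illustrative assumptions (gensim 4.x and TensorFlow 2.x APIs assumed).

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel, Word2Vec
from tensorflow.keras import layers, models

# Toy labelled short-text corpus (in the paper, LDA is trained on Wikipedia).
texts = [["cheap", "flights", "to", "rome"],
         ["football", "match", "ends", "in", "draw"],
         ["book", "hotel", "near", "airport"],
         ["team", "wins", "league", "title"]]
labels = np.array([0, 1, 0, 1])

num_topics, emb_dim, max_len = 5, 50, 8

# 1) Train an LDA topic model and infer each short text's dominant topic.
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=bows, id2word=dictionary, num_topics=num_topics, passes=10)

def doc_topic(bow):
    """Return the most probable topic id for a document."""
    return max(lda.get_document_topics(bow, minimum_probability=0.0),
               key=lambda p: p[1])[0]

doc_topics = [doc_topic(b) for b in bows]

# 2) TWE-style embeddings: append each topic as a pseudo-token so that words
#    and topics share one embedding space (a stand-in for the improved TWE
#    objective described in the abstract).
sentences = [tokens + [f"TOPIC_{t}"] for tokens, t in zip(texts, doc_topics)]
twe = Word2Vec(sentences=sentences, vector_size=emb_dim, window=5,
               min_count=1, epochs=50)

# 3) Build a fixed-size matrix per short text: each row is [word vec ; topic vec].
def text_matrix(tokens, topic):
    topic_vec = twe.wv[f"TOPIC_{topic}"]
    rows = [np.concatenate([twe.wv[w], topic_vec]) for w in tokens[:max_len]]
    rows += [np.zeros(2 * emb_dim)] * (max_len - len(rows))  # zero padding
    return np.stack(rows)

X = np.stack([text_matrix(t, k) for t, k in zip(texts, doc_topics)])

# 4) A small 1-D CNN classifier over the word/topic matrices.
cnn = models.Sequential([
    layers.Input(shape=(max_len, 2 * emb_dim)),
    layers.Conv1D(32, kernel_size=2, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(2, activation="softmax"),
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
cnn.fit(X, labels, epochs=20, verbose=0)
print(cnn.predict(X).argmax(axis=1))
```

In the paper itself the topic set is first pruned and merged with the supervised MCFS strategy before the word/topic matrices are assembled; the sketch above skips that step and simply uses every inferred topic.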

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 51378350, No. 61173032) and the Science and Technology Commissioner Project of Tianjin (No. 15JCTPJC58100).

Data Availability

The experimental data used or analyzed during the current study are available from the corresponding author on reasonable request.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [Grant Numbers 51378350 and 61173032].

Notes on contributors

Baoshan Sun

Baoshan Sun received the Ph.D. degree from Tianjin Polytechnic University, China. He joined the School of Computer Science and Technology, Tianjin Polytechnic University, as an associate professor. He has completed two national research projects, three ministerial-level projects, and three transverse projects. He has published over 10 papers in leading academic journals, contributed to three computer science textbooks, and holds three invention patents. He is currently undertaking three national research projects and two provincial and ministerial-level projects. His research mainly focuses on machine learning, natural language processing, and computer networks.

Mengying Ge

Mengying Ge is currently pursuing an M.Eng. degree at the School of Computer Science and Technology, Tianjin Polytechnic University. She works on machine learning algorithms for natural language processing. Her research interests include visual question answering, neural networks, deep learning, and natural language processing.

Peng Zhao

Peng Zhao is currently pursuing a master's degree at the School of Computer Science and Technology, Tianjin Polytechnic University. He works on machine learning algorithms for natural language processing. His research interests include machine learning, neural networks, deep learning, and natural language processing.

Chunqing Li

Chunqing Li received the Ph.D. degree from Tianjin Polytechnic University, China. He joined the School of Computer Science and Technology, Tianjin Polytechnic University, as a professor. He has completed one national research project and one ministerial-level project. He has published over 20 papers in leading academic journals and contributed to one software engineering textbook. He is currently undertaking one national research project and has served as a peer reviewer for several international academic conferences. His research mainly focuses on machine learning, natural language processing, big data technology, and cloud computing.
