Research Article

CFSE: a Chinese short text classification method based on character frequency sub-word enhancement

Article: 2263663 | Received 08 Jun 2023, Accepted 21 Sep 2023, Published online: 06 Oct 2023

Abstract

As a foundational task of natural language processing, text classification is widely used in information retrieval, public opinion analysis, and other related tasks. To address the feature sparsity of Chinese short texts, which degrades classification accuracy, this paper proposes a Chinese short text classification method based on Character Frequency Sub-word Enhancement (CFSE). First, the initial Chinese-character sequence is mapped to the corresponding Character Frequency Sub-word (CFS) sequence based on global character frequency information. Second, the relationship features among data are extracted by processing the CFS sequence with BiLSTM-Att, and the semantic features of the initial Chinese-character sequence are obtained through ERNIE. Finally, the two kinds of features are fused and fed into a text classifier to obtain the classification result. Experimental results show that the proposed method improves the classification accuracy of Chinese short texts.

1. Introduction

The text classification task has a significant impact on information retrieval, sentiment analysis, and related tasks. It provides accurate search results in search services and helps understand users' query intentions; in social media sentiment analysis, it determines the emotional tendency of comments and improves the ability to monitor public opinion (Yan et al., Citation2022; Zhang et al., Citation2021). It can also be used for spam filtering and product review analysis, improving user experience and enterprise service quality. In short, text classification has a wide range of application scenarios (Dai et al., Citation2022). Short texts are shorter and simpler in structure; they are easy to process in real time, suit mobile devices and information extraction, and match ordinary users' preference for concise, fast communication. Short text is therefore used even more widely than long text in daily interaction scenarios (Wu et al., Citation2013) such as social media, instant messaging, and search services.

However, the shorter length and simpler structure of short texts cause feature sparsity (Basabain et al., Citation2023; Feng et al., Citation2019; Zhang et al., Citation2022). The problem is common to languages with large vocabularies, such as Chinese, English, and French, and manifests mainly as insufficient contextual information, missing character-order information, and semantic ambiguity, all of which make feature learning for short texts difficult (Li, Xiang, et al., Citation2022; Zhou et al., Citation2022). Two main ideas address the problem: strengthening the learning ability of the model, for example, a bidirectional convolutional recurrent network that fuses GRU and Bi-LSTM to better mine the feature information of the data (Onan, Citation2022); or increasing the amount of input information through external knowledge, for example, providing more effective features by repairing contextual information (Zhang et al., Citation2016) or enhancing semantic features based on the internal properties of characters (Li, Zhu, et al., Citation2023; Meng et al., Citation2023).

Different from existing research, this paper proposes CFSE based on the second optimisation idea. The core of the method is to use the CFS sequence, obtained by a one-way mapping of the initial text sequence, to enhance the feature information of the input. A relationship feature extracted from this sequence enriches the feature information of the data and thus mitigates the negative effects of sparse features in Chinese short texts. The purpose of this paper is to improve the classification accuracy of Chinese short text and thereby provide better technical support for downstream tasks. The process of CFSE is shown in Figure 1. Firstly, the character-frequency dictionary of the initial corpus is established based on global character frequency information, and the CFS corpus containing CFS sequences is built from the dictionary. Secondly, the relationship features among data are extracted by processing the CFS corpus with BiLSTM-Att, the semantic features of the initial corpus are obtained through ERNIE, and these two types of features are fused into composite features. Finally, the text classification results are obtained by a text classifier processing the composite features. The main contributions of this paper include:

  1. Propose a method for mining hidden features among data. Different from existing extraction methods based on data semantics, this paper proposes relationship features based on the CFS sequence, providing a new way to improve the feature integrity of Chinese short texts.

  2. Construct a Chinese short text classification model based on CFS enhancement. By fusing relationship features with semantic features, the feature integrity of the data is improved, and so is the accuracy of Chinese short text classification.

Figure 1. Model framework.

The remainder of the paper is structured as follows: Section 2 reviews work related to text classification. Section 3 describes CFS and relationship features. Section 4 covers data feature acquisition. Section 5 introduces the model construction. Section 6 presents the experiments and analysis. Section 7 concludes the paper.

2. Related works

2.1. Previous text classification methods

Previous text classification methods can be divided into genres such as rule-based and statistics & probability; the machine learning genre developed within statistics & probability, and the current deep learning genre developed from it in turn. It should be noted that there are no clear boundaries between these genres, and their development has overlapped.

In the rule-based genre, text classification relies on hand-crafted features or rules. A representative method determines the text category by matching predefined keywords and key phrases (Salton & Buckley, Citation1988). Another representative approach builds a rule classifier such as CART (Breiman et al., Citation1984). Rule-based classification methods have strong interpretability and high flexibility, which attracted many researchers in the early stages of text processing. However, due to the heavy dependence on artificial rules, the inability to model complex relationships, and the difficulty of generalisation, research on such methods has gradually declined.

In parallel, genres based on statistics & probability emerged. These methods let the classification model learn feature representations and classification rules from a training data set. Classic methods of this period include KNN (Cover & Hart, Citation1967), Naive Bayes (McCallum & Nigam, Citation1998), Support Vector Machines (Joachims, Citation1998), and Random Forests (Breiman, Citation2001). Statistical and probabilistic methods are simple, effective, scalable, and insensitive to feature selection. However, they still leave obvious room for improvement in learning the semantic and structural features of data and in handling data sparsity (Jian, Citation2023).

2.2. Methods based on deep learning

With the development and iteration of neural networks, neural models have been applied to text classification. Kim (Citation2014) applied convolutional neural networks (CNN) to text classification, mining and learning text features through multi-layer convolution. Liu et al. (Citation2016) optimised the text classification model by adding a data-sharing layer within long short-term memory networks (LSTM), based on multi-task learning and data-sharing mechanisms. Lai et al. (Citation2015) combined the complementary strengths of CNN and LSTM, further pooling the features extracted by LSTM to effectively improve accuracy on document-level classification. Johnson and Zhang (Citation2017) proposed DPCNN to widen the model's view and address CNN models' inability to capture long-distance word dependencies. Building on the attention mechanism, Zhou et al. (Citation2016) integrated attention into LSTM to better mine data features and improve classification accuracy.

Subsequently, with the development of the Transformer framework (Vaswani et al., Citation2017), large pre-trained models such as BERT (Devlin et al., Citation2019) and ERNIE (Zhang et al., Citation2019) were applied to text classification tasks. Fu et al. (Citation2022) integrated text topic features into a BERT-based short text classification task on Chinese electric work orders. For classifying short Chinese texts from online collaborative discussion, Li, Deng, et al. (Citation2023) extracted two kinds of semantic features using BERT and hypergraph convolutional networks respectively and improved semantic feature integrity through multi-feature fusion. For Chinese news short text classification, Wang et al. (Citation2022) further processed RoBERTa character encodings with attention pooling, deeply mining the hidden features of the text to improve classification accuracy. For multi-class classification of Chinese legal texts and public health question texts, Li, Su, et al. (Citation2023) built on BERT and used multi-granularity information relations to improve feature integrity and classification accuracy. Shunxiang et al. (Citation2023) introduced emotional features into the task of detecting fake English restaurant reviews, using a strong-and-weak emotion set strategy to improve detection accuracy.

The above methods provide effective optimisation schemes for short text classification from the perspectives of model optimisation and external knowledge enhancement. Different from existing work, this paper explores a kind of feature information usable for information enhancement from the perspective of character frequency mapping. Based on these considerations, this paper proposes CFSE.

3. CFS and relationship feature

The CFS is a kind of fuzzy granularity information (Rohidin et al., Citation2022) between character granularity and word granularity, obtained by clustering characters according to their frequency. CFS can enhance the correlation information among data, and CFS sequences allow the relationship features of data to be extracted more effectively. Relationship features alleviate the sparse feature information of Chinese short texts by enhancing the amount of feature information.

3.1. CFS

Definition 3.1:

A CFS is a sub-word unit that contains one or more characters; all characters in the same CFS share the same value of the character-frequency attribute.

Here, keys are characters and values are their frequencies (the character-frequency dictionary built in Section 3.4). Keys are grouped by value into sub-word units, and such a sub-word unit is called a CFS. A CFS may contain multiple keys because some values are equal. Therefore, the following relationships exist:

  • A key corresponds to exactly one CFS.

  • A CFS corresponds to one or more keys.

Let the character set be A and the CFS set be B, where the elements of A are characters and the elements of B are CFS. (1) \(\mathrm{CFS}_x = \{c_1, c_2, c_3, \ldots\}\) where \(c_1\), \(c_2\) and \(c_3\) are elements of A with the same frequency attribute, and x is the corresponding value of the character-frequency attribute. The elements of A and B have the following relationship:

  • Any element \(c_x\) in A corresponds to exactly one element \(\mathrm{CFS}_x\) in B.

  • Any element \(\mathrm{CFS}_x\) in B corresponds to one or more elements in A.

3.2. CFS sequence

Definition 3.2:

The CFS sequence is a sequence whose basic granularity is the CFS. Based on the correspondence between characters and CFS, a character sequence has a unique corresponding CFS sequence.

Its mathematical form can be expressed as: (2) \(\text{character sequence} = [c_1, c_2, \ldots, c_i, \ldots, c_n] = \sum_{i=1}^{n} c_i\) where \(c_i\) is the i-th character of the character sequence. (3) \(\text{CFS sequence} = f\big(\sum_{i=1}^{n} c_i\big) = \sum_{i=1}^{n} \mathrm{CFS}_i\) where f represents the mapping from characters to CFS, and \(\mathrm{CFS}_i\) is the i-th element of the CFS sequence.

Definition 3.3:

CFS corpus is a collection of CFS sequences. There is a unique correspondence between the data items of the initial corpus and those of the CFS corpus. Its mathematical form can be expressed as: (4) \(\text{initial corpus} = [\text{text}_1, \ldots, \text{text}_i, \ldots, \text{text}_n]^{T} = \sum_{i=1}^{n} \text{text}_i\) where \(\text{text}_i\) represents the i-th data item of the initial corpus, i.e. the i-th character sequence. (5) \(\text{CFS corpus} = f\big(\sum_{i=1}^{n} \text{text}_i\big) = \sum_{i=1}^{n} \mathrm{CFStext}_i\) where f represents the mapping from characters to CFS, and \(\mathrm{CFStext}_i\) represents the i-th data item of the CFS corpus.

From the perspective of relational information, a special case is adopted to further explain the above definition.

Hypothesis 3.1

The character-frequency attribute of c1, c2 and c3 in the initial corpus is i (0 < i ≤ n, where n is the largest value of the character-frequency attribute).

According to Definition 3.1, all three characters belong to \(\mathrm{CFS}_i\): (6) \(c_1, c_2, c_3 \in \mathrm{CFS}_i\)

Hypothesis 3.2

Let c1, c2 and c3 appear only once in each character sequence of the initial corpus, and let no two of the three characters co-occur.

As a node center, c1 points to i statement sequences, and based on this pointing relationship the corresponding i statement sequences are associated. The same holds for c2 and c3, but the three sets of correlations are mutually independent. (7) \(|\mathrm{pointset}_{c_1}| = |\mathrm{pointset}_{c_2}| = |\mathrm{pointset}_{c_3}| = i\) and (8) \(\mathrm{pointset}_{c_1} \cap \mathrm{pointset}_{c_2} \cap \mathrm{pointset}_{c_3} = \varnothing\) Based on Hypothesis 3.1, the frequency attribute of \(\mathrm{CFS}_i\) in the CFS corpus is \(3i\); that is, as a node center, \(\mathrm{CFS}_i\) points to \(3i\) statement sequences, and based on this pointing relation the associations among the corresponding \(3i\) statement sequences are no longer independent. (9) \(|\mathrm{pointset}_{\mathrm{CFS}_i}| = 3i\) Example: if “你”, “我” and “他” conform to Hypotheses 3.1 and 3.2, and i = 100, then these three characters are assigned to CFS100.

Based on the above three characters, three separate association relation sets can be obtained, and each set contains 100 association statements.

Convert the character sequence of the text to a CFS sequence: “你”, “我” and “他” are all converted to CFS100.

A set of 300 statement sequences can then be retrieved through the reference relation of CFS100. (12) \(|\mathrm{pointset}_{\mathrm{CFS}_{100}}| = 300\) Based on the correspondence between characters and CFS, the text sequences in the initial corpus are converted into corresponding CFS sequences, and the hidden correlations among text sequences are enhanced by means of the CFS sequences.

When data is converted from a character sequence to a CFS sequence, the change in granularity destroys the semantic order of the data, causing a loss of semantic information. However, the correlation among data sequences is enhanced through each CFS's pointing relationship to CFS sequences. Furthermore, this transformation reduces the size of the basic-granularity vocabulary, which further increases the correlation among data sequences and thereby alleviates the sparseness of data features. A more intuitive analysis appears in Experiment 1 (Section 6.2).

3.3. Relationship feature

Definition 3.4:

A relationship feature is a kind of data feature extracted using the CFS sequence corresponding to the initial text.

Based on the corresponding CFS sequence of the character sequence, the relationship features of the data can be extracted.

Suppose the character sequence of a text has n characters: (13) \(\text{Text} = [c_1, c_2, c_3, \ldots, c_n]\) If the pointset of any character \(c_i\) has size \(X_i\), the relationship information T contained in the features obtained from the character sequence can be represented as: (14) \(T = X_1 + X_2 + X_3 + \cdots + X_n = \sum_{i=1}^{n} X_i\) Converting the text to the corresponding CFS sequence gives (15) \(\mathrm{CFS} = f([c_1, c_2, c_3, \ldots, c_n]) = [\mathrm{CFS}_{c_1}, \mathrm{CFS}_{c_2}, \mathrm{CFS}_{c_3}, \ldots, \mathrm{CFS}_{c_n}]\) The relationship information contained in the features obtained from the CFS sequence can be represented as (16) \(T_{\mathrm{CFS}} = \sum_{i=1}^{n} X_i^{\mathrm{CFS}}\) where (17) \(X_i^{\mathrm{CFS}} = k \times X_i\) and k is the number of characters sharing the character-frequency attribute of \(c_i\).
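To make the pointset arithmetic concrete, the following minimal Python sketch (our own illustration; the names pointset and char_freq are not from the paper) counts the relationship information of a text at character granularity and at CFS granularity on a toy corpus. It uses set semantics for the CFS pointset rather than the multiplicative form of Formula 17, but it exhibits the same enhancement effect.

```python
from collections import defaultdict

corpus = ["你好吗", "我很好", "他在吗"]  # toy corpus of character sequences

# pointset of a character: the set of sentence indices it points to
pointset = defaultdict(set)
for idx, text in enumerate(corpus):
    for ch in text:
        pointset[ch].add(idx)

# global character frequency (the CFS grouping key)
char_freq = {ch: sum(text.count(ch) for text in corpus) for ch in pointset}

# pointset of a CFS: union of the pointsets of all characters sharing a frequency
cfs_pointset = defaultdict(set)
for ch, freq in char_freq.items():
    cfs_pointset[freq] |= pointset[ch]

# relationship information of one text, in the spirit of Formulas 14 and 16
text = corpus[0]
T = sum(len(pointset[ch]) for ch in text)                      # character granularity
T_cfs = sum(len(cfs_pointset[char_freq[ch]]) for ch in text)   # CFS granularity
print(T, T_cfs)  # here 5 and 9: the CFS granularity links more sentences
```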

As mentioned at the end of Section 3.2, converting a character sequence to a CFS sequence destroys the semantic order of the text, causing the data to lose considerable semantic information; at the same time, the hidden association information between this data item and other data items is enhanced.

In this paper, the feature information obtained based on the CFS sequence is called relationship feature.

3.4. Pre-processing

The relationship features of the data are extracted from the CFS corpus corresponding to the text data. Building the CFS corpus requires character frequency information, which comes from a global character-frequency dictionary computed over the whole initial corpus. Using this dictionary, the character sequences of the text are converted into CFS sequences.

The process of acquiring CFS corpus is as follows:

  1. Extract non-labeled data information from the initial corpus and concatenate all data into a single string;

  2. Extract all characters of the string and the frequency information corresponding to each character (both punctuation and non-Chinese characters are considered valid character information);

  3. Build the character-frequency dictionary based on the extracted characters and character frequency information (key = character, value = character frequency);

  4. Replace each character (key) in the non-label information of the initial corpus with its dictionary value, generating the value corpus corresponding to the initial corpus (Figure 2).

Figure 2. Process of obtaining CFS corpus.

Because character frequencies in the initial corpus overlap, character-granularity units with the same frequency fall into the same CFS. The basic granularity of the value corpus is therefore the CFS, so it is called the CFS corpus.
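The four preprocessing steps can be summarised in a short Python sketch. This is a minimal illustration under our own assumptions (the corpus is a list of (text, label) pairs, and a CFS is represented directly by its frequency value); it is not the authors' released code.

```python
from collections import Counter

def build_cfs_corpus(corpus):
    """Map each character sequence to its CFS sequence (Section 3.4).

    corpus: list of (text, label) pairs; punctuation and non-Chinese
    characters are treated as valid characters, as in step 2.
    """
    # Steps 1-2: concatenate all non-label text and count character frequencies
    all_text = "".join(text for text, _ in corpus)
    char_freq = Counter(all_text)  # step 3: key = character, value = frequency

    # Step 4: replace every character by its frequency value; characters that
    # share a frequency collapse into the same CFS, shrinking the vocabulary
    cfs_corpus = [([char_freq[ch] for ch in text], label) for text, label in corpus]
    return char_freq, cfs_corpus

char_freq, cfs_corpus = build_cfs_corpus([("今天天气很好", 1), ("天气不好", 0)])
print(cfs_corpus)  # "气" and "好" both occur twice, so they share one CFS id
```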

3.5. Relationship feature representation ability test

To the best of our knowledge, there is no prior literature on the usability of relationship features in deep learning models. Therefore, this paper validates the information representation ability of relationship features using existing mainstream text classification models covering three main feature extractors: RNN, CNN, and Transformer. The models used are TextCNN, TextRNN, TextRNN_Att, TextRCNN, DPCNN, and Transformer.

Testing multiple models with multiple feature extractors fully verifies the generalisation ability and effectiveness of relationship features, which provides feasibility support for the CFS-enhanced Chinese short text classification model.

Detailed experiments and analyses are given in Section 6.3; they show that relationship features are effective for the Chinese short text classification task.

4. Data feature acquisition

Based on the work in Section 3, this paper considers using association information to enrich the data features and improve the classification accuracy of Chinese short texts.

Based on the preprocessing method in Section 3.4, the CFS corpus of the initial corpus is extracted, and the composite features of the text are then obtained from the two corpora. Composite features are formed by fusing relationship features and semantic features; the extraction and fusion of the two features are described in Sections 4.1–4.3.

4.1. Relationship features

At present, there is no pre-trained model available for encoding relationship features. As shown in Figure 3, BiLSTM-Att extracts the forward context feature \(\overrightarrow{h}\) and the backward context feature \(\overleftarrow{h}\) of each CFS through a bidirectional long short-term memory network (BiLSTM), obtains the contextual feature h by feature concatenation, and optimises the weights of the CFS vectors within the sentence through the attention mechanism, so as to obtain a better relationship-feature encoding. Therefore, this paper uses BiLSTM-Att as the relationship feature extractor for CFS data.

Figure 3. Structure chart of BiLSTM-Att.

Given a CFS sequence \(S_1 = \{a_1, a_2, \ldots, a_i, \ldots, a_n\}\), for any CFS \(a_i\), its value in the embedding matrix is first looked up: (18) \(A_{a_i} \in \mathbb{R}^{a_w \times |V|}\) where \(|V|\) is the size of the vocabulary (1490 in the experimental data environment of this paper), \(a_w\) is the embedding dimension of the CFS vector, and the matrix \(A_{a_i}\) is a parameter to be learned. \(a_i\) is converted to an embedding vector \(e_i\) by the embedding matrix; thus, the CFS sequence is converted to the real-valued vector sequence \(\mathrm{embs}(S_1) = \{e_1, e_2, e_3, \ldots, e_n\}\).

The resulting sequence of real-valued vectors is input into the BiLSTM-Att model, and the output is the relationship feature vector \(O_C\), adjusted by the attention mechanism with contextual information.
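As a minimal sketch of this extractor, the following PyTorch module embeds a CFS sequence, runs a BiLSTM, and pools the hidden states with a simple learned attention. The vocabulary size 1490 comes from the paper; the embedding and hidden dimensions and the single-linear attention scorer are our assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMAtt(nn.Module):
    """Relationship-feature extractor sketch (Section 4.1); sizes are assumptions."""
    def __init__(self, vocab_size=1490, emb_dim=128, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # matrix A of Formula 18
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att_w = nn.Linear(2 * hidden, 1)                # attention scorer

    def forward(self, cfs_ids):                     # cfs_ids: (batch, seq_len) CFS indices
        h, _ = self.bilstm(self.embedding(cfs_ids)) # concat of forward/backward states
        alpha = torch.softmax(self.att_w(h), dim=1) # attention weights over positions
        return (alpha * h).sum(dim=1)               # relationship feature O_C

o_c = BiLSTMAtt()(torch.randint(0, 1490, (2, 32)))  # -> shape (2, 256)
```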

4.2. Semantic features

ERNIE is a pre-trained model improved from BERT for the Chinese environment. ERNIE has learned from a large amount of high-quality Chinese corpus through self-supervised learning and performs better than BERT in the Chinese domain. ERNIE uses the Transformer as its feature extractor and deeply mines the implicit information within text through a multi-level mask mechanism.

The principle of the multi-level masking mechanism is shown in Figure 4:

  • Using Basic Level Masking to mask single characters;

  • Using Phrase Level Masking to mask continuous phrases;

  • Using Entity Level Masking to recognise entities and mask the identified entities.

Figure 4. Multi-level mask mechanism.

Through the multi-level mask mechanism, ERNIE can effectively capture multi-level contextual information of characters and obtain good semantic features. Therefore, this paper obtains the semantic features of text data based on ERNIE.

Based on the vocabulary, the character information of the text corpus is converted into the corresponding index information, which is input into the ERNIE model. The output of the model is a semantic feature vector \(O_T\) that integrates multi-level contextual information (Figure 5).

Figure 5. Structure chart of ERNIE.
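A hedged sketch of semantic feature extraction with a pre-trained ERNIE via the Hugging Face transformers library is shown below. The paper does not name a specific checkpoint; nghuyong/ernie-1.0-base-zh is one publicly available Chinese ERNIE and is used here only as a stand-in, as is the choice of the [CLS] vector as \(O_T\).

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint name is an assumption: any Chinese ERNIE checkpoint with a
# transformers interface would do; the paper does not name one.
name = "nghuyong/ernie-1.0-base-zh"
tokenizer = AutoTokenizer.from_pretrained(name)
ernie = AutoModel.from_pretrained(name)

texts = ["今天股市大涨", "这家店的服务很好"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=32,
                  return_tensors="pt")
with torch.no_grad():
    out = ernie(**batch)
o_t = out.last_hidden_state[:, 0]  # [CLS] vector as the semantic feature O_T
print(o_t.shape)                   # (2, 768) for the base model
```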

4.3. Composite features

Composite features are composed of relationship features and semantic features, where the relationship features are obtained as in Section 4.1 and the semantic features as in Section 4.2. The composite feature is obtained as shown in Formula 19, and its algorithm is shown in Algorithm 1 (Figure 6). (19) \(O = [O_C, O_T]\)

Figure 6. Process for obtaining composite features.

Algorithm description: the algorithm consists of three parts. Step 1 extracts the corpus information; steps 2–3 obtain the relationship features and semantic features of the data; step 4 obtains the composite features.

The time complexity of obtaining batch data is generally O(1); that of obtaining relationship features via BiLSTM-Att is generally O(n), as is that of obtaining semantic features via ERNIE. Therefore, the total time complexity of Algorithm 1 is O(n).
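Using the two extractors sketched above, Algorithm 1 reduces to a few lines; this is our illustrative reading of the algorithm, not the authors' code.

```python
import torch

def composite_features(cfs_ids, ernie_inputs, bilstm_att, ernie):
    """Algorithm 1 sketch: fuse relationship and semantic features (Formula 19)."""
    o_c = bilstm_att(cfs_ids)                            # step 2: O_C from BiLSTM-Att
    o_t = ernie(**ernie_inputs).last_hidden_state[:, 0]  # step 3: O_T from ERNIE
    return torch.cat([o_c, o_t], dim=-1)                 # step 4: O = [O_C, O_T]
```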

5. Text classification method

In order to optimise the problem that sparse features affect the classification accuracy of Chinese short texts, this paper proposes CFSE. The core of the method is to expand the data features of Chinese short texts by means of CFS information. Firstly, the CFS corpus of the initial text corpus is obtained as described in Section 3.4. Secondly, composite features are obtained by the methods in Sections 4.1–4.3. Finally, the text's category tags are obtained using the Softmax classifier.

5.1. Softmax classifier

Softmax is a classic multi-class classifier, which obtains text classification labels through exponential normalisation.

Let the composite feature be \(O = \{X_1, X_2, X_3, \ldots, X_i\}\). O is processed by a fully connected layer to obtain the feature vector \(O_k = \{x_1, x_2, x_3, \ldots, x_k\}\), where k is the given number of categories. \(O_k\) is input into the Softmax classifier, whose output is the sentence's classification label \(y_m\), where \(m \in [1, k]\). The process of the Softmax classifier is shown in Figure 7, and its principle is given in Formulas 20–21. (20) \(y_m = \max(h(O_k)) = \max([h(x_1), h(x_2), \ldots, h(x_k)])\) where (21) \(h(x_m) = \frac{e^{x_m}}{\sum_{i=1}^{k} e^{x_i}}\) where \(h(x_m)\) is the probability that the current sentence sequence belongs to category \(y_m\).

Figure 7. Diagram of the Softmax classifier.
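In code, the classifier of Formulas 20–21 is a fully connected layer followed by a softmax; the feature sizes below are the assumed output sizes of the two extractors sketched earlier.

```python
import torch
import torch.nn as nn

k = 10                                 # number of categories
fc = nn.Linear(256 + 768, k)           # fully connected layer over the composite feature O

o = torch.randn(2, 256 + 768)          # a batch of composite features (assumed sizes)
probs = torch.softmax(fc(o), dim=-1)   # h(x_m), Formula 21
y_m = probs.argmax(dim=-1)             # predicted label y_m, Formula 20
```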

5.2. CFSE

In order to enrich the data features of Chinese short texts and improve their classification accuracy, CFSE is proposed. Firstly, the CFS corpus of the initial corpus is constructed based on global character frequency information. Secondly, the relationship features are extracted by processing the CFS corpus with BiLSTM-Att, and the semantic features are extracted by processing the initial corpus with ERNIE. Then, the composite features are obtained by fusing the two kinds of features. Finally, the classification results are obtained by processing the composite features with the Softmax classifier. The network structure of CFSE is shown in Figure 8.

  1. Data extraction layer. The purpose of this layer is to construct the CFS corpus corresponding to the initial corpus and to form a compound corpus with the initial corpus, so as to provide the initial data for feature extraction.

  2. Features extraction layer. The function of this layer is to obtain the relationship and semantic features of the data based on the CFS corpus and initial corpus.

  3. Features fusion layer. Through the fusion of relationship features and semantic features, the composite features enhanced by CFS are constructed.

  4. Tag prediction layer. The composite features are processed by a fully connected layer and then input to the Softmax classifier, which outputs the classification result of the data.

Figure 8. The network structure of CFSE.

The algorithm of CFSE is shown in Algorithm 2.

Algorithm description: This algorithm consists of four stages. Steps 1–3 are used to create the database. Step 4 initialises the model. Steps 5–10 form the training loop. Steps 11–13 obtain the data classification result.

The time complexity of establishing the database is O(1), as is that of initialising the model. The time complexity of training is generally O(i·n), where i is a constant determined by the training stop condition. The time complexity of obtaining class labels is O(n). Therefore, the overall time complexity of the algorithm is O(n).
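Putting the pieces together, the following sketch shows one reading of Algorithm 2 as a PyTorch module plus a training step; the module wiring and hyperparameters are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CFSEModel(nn.Module):
    """CFSE sketch: feature extraction, fusion, and tag prediction layers."""
    def __init__(self, bilstm_att, ernie, feat_dim=256 + 768, k=10):
        super().__init__()
        self.bilstm_att, self.ernie = bilstm_att, ernie
        self.fc = nn.Linear(feat_dim, k)

    def forward(self, cfs_ids, ernie_inputs):
        o_c = self.bilstm_att(cfs_ids)                            # relationship features
        o_t = self.ernie(**ernie_inputs).last_hidden_state[:, 0]  # semantic features
        return self.fc(torch.cat([o_c, o_t], dim=-1))             # logits over k tags

# One training step (steps 5-10 of Algorithm 2), assuming a loader yields
# (cfs_ids, ernie_inputs, labels):
#   logits = model(cfs_ids, ernie_inputs)
#   loss = nn.CrossEntropyLoss()(logits, labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```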

6. Experiment and analysis

6.1. Dataset

The data sets used in the experiments are THUCNews, Taobao-PR (Taobao product reviews) and Sougou-news. The information after data cleaning is shown in Table 1.

Table 1. Dataset specification.

THUCNews includes 10 categories: finance, real estate, stocks, education, science and technology, society, current politics, sports, games, and entertainment. Each category contains 20,000 items of data.

Taobao-PR contains two categories, positive and negative, with 30,000 items of data per category.

Sougou-news includes 10 categories: culture, entertainment, sports, finance, car, education, technology, military, travel and world. Each category contains 20,000 items of data.

6.2. Experiment 1: vocabulary analysis

Extract the vocabularies of the experimental corpus in text form and in CFS form, together with the frequency information of the elements in the corresponding corpora. The vocabulary sizes of the two corpora are shown in Figure 9, and the frequency information of their elements in Figure 10.

  1. The CFS form of the data strengthens the correlation among the data.

Figure 9. Vocab size.

Figure 10. Frequency distribution information of elements.

As can be seen from Figure 9, the vocabulary shrinks by about 68.71% when the initial corpus is converted into CFS form. The reduction occurs because character-granularity units with the same character-frequency attribute are merged into a single CFS. The shrinking vocabulary shows that the CFS-form corpus narrows the range of the data distribution and reduces the discrete intervals among data, thus enhancing the hidden association information among short texts.

As can be seen from Figure 10, compared with characters, the frequency distribution curve of the basic units of the CFS corpus is smoother, and the proportion of low-frequency information is effectively reduced. Reducing the proportion of low-frequency information alleviates the sparse features of short texts; the reduction itself results from using character-frequency clustering to enhance the correlation among short texts.

6.3. Experiment 2: validity of relationship features

Experiment 2 verifies the effectiveness of obtaining relationship features from the CFS corpus and compares the results with those obtained from the initial corpus. The results of Experiment 2 are shown in Table 2, Table 3 and Table 4.

  1. The relationship features extracted from the CFS corpus can effectively represent text information. The reasons are as follows:

Table 2. The results of Experiment 2 (THUCNews).

Table 3. The results of Experiment 2 (Taobao-PR).

Table 4. The results of Experiment 2 (Sougou-news).

Based on Table 2, Table 3, and Table 4, the average values of the four evaluation indicators obtained using the CFS corpus across the six test models (covering three kinds of feature extractors) were 87.88%, 87.91%, 87.80%, and 87.80%, respectively. The averages obtained using the initial corpus were 88.38%, 88.26%, 88.28%, and 88.26%, respectively. The gaps between the two on the four indicators are 0.49%, 0.36%, 0.48%, and 0.47%, respectively.

From the experimental results of the same model, the indicator values change, but the magnitude of the change is small; in some experiments of the third group, using the CFS sequence even yields better results. This shows that the CFS sequence, obtained by one-way mapping of the initial sequence using character frequency information, can effectively represent the characteristics of the data.

A deeper analysis shows that only the input data changed during the entire experiment: the initial text sequence was mapped to a CFS sequence using the method described in Section 3, changing the basic granularity of the data sequence from character to CFS. As noted at the end of Section 3.3, when the initial text sequence is mapped to its CFS sequence, the semantic information carried by the data decreases while the relational information increases. When the added relational information outweighs the lost semantic information, the model obtains more effective data features and the experimental indicators rise; conversely, when it is outweighed, the indicators drop. Both situations are directly reflected in the results on the three data sets. This analysis shows that the relational features extracted from the CFS sequence can effectively represent the features of the initial text.

6.4. Experiment 3: ablation experiment

Ablation experiments verify the effectiveness of the relationship features in the model. The experimental results are shown in Table 5.

  1. Model 1: CFSE-ERNIE. Remove the part of the model that extracts semantic features; the combination BiLSTM + Att + Softmax classifies Chinese short texts using the relationship features of the data;

  2. Model 2: CFSE-BiLSTM. Remove the part of the model that extracts relationship features; the combination ERNIE + Softmax classifies Chinese short texts using the semantic features of the data;

  3. Model 3: CFSE. The complete CFSE model; the combination ERNIE + BiLSTM + Att + Softmax classifies Chinese short texts using both the relationship features and the semantic features of the data.

  4. Data features enhanced by relational information can improve the classification accuracy of Chinese short texts.

Table 5. Ablation experiments.

It can be seen from Table 5 that, compared with using either feature alone, the experimental indicators of the CFSE model that fuses the two kinds of feature information increase: accuracy rises by 0.19% on Dataset 2 and by 0.51% on Dataset 1 and Dataset 3. This illustrates the effectiveness of CFSE. At the same time, the gains differ across the three data sets, owing to different data segmentation ratios and classification requirements: Datasets 1 and 3 require classifying news headlines, while Dataset 2 requires sentiment classification of user comments. This also indirectly suggests that fusing the two features is less able to enrich the emotional information contained in the features.

6.5. Experiment 4: comparison experiment

In order to fully verify the rationality of this method, it is compared with existing models. The comparison methods include:

  1. TextCNN: a model for text classification using convolutional neural networks;

  2. TextRNN: a model for text classification using bidirectional long short-term memory networks (BiLSTM);

  3. TextRNN_Att: BiLSTM with an integrated attention mechanism for text classification;

  4. TextRCNN: a model for text classification using BiLSTM and 1-max-pooling;

  5. DPCNN: a model for text classification using a deep pyramid convolutional neural network;

  6. Transformer: the basic Transformer model;

  7. BERT: the Transformer-based pre-trained model released by Google AI.

  8. CFSE is superior to the existing models.

As can be seen from Table 6, Table 7 and Table 8, BERT performs best among the comparison methods, with an average accuracy of 92.24%. The average accuracy of the CFSE method proposed in this paper is 92.91%, an improvement of 0.66% over BERT and the best result among all the models.

Table 6. Comparative experiment (THUCNews).

Table 7. Comparative experiment (Taobao-PR).

Table 8. Comparative experiment (Sougou-news).

The improvement in the indicators of the CFSE method comes from its dedicated treatment of the relationship characteristics of the data, which enhances the model's ability to represent the data.

The purpose of this paper is to mine hidden relationship information among data based on CFS, and then improve the feature integrity of the data. The above four experiments indicate that the proposed CFSE can improve classification accuracy and achieve better classification results.

7. Conclusions

This paper proposes CFSE to improve the classification accuracy of Chinese short text. CFSE reduces the negative impact of feature sparsity by mining hidden relationship features. Firstly, a relationship feature based on CFS is proposed from the perspective of data relationships; it is extracted by means of the fuzzy granularity information called CFS defined in this paper, whose function is to enhance the correlation information among data and assist in extracting the relationship characteristics of the data. Secondly, the relationship features are fused with the general semantic features, and the resulting composite features serve as the final feature information of the short text for classification prediction. This method enriches the data features by mining the correlation information among the data, reduces the negative impact of sparse features, and effectively improves the classification accuracy of Chinese short texts, thereby providing more effective technical support for task scenarios such as information retrieval and public opinion analysis.

The analysis of CFS corpus shows that, within the limited data range, there is a high degree of feature similarity among some character granularity information, which means that there is redundant information in the data. How to effectively resolve redundant data is a problem to be further studied.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by Graduate Innovation Fund project of Anhui University of Science and Technology [grant number 2022CX2127]; National Natural Science Foundation of China [grant number 62076006]; Anhui Province University Natural Science Research Project [grant number 2023AH050846]; The Opening Foundation of State Key Laboratory of Cognitive Intelligence [grant number COGOS-2023HE02], and by the University Synergy Innovation Program of Anhui Province [grant number GXXT-2021-008].

Notes

1 Character in this paper refers to a single Chinese character.

References

  • Basabain, S., Cambria, E., Alomar, K., & Hussain, A. (2023). Enhancing Arabic-text feature extraction utilizing label-semantic augmentation in few/zero-shot learning. Expert Systems, 40(8), e13329. https://doi.org/10.1111/exsy.13329
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees (CART). Biometrics, 40(3), 358. https://doi.org/10.2307/2530946
  • Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. https://doi.org/10.1109/TIT.1967.1053964
  • Dai, Y., Shou, L., Gong, M., Xia, X., Kang, Z., Xu, Z., & Jiang, D. (2022). Graph fusion network for text classification. Knowledge-Based Systems, 236, 107659. https://doi.org/10.1016/j.knosys.2021.107659
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Feng, Y., Qu, B., Xu, H., & Wang, R. (2019). Chinese FastText short text classification method integrating TF-IDF and LDA. Journal of Applied Sciences, 37(3), 378. https://doi.org/10.3969/j.issn.0255-8297.2019.03.008
  • Fu, W., Yang, D., Ma, H., & Wu, D. (2022). Short text classification method based on BTM and BERT. Computer Engineering and Design, 43(12), 3421–3427. https://doi.org/10.16208/j.issn1000-7024.2022.12.016
  • Jian, L. X. S. Q. S. (2023). Automatic classification of product review texts combining short text extension and BERT. Journal of Information Resources Management, 13(1), 129. https://doi.org/10.13365/j.jirm.2023.01.129
  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.), Machine learning: ECML-98 (pp. 137–142). Springer. https://doi.org/10.1007/BFb0026683
  • Johnson, R., & Zhang, T. (2017). Deep pyramid convolutional neural networks for text categorization. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 562–570. https://doi.org/10.18653/v1/P17-1052
  • Kim, Y. (2014). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751. https://doi.org/10.3115/v1/D14-1181
  • Lai, S., Xu, L., Liu, K., & Zhao, J. (2015). Recurrent convolutional neural networks for text classification. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2267–2273. https://doi.org/10.1109/IJCNN.2019.8852406
  • Li, B.-H., Xiang, Y.-X., Feng, D., He, Z.-C., Wu, J.-J., Dai, T.-L., & Li, J. (2022). Short text classification model combining knowledge aware and dual attention. Journal of Software, 33(10), 3565–3581.
  • Li, F.-F., Su, P.-Z., Duan, J.-W., Zhang, S.-C., & Mao, X.-L. (2023). Multi-label text classification with enhancing multi-granularity information relations. Journal of Software, 1–18. https://doi.org/10.13328/j.cnki.jos.006802
  • Li, S., Deng, M., Shao, Z., Chen, X., & Zheng, Y. (2023). Automatic classification of interactive texts in online collaborative discussion based on multi-feature fusion. Computers and Electrical Engineering, 107, 108648. https://doi.org/10.1016/j.compeleceng.2023.108648
  • Li, X., Zhu, G., Zhang, S., & Wei, Z. (2023). RCRE: Radical-aware causal relationship extraction model oriented in the medical field. International Journal of Computational Science and Engineering. https://doi.org/10.1504/IJCSE.2023.10054227
  • Liu, P., Qiu, X., & Huang, X. (2016). Recurrent neural network for text classification with multi-task learning. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2873–2879. https://doi.org/10.48550/arXiv.1605.05101
  • McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI Conference on Artificial Intelligence. https://api.semanticscholar.org/CorpusID:7311285
  • Meng, J. S. C., Shan, H., Huang, R., Yan, F., Li, Z., Zheng, G., & Liu, Y. (2023). Text classification model based on dual-channel feature fusion based on XLNet. Journal of Shandong University (Natural Science), 58(5), 36. https://doi.org/10.6040/j.issn.1671-9352.0.2021.790
  • Onan, A. (2022). Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. Journal of King Saud University - Computer and Information Sciences, 34(5), 2098–2117. https://doi.org/10.1016/j.jksuci.2022.02.025
  • Rohidin, D., Samsudin, N. A., & Deris, M. M. (2022). Association rules of fuzzy soft set based classification for text classification problem. Journal of King Saud University - Computer and Information Sciences, 34(3), 801–812. https://doi.org/10.1016/j.jksuci.2020.03.014
  • Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523. https://doi.org/10.1016/0306-4573(88)90021-0
  • Shunxiang, Z., Aoqiang, Z., Guangli, Z., Zhongliang, W., & KuanChing, L. (2023). Building fake review detection model based on sentiment intensity and PU learning. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/tnnls.2023.3234427
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. https://doi.org/10.48550/arXiv.1706.03762
  • Wang, C., Jiang, H., Chen, T., Liu, J., Wang, M., Jiang, S., Li, Z., & Xiao, Y. (2022). Entity understanding with hierarchical graph learning for enhanced text classification. Knowledge-Based Systems, 244, 108576. https://doi.org/10.1016/j.knosys.2022.108576
  • Wu, F. L., Gou, J., & Wang, C. (2013). Review of Chinese short text classification. Applied Mechanics and Materials, 336–338, 2171–2174. https://doi.org/10.4028/www.scientific.net/AMM.336-338.2171
  • Yan, C., Liu, J., Liu, W., & Liu, X. (2022). Research on public opinion sentiment classification based on attention parallel dual-channel deep learning hybrid model. Engineering Applications of Artificial Intelligence, 116, 105448. https://doi.org/10.1016/j.engappai.2022.105448
  • Zhang, S., Hu, Z., Zhu, G., Jin, M., & Li, K.-C. (2021). Sentiment classification model for Chinese micro-blog comments based on key sentences extraction. Soft Computing, 25(1), 463–476. https://doi.org/10.1007/s00500-020-05160-8
  • Zhang, S., Wang, Y., Zhang, S., & Zhu, G. (2016). Building associated semantic representation model for the ultra-short microblog text jumping in big data. Cluster Computing, 19(3), 1399–1410. https://doi.org/10.1007/s10586-016-0602-9
  • Zhang, S., Yu, H., & Zhu, G. (2022). An emotional classification method of Chinese short comment text based on ELECTRA. Connection Science, 34(1), 254–273. https://doi.org/10.1080/09540091.2021.1985968
  • Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language representation with informative entities. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1441–1451. https://doi.org/10.18653/v1/P19-1139
  • Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., & Xu, B. (2016). Attention-based bidirectional long short-term memory networks for relation classification. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 207–212. https://doi.org/10.18653/v1/P16-2034
  • Zhou, Y., Li, J., Chi, J., Tang, W., & Zheng, Y. (2022). Set-CNN: A text convolutional neural network based on semantic extension for short text classification. Knowledge-Based Systems, 257, 109948. https://doi.org/10.1016/j.knosys.2022.109948