Research Article

An enhanced text classification model by the inverted attention orthogonal projection module

Article: 2173145 | Received 03 Oct 2022, Accepted 19 Jan 2023, Published online: 11 Feb 2023

ABSTRACT

The orthogonal projection method has made significant progress in text classification, especially in generating discriminative features. This method obtains purer, more classification-relevant features by projecting text features onto the direction orthogonal to common features (features that do not help classification and can even degrade performance). However, the approach requires an additional branch network to generate these common features, which makes it less flexible than representation optimisation methods such as the self-attention mechanism, because the base network structure must be modified substantially before it can be used. To address this issue, this paper proposes the Inverted Attention Orthogonal Projection Module (IAOPM). IAOPM uses inverted attention (IA) to iteratively reverse the attention map on text features, encouraging the network to remove discriminative features from the text features and obtain the latent common features. Unlike the original orthogonal projection method, IAOPM extracts common features within a single network, without any branch networks, which increases the flexibility of the orthogonal projection method. We also use an orthogonal loss to ensure the quality of the common features during training, so IAOPM also purifies features more effectively than the original method. Experiments show that text classification models based on IAOPM outperform the baseline models, the self-attention mechanism, and the original orthogonal projection method on multiple text classification datasets, with average accuracy increases of 1.02%, 0.44%, and 0.52%, respectively.

1. Introduction

Text classification is an important task in natural language processing, covering problems such as sentiment classification and question classification. For text classification, deep learning models have been shown to outperform traditional classification methods. Many neural networks and embedding techniques have been devised and applied, such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) network, and Gated Recurrent Unit (GRU). However, these neural network methods cannot take full advantage of discriminative features and are also disturbed by common features (or invariant features (Ganin & Lempitsky, Citation2015; Zhang et al., Citation2019)), which have no class tendency and therefore confuse rather than help classification. The attention mechanism alleviates this problem to some extent by giving higher weights to discriminative features and lower weights to common features. However, due to idiosyncrasies of the data and the inaccuracy of the attention mechanism (Qin et al., Citation2020), these low-weight common features still interfere with representation learning and ultimately reduce text classification accuracy.

To address this issue, Qin et al. (Citation2020) proposed the use of an orthogonal projection to purify text features. By removing common features from the text, the model focuses more on the discriminative features, thereby improving the accuracy of text classification. Specifically, this method projects the text features onto the direction orthogonal to the common features; the resulting features are more discriminative for classification because they are perpendicular to the common features. As shown in Figure 1, assuming the text is “This movie is very good”, common features such as “This”, “movie”, “is”, and “very” can be eliminated through orthogonal projection, leaving only the discriminative feature “good”. The network is thus unaffected by common features. However, the purity achieved by this method depends on the quality of the common features, so an additional branch network is required to extract them separately and ensure their quality. Clearly, the additional network increases the cost of using this method and makes it less convenient than other representation optimisation methods, such as attention mechanisms, which act as plug-in modules, so its advantages over them are not obvious. If high-quality common features could be extracted within a single network, the flexibility of this method would increase, making it more suitable for wider adoption and development. Therefore, this method still has room for improvement.

Figure 1. Orthogonal projection purifies text features.


In this study, we designed a simple module, called the Inverted Attention Orthogonal Projection Module (IAOPM), to extract high-quality common features within a single network and thereby increase the flexibility of the orthogonal projection method. Specifically, common features without class bias are typically assigned low weights in the attention distribution map of the text features. IAOPM iteratively reverses the attention distribution map of the text features through inverted attention (IA) to remove the discriminative features, inducing the network to generate the complementary latent common features, and then purifies the text features through orthogonal projection. Finally, we propose a self-supervised loss function called the Orthogonal Projection Loss (OPL). It ensures the shared nature of the extracted common features within one or several batches, thereby guaranteeing the quality of the common features extracted by IA. Additionally, OPL stabilises the training process and aids generalisation. Compared with the original orthogonal projection method (Qin et al., Citation2020), IAOPM is more flexible because all operations are performed within a single network and no branch networks are added. IAOPM can simply be appended to a feature extractor (such as an RNN, LSTM, or CNN) to purify text features. In addition, through experiments and visualisation, we demonstrate that IAOPM extracts higher-quality common features and therefore purifies features more effectively than the original method.

We summarise the main contributions of this work as follows:

  1. We propose using IA to generate common features, which runs only on the feature extraction network (also called the backbone network) and does not require any additional branch networks.

  2. We propose the use of the Orthogonal Projection Loss (OPL) to ensure the quality of the common features extracted by IA. OPL is used only as a form of regularisation and does not add any network parameters.

  3. Experimental results show that on benchmark text classification datasets (CR, MR, SST2, and Subj), IAOPM improves accuracy by 0.44% and 0.52% compared to self-attention mechanisms (Vaswani et al., Citation2017) and FP-Net (Qin et al., Citation2020).

2. Related work

Text classification is broadly divided into two main categories, supervised and unsupervised methods. Our work focuses on improving the orthogonal projection method under supervised classification. Therefore, we primarily discuss supervised methods while briefly introducing common features.

2.1. Supervised methods

Supervised methods obtain training signals from labelled data, which improves data utilisation efficiency and helps extract discriminative features. The most commonly used of these methods are deep neural networks; for example, CNNs use windows of different sizes to capture the key local features that are important for classification (Jing, Citation2019; Kim, Citation2014; Zhou et al., Citation2022). Unlike CNNs, sequence models such as the RNN, LSTM, and GRU process text sequentially, sharing recurrent units to learn a feature representation of the text that is then used for classification. A large number of text classification studies use the RNN and its variants for feature extraction and downstream tasks (Huan et al., Citation2022; Jing, Citation2019; Sivakumar & Rajalakshmi, Citation2021). These neural network methods generally rely on the final hidden state of a network (such as the final hidden state of an RNN or CNN) to create the final instance-level representation. However, they are unable to sift through the plethora of information for the discriminative features that are helpful for classification. The attention mechanism (Mnih et al., Citation2014) has been applied to text classification to resolve this problem. By quickly filtering high-value information out of a large amount of information, the attention mechanism guides the model to focus on discriminative features. For example, Hu and Zhao (Citation2021) proposed a Bi-directional Gated Recurrent Unit (Bi-GRU) model based on pooling and attention, which utilises self-attention to capture the influence of words and sentences for text classification. Ruan et al. (Citation2022) proposed an ATT-CN-BILSTM Chinese news classification model based on the attention mechanism; the model uses attention to improve the feature extraction of the CNN and BiLSTM. In addition, the Transformer (Vaswani et al., Citation2017) relies solely on self-attention mechanisms to compute text feature representations and has achieved excellent results in multiple natural language processing domains, including text classification. BERT (Devlin et al., Citation2018), Transformer-XL (Dai et al., Citation2019), and XLNet (Yang et al., Citation2019) combine the Transformer with large corpora to further improve text classification accuracy (Lv et al., Citation2022; Shaheen et al., Citation2020; Wang & Zhang, Citation2022). However, these attention-based text classification methods cannot completely eliminate the influence of common features on classification, because these features are merely given low weights rather than being removed.

To eliminate the influence of common features, Qin et al. (Citation2020) first applied orthogonal projection to filter common features in text, improving text classification based on CNN, RNN, Transformer, and BERT models and achieving good classification results. Inspired by this, more and more researchers have applied orthogonal projection to text classification. Liu et al. (Citation2022) applied orthogonal projection to sentiment analysis and proposed the Aspect Feature Distillation and Enhancement Network (AFDEN). Wei et al. (Citation2022) proposed GP-GCN for aspect-level sentiment analysis; through orthogonal projection, the GCN not only weakens the dependency between graph nodes during the update process but also reduces the dependency between nodes and the corpus. However, these orthogonal projection methods all use extra branch networks and thus still have room for improvement in terms of flexibility.

2.2. Common features

The concept of common features comes from domain adaptation, where it refers to features shared between the source and target domains. Blitzer et al. (Citation2006) proposed a method called Structural Correspondence Learning (SCL), which mainly exploits multiple shared features for prediction tasks.

Afterwards, Pan et al. (Citation2010) proposed the Spectral Feature Alignment (SFA) algorithm, which solves the feature mismatch problem by aligning domain-specific words with the help of domain-independent common features. However, all of the above methods require manual selection of common features. In recent years, many researchers have studied neural-network-based methods that automatically identify common features, improving the efficiency of common feature acquisition. Yu and Jiang (Citation2016) proposed two auxiliary tasks to learn sentence embeddings for a CNN. Li et al. (Citation2017) proposed an Adversarial Memory Network (AMN), which automatically identifies common features by applying attention mechanisms and adversarial training. Zhang et al. (Citation2019) proposed an Interactive Attention Transfer Network (ATN) for cross-domain sentiment classification, which extracts common features through an independent branch network.

Inspired by these studies, Qin et al. (Citation2020) used common features and orthogonal projection to improve the accuracy of text classification, and proposed the Feature Purification Network (FP-Net). Subsequently, researchers began using common features to purify text features in order to improve text classification performance (Liu et al., Citation2022; Wei et al., Citation2022).

Finally, our work is related to several other studies. In object detection, Huang et al. (Citation2020) used an efficient and fine-grained mechanism, called Inverted Attention (IA), to improve object detectors. By iteratively inverting the attention weights of features, a detector equipped with IA can discover new discriminative clues and pay more attention to complementary objects, feature channels, and even context. This ability to expand the network's view allows IA to discover the hidden common features in texts, and no branch network is added because IA operates only on the text features.

Ranasinghe et al. (Citation2021) developed a novel loss function, termed the Orthogonal Projection Loss (OPL). It imposes orthogonality in the feature space and directly enforces inter-class separation alongside intra-class clustering at the mini-batch level through orthogonality constraints. Inspired by Ranasinghe et al. (Citation2021), we design an OPL in IAOPM to resolve the problem that the common features generated by IA are too similar to the text features. Unlike Ranasinghe et al. (Citation2021), we are not concerned with whether the features of different classes are orthogonal. Thus, our work is significantly different.

3. Inverted attention orthogonal projection module

The IAOPM (Figure 2, right) proposed in this work can extract common features within a single network, thus increasing the flexibility of the original orthogonal projection method (FP-Net; Figure 2, left) to some extent. The IAOPM consists of three parts: common feature extraction based on IA, orthogonal projection, and the orthogonal projection loss. First, the attention map of the text features is iteratively reversed using the IA module, which removes the discriminative features and thereby induces the network to generate the complementary latent common features. Then, the text features are projected onto the direction orthogonal to these common features using the orthogonal projection module, resulting in purer discriminative features. Finally, the orthogonal projection loss is used to ensure the quality of the common features, improving the feature purification performance. Next, we introduce each component of the proposed IAOPM in detail.

Figure 2. Structure of FP-Net and structure of IAOPM.


3.1. Common feature extraction based on IA

The goal of IAOPM is to extract common features within a single network. Its core is the use of IA to extract common features, as shown in Figure 3. The method consists of only two simple steps: (1) attention generation and (2) inverted attention to generate common features. First, IA generates the attention distribution map from the attention scores of the text features. Then, unlike traditional attention mechanisms, IA removes the high-weight discriminative features by iteratively reversing the attention map, thereby guiding the network to obtain the complementary low-weight common features. Through IA, we can thus extract common features within a single network.

Figure 3. IA generated common features, including Attention Generation and Inverted Attention to Generate Common Features.


3.1.1. Attention generation

Given a dataset $D=\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a text of length $L$ and $y_i$ is the corresponding label for sample $x_i$, the dataset $D$ is transformed into text features $f_t$ by the feature extractor $F$. In order to obtain the common features corresponding to the text features $f_t$, it is first necessary to calculate the attention map $M$ corresponding to $f_t$. For simplicity, we use a parameter matrix $W_Q$ to generate the attention map, as shown in Equation (1):
(1) $M = \mathrm{softmax}(W_Q f_t)$
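As a concrete illustration, the following is a minimal PyTorch sketch of Equation (1); the module and variable names, the shape convention (batch, sequence length, dimension), and the choice of a single linear scoring layer are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGeneration(nn.Module):
    """Sketch of Equation (1): M = softmax(W_Q f_t)."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, 1, bias=False)  # parameter matrix W_Q

    def forward(self, f_t):
        # f_t: (batch, seq_len, dim) text features from the feature extractor F
        scores = self.w_q(f_t).squeeze(-1)        # (batch, seq_len) attention scores
        return F.softmax(scores, dim=-1)          # attention map M over positions
```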

3.1.2. Inverted attention to generate common features

Traditional attention mechanisms assign higher weights to the discriminative features in the text features and lower weights to features that are not conducive to classification. These low-weight features often belong to the common features because they do not point to any specific category. Therefore, we use the inverted attention (IA) method to iteratively reverse the original attention map $M$ into an inverted attention map, forcing the network to focus on the low-weight common features. Specifically, we identify discriminative features and common features through the inverted attention map $A=\{a_i\}$, removing the high-weight discriminative features and retaining the low-weight common features. The specific calculation is:
(2) $a_i = \begin{cases} 0 & \text{if } m_i > T \\ m_i & \text{otherwise} \end{cases}$
(3) $f_g = (A f_t) W_V$
where $a_i$ and $m_i$ are the inverted attention score and attention score of the $i$th position, respectively, and $T$ is a hyperparameter representing the inverted attention threshold. Finally, the text features $f_t$ are multiplied by the inverted attention map $A$ and the weight $W_V$ to obtain the final common features $f_g$.

As can be seen from Equation (2), the parts of the attention distribution map with weights greater than $T$ are removed, and the parts with weights less than $T$ are retained, as shown in Figure 4 (assuming $T=7$). Then, the generated inverted attention score matrix $A=\{a_i\}$ is multiplied by the text features $f_t$ and mapped to generate the common features. These operations contain only two parameter matrices; the rest are mathematical operations. Therefore, the method does not rely on any external branch network and has a simple structure.
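A minimal PyTorch sketch of Equations (2) and (3) is given below, reusing the attention map from the previous sketch; thresholding each sample's map against a fixed T and implementing W_V as a linear layer are assumptions, and the names are illustrative.

```python
import torch
import torch.nn as nn

class InvertedAttention(nn.Module):
    """Sketch of Equations (2)-(3): zero out high-weight (discriminative)
    positions and map the remaining low-weight positions to common features."""

    def __init__(self, dim, threshold):
        super().__init__()
        self.threshold = threshold       # inverted attention threshold T
        self.w_v = nn.Linear(dim, dim)   # parameter matrix W_V

    def forward(self, f_t, attn_map):
        # f_t: (batch, seq_len, dim); attn_map M: (batch, seq_len)
        inverted = torch.where(attn_map > self.threshold,
                               torch.zeros_like(attn_map),
                               attn_map)              # Equation (2): a_i
        f_g = self.w_v(inverted.unsqueeze(-1) * f_t)  # Equation (3): f_g = (A f_t) W_V
        return f_g                                    # common features
```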

Figure 4. IA obtained low-weight information by resetting weight of high-weight features to 0.


3.2. Orthogonal projection

The orthogonal projection method removes irrelevant features from the text features by projecting them onto the direction orthogonal to the common features; the result should contain only the discriminative features that are useful for classification. The specific calculation is:
(4) $f_{t,proj} = \dfrac{f_t \cdot f_g}{|f_g|} \dfrac{f_g}{|f_g|}$
(5) $f_{t,orth} = f_t - f_{t,proj}$
where $f_t \cdot f_g$ is the dot product and $|f_g|$ is the L2 norm of $f_g$.

To be more intuitive, the process is represented in 2D as shown in Figure 5. Specifically, the text feature vector $f_t$ consists of $f_{t,proj}$ and $f_{t,orth}$. Since $f_{t,proj}$ is a component along the common feature $f_g$, $f_{t,proj}$ still consists of common features. It can then be seen that $f_g f_{t,orth}^{T} = f_{t,proj} f_{t,orth}^{T} = 0$, so $f_{t,orth}$ is independent of $f_{t,proj}$ and $f_g$. Therefore, $f_{t,orth}$ contains only discriminative features, which can be called pure features. Finally, $f_{t,orth}$ is used for classification, as expressed in Equation (6):
(6) $Y_{IAOM} = \mathrm{softmax}(f_{t,orth} W_{IAOM} + b_{IAOM})$
where $W_{IAOM}$ and $b_{IAOM}$ are, respectively, the weights and biases of the classifier, and $Y_{IAOM}$ is the predicted result.
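The following is a minimal PyTorch sketch of Equations (4)-(6), under the assumption that $f_t$ and $f_g$ have already been pooled into instance-level vectors of shape (batch, dim); the function and class names and the eps constant are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def orthogonal_projection(f_t, f_g, eps=1e-8):
    """Equations (4)-(5): project f_t onto f_g, then keep the orthogonal part."""
    # f_t, f_g: (batch, dim) instance-level text and common features
    scale = (f_t * f_g).sum(dim=-1, keepdim=True) / (f_g.norm(dim=-1, keepdim=True) ** 2 + eps)
    f_t_proj = scale * f_g      # component of f_t lying along f_g (still "common")
    f_t_orth = f_t - f_t_proj   # purified, discriminative component
    return f_t_orth

class Classifier(nn.Module):
    """Equation (6): softmax classifier over the purified features."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)  # W_IAOM, b_IAOM

    def forward(self, f_t_orth):
        return F.softmax(self.fc(f_t_orth), dim=-1)  # predicted distribution Y_IAOM
```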

Figure 5. Orthogonal projection in 2D space.


3.3. Orthogonal projection loss

In general, the common features shared by all categories should be concentrated in the feature space, as shown in Figure 6(a), where blue and yellow represent the positive and negative classes, respectively, and the corresponding common features are shown in green and red. However, the inverted attention (IA) method generates the common features $f_g$ from the text features $f_t$, making $f_g$ and $f_t$ similar. As a result, $f_g$ loses its shared character to some extent and cannot represent the common features of the samples, as shown in Figure 6(b). Such low-quality common features may not be pure enough and may still contain a lot of discriminative information, which affects the effectiveness of the subsequent orthogonal projection.

Figure 6. (a) The distribution of common features in 2D space under normal circumstances. (b) The distribution of common features generated by IA in 2D space.


To address the issue of the common features $f_g$ losing their shared character due to the IA method, we use the Orthogonal Projection Loss (OPL). OPL is based on the orthogonal loss proposed by Ranasinghe et al. (Citation2021) and is used to separate $f_g$ from $f_t$ and to guide the network to aggregate $f_g$. In other words, OPL is a constraint on the common features $f_g$ that ensures $f_g$ does not lose its shared character. OPL is calculated as follows:
(7) $s = \dfrac{\sum_{i \in B} \sum_{j \in B,\, j \neq i} \langle f_g^i, f_g^j \rangle}{\sum_{i \in B} \sum_{j \in B,\, j \neq i} 1}$
(8) $d = \dfrac{\sum_{i \in B} \sum_{j \in B,\, j \neq i} \langle f_g^i, f_t^j \rangle}{\sum_{i \in B} \sum_{j \in B,\, j \neq i} 1}$
where $\langle \cdot,\cdot \rangle$ is the cosine similarity operator applied to two vectors, $|\cdot|$ is the absolute value operator, and $B$ denotes the mini-batch. The cosine similarity operator (Ranasinghe et al., Citation2021) used in Equations (7) and (8) is:
(9) $\langle x, y \rangle = \dfrac{x \cdot y}{\|x\|_2 \|y\|_2}$
where $\|\cdot\|_2$ is the L2 norm. In Equation (7), $s \in (-1, 1)$ is the average cosine similarity between all pairs of $f_g$ in batch $B$ (excluding each feature with itself). In Equation (8), $d \in (-1, 1)$ is the average cosine similarity between $f_g$ and $f_t$ in batch $B$ (excluding pairs of $f_g$ and $f_t$ with the same index).

To ensure the shared characteristics of the common features, the average similarity between common features should be maximised, in which case $s = 1$; similarly, to separate the common features from the text features, the average similarity between the common features and the text features should be minimised, in which case $d = 0$. This calculation is shown in Equation (10):
(10) $L_{OPL} = (1 - s) + |d|$
The final loss function is a weighted combination of the cross-entropy (CE) loss and OPL:
(11) $L = L_{CE}(Y_{truth}, Y_{IAOM}) + \lambda L_{OPL}(f_g, f_t)$
where $L_{CE}$ is the cross-entropy loss and $\lambda$ is the OPL weight hyperparameter.
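The sketch below shows one way Equations (7)-(11) could be computed in PyTorch over a mini-batch of pooled features; the masking of same-index pairs, the eps constant, and the example weight lam are assumptions, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def opl_loss(f_g, f_t, eps=1e-8):
    """Sketch of Equations (7)-(10): push common features to cluster (s -> 1)
    while keeping them dissimilar to the text features (d -> 0)."""
    # f_g, f_t: (batch, dim) common and text features for the current batch
    b = f_g.size(0)
    g = F.normalize(f_g, dim=-1, eps=eps)
    t = F.normalize(f_t, dim=-1, eps=eps)
    off_diag = ~torch.eye(b, dtype=torch.bool, device=f_g.device)  # exclude same-index pairs
    s = (g @ g.T)[off_diag].mean()   # Equation (7): mean <f_g^i, f_g^j>, i != j
    d = (g @ t.T)[off_diag].mean()   # Equation (8): mean <f_g^i, f_t^j>, i != j
    return (1.0 - s) + d.abs()       # Equation (10)

def total_loss(logits, labels, f_g, f_t, lam=0.1):
    """Equation (11): cross-entropy plus the weighted OPL term (lam = lambda)."""
    # F.cross_entropy expects unnormalised class scores (pre-softmax logits)
    return F.cross_entropy(logits, labels) + lam * opl_loss(f_g, f_t)
```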

Figure 7(a) shows the 2D distributions of the text features and common features for CNN (common features are generated but the text features are not purified), IAO (only the IA and orthogonal projection operations are included), and IAOPM. For IAOPM, it can be seen that OPL forces the network to separate $f_g$ (red) and $f_t$ (blue). Figure 7(b) shows the 2D spatial distributions of the common features generated by IAO, IAOPM, and FP-Net, with red for the positive class and blue for the negative class. Compared with IAO and FP-Net, the common features of IAOPM are clearly more aggregated, which shows that OPL's optimisation of the common features is effective.

Figure 7. (a): Visualisation of the common features and text features of CNN, IAO, and IAOPM in 2D space. (b): Visualisation of the common features (different classes) of IAO, IAOPM, and FP-Net in 2D space.


The complete algorithm of the proposed IAOPM is given in Algorithm 1, which is self-explanatory.
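Algorithm 1 itself is not reproduced in this excerpt, but, reusing the sketch classes and functions defined above and under the same assumptions (including an illustrative mean-pooling step to obtain instance-level vectors), the forward pass of an IAOPM-augmented classifier might look roughly as follows.

```python
import torch
import torch.nn as nn

class IAOPMClassifier(nn.Module):
    """Illustrative end-to-end sketch: feature extractor -> IA -> orthogonal
    projection -> classifier, with OPL applied to (f_g, f_t) during training."""

    def __init__(self, extractor, dim, num_classes, threshold):
        super().__init__()
        self.extractor = extractor                      # e.g. a CNN/RNN/LSTM encoder
        self.attn = AttentionGeneration(dim)            # Equation (1)
        self.ia = InvertedAttention(dim, threshold)     # Equations (2)-(3)
        self.classifier = Classifier(dim, num_classes)  # Equation (6)

    def forward(self, x):
        f_t = self.extractor(x)              # (batch, seq_len, dim) text features
        m = self.attn(f_t)                   # attention map M
        f_g = self.ia(f_t, m)                # latent common features
        f_t_vec = f_t.mean(dim=1)            # assumed pooling to instance level
        f_g_vec = f_g.mean(dim=1)
        f_t_orth = orthogonal_projection(f_t_vec, f_g_vec)  # Equations (4)-(5)
        y = self.classifier(f_t_orth)        # Equation (6)
        return y, f_g_vec, f_t_vec           # the extra outputs feed the OPL term
```

During training, the three returned tensors would be passed to the total_loss sketch above to obtain Equation (11).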

4. Experiments

4.1. Experiment details

4.1.1. Dataset

We carried out experiments with the following four diverse benchmark datasets:

  • MR (Pang & Lee, Citation2005): This is a sentiment classification dataset, which is a collection of people's evaluations of movies, where each sample is divided into positive and negative.

  • SST2 (Liu et al., Citation2016): This is the Stanford Sentiment Treebank dataset, where each sample is divided into positive and negative.

  • CR (Hu & Liu, Citation2004): This is a sentiment classification dataset, which is a collection of customers’ reviews, where each sample is divided into positive and negative.

  • Subj (Pang & Lee, Citation2004): This is a subjectivity classification dataset, where each sample is labelled as objective or subjective.

4.1.2. Baselines

We use feature extractors common in NLP as the baseline models:

  • RNN (Jordan, Citation1997): Recurrent Neural Network (RNN), where the current output of a sequence is related to the previous output.

  • LSTM (Hochreiter & Schmidhuber, Citation1997): Long Short-Term Memory (LSTM) network, which addresses the vanishing gradient problem of the traditional RNN.

  • GRU (Cho et al., Citation2014): Gated Recurrent Unit (GRU), which can be considered as a variant of LSTM, but it is easier to implement and compute.

  • CNN (Kim, Citation2014): Multiple parallel convolution kernels are used to extract local text features.

  • BERT (Devlin et al., Citation2018): We use bert-base-cased and fine-tune the pre-trained BERT.

  • XLNet (Yang et al., Citation2019): We use xlnet-base-cased and fine-tune the pre-trained XLNet.

We compare IAOPM with the self-attention mechanism and FP-Net based on the above baseline models.

4.1.3. Experimental parameters

In our experiments, all word embeddings are 300-dimensional vectors trained with Word2vec (except for BERT and XLNet). The specific settings for each feature extractor are shown in Table 1, where the CNN uses window sizes of [2, 3, 4]. We keep these parameters consistent across all experiments. All other adjustable parameters were set to the optimal values determined through multiple trials.

Table 1. Parameter settings of feature extractors.

4.2. Experiment results

First, we evaluate the proposed method on four widely used benchmark datasets and compare it with the self-attention mechanism and FP-Net. Our goal is to verify whether the proposed method is general and effective across different baseline models and datasets. Note that all networks share the same basic structure (from input to the FC layer), except for the IAOPM, self-attention, and FP-Net components. The experimental results are shown in Table 2, where X + A means adding a self-attention mechanism to feature extractor X, X + FP means that FP-Net uses X as the feature extractor, and X + IAOPM means adding an IAOPM to feature extractor X. We measure the performance of the text classification models by their highest accuracy. From Table 2, we can draw the following conclusions.

Table 2. Results of IAOPM, Self-Attention and FP-Net for four benchmark datasets.

In our experiments, both IAOPM and FP-Net consistently enhanced the performance of various baseline feature extractors, including RNN, LSTM, GRU, Bert, and XLNet. For instance, using FP-Net on RNN-MR resulted in a 1.97% improvement, while IAOPM yielded a 3.09% improvement. The only exception was when using IAOPM with XLNet-Subj, where it performed slightly worse than the self-attention mechanism. This demonstrates the effectiveness of the orthogonal projection in purifying text features.

In most experiments, IAOPM outperformed the self-attention method in improving the baseline. The only exceptions were the LSTM-CR, CNN-CR, and Bert-Subj experiments, where IAOPM performed slightly worse than or equal to the self-attention mechanism. This demonstrates that IAOPM can effectively eliminate the influence of common features on representation learning by removing these features from the text, allowing the model to discover more discriminative features for the object classes. Simply assigning lower weights to common features does not fully mitigate their impact on the final classification, which is why IAOPM offers greater improvements than the self-attention mechanism.

Table 3 shows the time overhead of the self-attention mechanism, FP-Net, and IAOPM on the CNN model. As shown in Table 3, the time overhead of the self-attention mechanism, FP-Net, and IAOPM increased by an average of 12.16%, 77.395%, and 69.93%, respectively, compared to the baseline CNN on the four datasets. Although FP-Net and IAOPM incur a larger increase in time overhead than the self-attention mechanism, they also achieve higher accuracy. In addition, IAOPM has a lower time overhead increase than FP-Net while also achieving higher accuracy, indicating the better performance of IAOPM.

Table 3. Time overhead table.

In summary, IAOPM has higher accuracy and lower time overhead compared to FP-Net. This demonstrates that IAOPM is able to generate more pure features, thereby improving text classification accuracy. For practical applications, the modular structure of IAOPM applied within a single network is more flexible than the dual stream network structure of FP-Net, as it does not require the construction of an additional independent branch structure and can be rapidly applied as a plug-in module to a base network without significant changes to the overall structure.

4.3. Visual analysis

In the orthogonal projection method, the quality of the common features is crucial, as it affects the effectiveness of feature purification. If the quality of the common features is high, discriminative features can be extracted from the original data more effectively, improving the purification performance. To compare IAOPM and FP-Net more intuitively, we visualise the common features extracted by both methods and analyse whether they cluster in space. This allows us to evaluate the quality of the common features and determine which method extracts common features of higher quality.

First, we visualise the common features generated by IAOPM and FP-Net in two-dimensional space (Figure 8, left half; red indicates positive and blue indicates negative). On the two datasets, the common features generated by IAOPM are more clustered, owing to the continual constraint placed on the common features by OPL during training. However, IAOPM generates two clusters on MR, because OPL typically only affects one or a few batches and cannot control the overall clustering of the common features (Algorithm 1, line 13). Therefore, if the model divides the features into multiple parts during training, the common features generated by IAOPM will correspondingly form multiple clusters.

Figure 8. The visualisation of common features in 2D and polar coordinate space for IAOPM and FP-Net.


Then, we also visualise the common features generated by IAOPM and FP-Net in polar coordinate space (Figure 8, right half; red indicates positive and blue indicates negative) and in three-dimensional space (Figure 9, where red represents positive and blue represents negative) to check the generalizability of the results. In polar coordinate space, the degree of clustering of the common features is reflected in the distances between the points. It can be observed that in polar coordinates the common features generated by IAOPM are clearly more clustered, almost converging to one point, while the common features generated by FP-Net are more dispersed, with greater distances between them. In three-dimensional space, there is hardly any difference between the common features generated by IAOPM and FP-Net.

Figure 9. The visualisation of common features in 3D space for IAOPM and FP-Net.


According to the above experiments, IAOPM's method of generating common features through IA (shown in Algorithm 1, lines 3–8) and optimising them using OPL (shown in Algorithm 1, line 13) is superior to FP-Net.

4.4. Ablation experiments and analysis

In order to analyse the effectiveness of each component of IAOPM, we performed the ablation experiments reported in Table 4. Note that in CNN + O + OPL, we replace $(A f_t)$ with $f_t$ in Equation (3) to generate the common features, so that the experiment can still be carried out. In Table 4, CNN + X means adding component X to the CNN, where IA denotes the inverted attention module, O the orthogonal projection module, and OPL the orthogonal projection loss. From the results reported in Table 4, we can observe the following:

Table 4. Ablation experiments.

In the CNN + IA + O experiment, we remove OPL, so the common features are no longer constrained. The results show that accuracy decreases by 1.6% on the CR dataset, 0.47% on MR, 1.16% on SST2, and 0.7% on Subj. Together with Figure 7(a) and (b), this shows that constraining the common features with OPL (i.e. Equation (10)) is effective.

In the CNN + IA and CNN + IA + OPL experiments, we removed the orthogonal projection module. Obviously, this discards a large number of high-weight features. Compared with IAOPM, accuracy decreases by 19.37% (CNN + IA) and 21.49% (CNN + IA + OPL) on the CR dataset. In the CNN + O + OPL experiment, we removed IA and obtain the common features only through $W_V$; compared to IAOPM, accuracy drops by 1.6%. These experiments show that generating common features through IA and purifying the text features with orthogonal projection is effective and feasible.

These results indicate that each component in IAOPM is important and the absence of any one component will result in a loss of accuracy.

Finally, with all other parameters held constant, we conducted experiments on the threshold parameter $T$ in Equation (2), as shown in Table 5, where the hyperparameter $T$ is expressed as a percentage. For example, T = 10 means removing the top 90% of the text features, and T = 90 means removing the top 10%. As shown in Table 5, it is generally recommended to set $T$ between 20% and 30% for sequence models such as RNN and LSTM, and between 60% and 70% for CNN. However, for the best purification performance, the threshold should be adjusted to the specific situation. The ability to flexibly adjust this parameter to control the purification effect also suggests that IAOPM has potential for industrial applications.

Table 5. Hyperparametric T experiment.
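Under the percentage reading of T described above, thresholding could be implemented roughly as in the sketch below, which keeps the lowest T% of attention weights per sample; computing the cutoff per sample with a quantile is an assumption, and the function name is illustrative.

```python
import torch

def percentile_inverted_attention(attn_map: torch.Tensor, t_percent: float) -> torch.Tensor:
    """Keep the lowest t_percent% of attention weights and zero out the rest.

    t_percent = 10 removes the top 90% of positions; t_percent = 90 removes the top 10%.
    attn_map: (batch, seq_len) attention map M.
    """
    cutoff = torch.quantile(attn_map, t_percent / 100.0, dim=-1, keepdim=True)
    return torch.where(attn_map > cutoff, torch.zeros_like(attn_map), attn_map)
```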

5. Conclusion

In this study, we proposed the Inverted Attention Orthogonal Projection Module (IAOPM). This method extracts common features by iteratively reversing the attention distribution map of the text features using inverted attention, ensures the quality of these common features through OPL, and finally purifies the text features using orthogonal projection. Compared to the original orthogonal projection method, IAOPM does not add any branches and is more flexible without sacrificing performance. Extensive comparison experiments show that the proposed method performs better than the baseline models, the self-attention mechanism, and the original orthogonal projection method on multiple text classification datasets, with average accuracy increases of 1.02%, 0.44%, and 0.52%, respectively. Currently, our method is only applicable to conventional text classification models such as sequence models (RNN, LSTM, GRU, etc.) and BERT, and it can only ensure that the common features are shared within one or a few mini-batches. In future work, we will try to guarantee the sharing of common features globally and further reduce the number of parameters.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This work is supported by the National Science Foundation of China [grant number 62166025] and the Science and Technology Project of Gansu Province [grant number 21YF5GA073].

References

  • Blitzer, J., McDonald, R., & Pereira, F. (2006). Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 120–128.
  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078.
  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv:1901.02860. https://doi.org/10.48550/arXiv.1901.02860.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  • Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International conference on machine learning (pp. 1180–1189). PMLR.
  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  • Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168–177. https://doi.org/10.1145/1014052.1014073
  • Hu, Y. L., & Zhao, Q. S. (2021). Bi-GRU model based on pooling and attention for text classification. International Journal of Wireless and Mobile Computing, 21(1), 26–31. https://doi.org/10.1504/IJWMC.2021.119057
  • Huan, H., Guo, Z., Cai, T., & He, Z. (2022). A text classification method based on a convolutional and bidirectional long short-term memory model. Connection Science, 34(1), 2108–2124. https://doi.org/10.1080/09540091.2022.2098926
  • Huang, Z., Ke, W., & Huang, D. (2020). Improving object detection with inverted attention. In 2020 IEEE winter conference on applications of computer vision (WACV) (pp. 1294–1302). IEEE.
  • Jing, R. (2019). A self-attention based LSTM network for text classification. Journal of Physics: Conference Series, 1207(1), 012008. https://doi.org/10.1088/1742-6596/1207/1/01200
  • Jordan, M. I. (1997). Serial order: A parallel distributed processing approach. Advances in Psychology, 121, 471–495. https://doi.org/10.1016/S0166-4115(97)80111-2
  • Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proc. EMNLP, 1746–1751. https://doi.org/10.3115/v1/D14-1181
  • Li, Z., Zhang, Y., Wei, Y., Wu, Y., & Yang, Q. (2017). End-to-End adversarial memory network for cross-domain sentiment classification. In IJCAI, 2237–2243. https://doi.org/10.24963/ijcai.2017/311
  • Liu, P., Qiu, X., & Huang, X. (2016). Recurrent neural network for text classification with multi-task learning. arXiv:1605.05101.
  • Liu, R., Cao, J., Sun, N., & Jiang, L. (2022). Aspect feature distillation and enhancement network for aspect-based sentiment analysis. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1577–1587.
  • Lv, H., Ning, Y., Ning, K., Ji, X., & He, S. (2022). Chinese Text Classification Using BERT and Flat-Lattice Transformer. In International Conference on AI and Mobile Services, 64–75. https://doi.org/10.1007/978-3-031-23504-7_5
  • Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems, 27. https://doi.org/10.48550/arXiv.1406.6247
  • Pan, S. J., Ni, X., Sun, J. T., Yang, Q., & Chen, Z. (2010). Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide web, 751–760. https://doi.org/10.1145/1772690.1772767
  • Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv: cs/0409058.
  • Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv: cs/0506075.
  • Qin, Q., Hu, W., & Liu, B. (2020). Feature projection for improved text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8161–8171. https://doi.org/10.18653/v1/2020.acl-main.726
  • Ranasinghe, K., Naseer, M., Hayat, M., Khan, S., & Khan, F. S. (2021). Orthogonal projection loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12333–12343. https://doi.org/10.48550/arXiv.2103.14021
  • Ruan, J., Caballero, J. M., & Juanatas, R. A. (2022). Chinese news text classification method based On attention mechanism. In 2022 7th international conference on business and industrial research (ICBIR) (pp. 330-334). IEEE.
  • Shaheen, Z., Wohlgenannt, G., & Filtz, E. (2020). Large scale legal text classification using transformer models. arXiv: 2010.12871.
  • Sivakumar, S., & Rajalakshmi, R. (2021). Analysis of sentiment on movie reviews using word embedding self-attentive LSTM. International Journal of Ambient Computing and Intelligence, 12(2), 33–52. https://doi.org/10.4018/IJACI.2021040103
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, https://doi.org/10.48550/arXiv.1706.03762
  • Wang, C., & Zhang, F. (2022). The performance of improved XLNet on text classification. In Third International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2022), 12329, 154–159.
  • Wei, S., Zhu, G., Sun, Z., Li, X., & Weng, T. (2022). GP-GCN: Global features of orthogonal projection and local dependency fused graph convolutional networks for aspect-level sentiment classification. Connection Science, 34(1), 1785–1806. https://doi.org/10.1080/09540091.2022.2080183
  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32. https://doi.org/10.48550/arXiv.1906.08237
  • Yu, J., & Jiang, J. (2016). Learning sentence embeddings with auxiliary tasks for cross-DomainSentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 236–246. https://doi.org/10.18653/v1/D16-1023
  • Zhang, K., Zhang, H., Liu, Q., Zhao, H., Zhu, H., & Chen, E. (2019). Interactive attention transfer network for cross-domain sentiment classification. Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33(1), pp. 5773-5780). https://doi.org/10.1609/aaai.v33i01.33015773
  • Zhou, Y., Li, J., Chi, J., Tang, W., & Zheng, Y. (2022). Set-CNN: A text convolutional neural network based on semantic extension for short text classification. Knowledge-Based Systems, 257, 109948. https://doi.org/10.1016/j.knosys.2022.109948