
Employing synthetic data for addressing the class imbalance in aspect-based sentiment classification

Pages 167-188 | Received 16 Apr 2023, Accepted 10 Oct 2023, Published online: 30 Oct 2023

ABSTRACT

The class imbalance problem, in which the distribution of classes in the training data is unequal or skewed, is prevalent. It can bias classifier algorithms and degrade performance on the minority class. In this paper, we address the class imbalance problem in datasets for aspect-based sentiment classification. Aspect-based Sentiment Classification (AbSC) is a type of fine-grained sentiment analysis in which sentiments about particular aspects of an entity are extracted. We address the imbalance by creating synthetic data, for which two techniques are proposed: paraphrasing using a fine-tuned PEGASUS model and backtranslation using the M2M100 neural machine translation model. We compare these techniques with two existing class-balancing techniques: weighted oversampling and cross-entropy loss with class weights. An extensive experimental study is conducted on three benchmark restaurant-review datasets: SemEval-2014, SemEval-2015, and SemEval-2016. We apply these methods to a BERT-based deep learning model for aspect-based sentiment classification and study the effect of balancing the data on model performance. Our proposed balancing technique using synthetic data yields better results than the two existing methods for dealing with multi-class imbalance.

1. Introduction

There has been a remarkable shift in recent years towards online shopping and service bookings. Due to the widespread availability of the internet, users leave feedback and reviews of the products they have purchased and the services they have received on different social platforms. Many of these reviews carry rich sentiments and mixed opinions revealed implicitly. This review data has great potential and can be a valuable resource for consumers and businesses. Consumers can use other consumers' reviews to judge the quality of a product or service and use them as a reference when opting for it. Furthermore, these reviews help a business understand how people feel about its brand, product, or service on social media. However, given the growing number of online reviews, manually reading all customer reviews to extract useful information is impractical. Hence, the information needs to be analyzed and evaluated by automated systems. Sentiment Analysis (Pang & Lee, Citation2008) is a solution to this problem. The term ‘Sentiment Analysis (SA)’ refers to the analysis of emotions, sentiments, and opinions expressed in text data. Sentiment analysis uses text mining and related approaches to identify and extract subjective information from texts in order to determine the content's sentiment polarity (negative, positive, or neutral).

Sentiment analysis is studied mainly at three levels: document level, sentence level, and aspect level. In document-level sentiment analysis (Tripathy et al., Citation2017), the entire document is analyzed and the sentiment expressed in it is extracted; the complete document is assumed to express only one sentiment. For instance, the review ‘The food was really delicious. I had the onion soup and it was one of the best ever. But I was highly disappointed by their slow service’ contains sentences with different polarities, yet document-level sentiment analysis classifies it as neutral. In sentence-level SA (Liu, Citation2010), the aim is to extract the sentiment expressed in a single sentence. For example, the sentence ‘But I was highly disappointed by their slow service’ is classified as negative, while ‘The food was really delicious. I had the onion soup and it was one of the best ever’ is classified as positive. Documents typically contain multiple sentences expressing differing opinions about various aspects of the same entity. Document-level sentiment analysis assigns a single opinion score to the document as a whole, which may or may not be informative. Moreover, multiple entities or different aspects of an entity are frequently compared within the same statement, or sentiments expressed in a single sentence are contrasted. Document-level and sentence-level sentiment analysis may fail to extract the sentiments expressed precisely in such scenarios. Aspect-level sentiment analysis (Pontiki et al., Citation2014), on the other hand, seeks the sentiment expressed toward a particular aspect in a text. For instance, in the sentence ‘This place is really trendi but they have forgotten about the most important part of a restaurant, the food’, the sentiment toward the aspect ‘place’ is positive but the sentiment toward the aspect ‘food’ is negative.

Aspect-based sentiment classification (AbSC) (Pontiki et al., Citation2014) is a type of fine-grained sentiment analysis wherein we extract sentiments about specific aspects of an entity. This includes two subtasks: aspect term extraction and sentiment classification towards that aspect. For example, in the restaurant review ‘The food was great but the service is dreadful’, two aspect terms, ‘food’ and ‘service’, are identified in the first subtask. In the second subtask, the aspects are assumed to be known in advance and we find the sentiment for each aspect. The sentiments associated with the aspect terms ‘food’ and ‘service’ in the example above are positive and negative, respectively.

In this study, we primarily consider the existing techniques for aspect-based sentiment classification where aspects are predetermined. Various dictionary-based, machine-learning (Ghiassi & Lee, Citation2018; F. Tang et al., Citation2019) and deep-learning-based (F. Fan et al., Citation2018; Ma et al., Citation2017; Song et al., Citation2019; D. Tang et al., Citation2015; Y. Wang, Huang, et al., Citation2016; Zeng et al., Citation2019) approaches have been proposed for aspect-based sentiment classification. The majority of these studies use the SemEval datasets (Pontiki et al., Citation2016, Citation2015, Citation2014) as benchmarks. These datasets include restaurant reviews labelled as positive, negative, or neutral; their details are discussed in Section 3 of this paper. After studying the statistics of these datasets, we found that the distribution of polarities in the restaurant datasets is not balanced: the number of samples of one class is significantly greater than that of the other classes. Such datasets are referred to as imbalanced datasets.

The majority class is the class with a larger number of samples, and the minority class is the class with relatively fewer examples. Classifier models become biased towards the majority class if the training data is not balanced. The class imbalance problem is well studied for image datasets (Johnson & Khoshgoftaar, Citation2019), but limited work is available for text datasets. In the real world, expressed sentiments reflect whether people loved or disliked a product or service; hence, the number of positive or negative sentiment sentences is often significantly higher than that of the other classes. In this work, we address the class imbalance problem in aspect-based sentiment analysis datasets using synthetic data.

By analyzing the dataset statistics, the minority classes are identified. We compute the number of samples needed for each minority class to balance the class distribution and generate the corresponding number of synthetic samples, which are then incorporated into the original dataset. We use two methods for generating pseudo-data: paraphrasing and backtranslation. A fine-tuned PEGASUS model (Zhang et al., Citation2020) and the M2M100 neural machine translation model (A. Fan et al., Citation2021) are used for paraphrasing and backtranslation, respectively. The main contributions of this work are summarized below:

  • This paper addresses the class imbalance problem in aspect-based sentiment classification datasets by using synthetic data.

  • To balance class distribution, we generated new minority class examples using two methods: paraphrasing and backtranslation.

  • We compared our proposed methods to weighted oversampling and cost-sensitive learning as balancing strategies at the data and algorithm levels, respectively.

  • An extensive experimental study is carried out on the widely used SemEval-2014, SemEval-2015, and SemEval-2016 benchmark datasets.

  • The effect of class balancing on the performance of a BERT-based model for aspect-based sentiment analysis is studied.

The rest of the paper is organized as follows: related work on aspect-based sentiment classification and handling class imbalance is discussed in Section 2. Section 3 provides details of the datasets used in our study. The methods we use for class balancing are described in Section 4. Sections 5–7 describe the experiments conducted, the results and discussion, and the conclusion, respectively.

2. Related work

The objective of this work is to apply methods for handling imbalanced data to datasets used for aspect-based sentiment classification. Hence, we studied existing work on aspect-based sentiment classification and on methods for handling class imbalance. The following subsections cover these two areas, respectively.

2.1. Aspect-based sentiment classification methods

The main steps in aspect-based sentiment classification are aspect extraction and determining the sentiment towards each aspect. Some existing studies focus only on aspect extraction, others only on finding the sentiment for a given aspect, and many joint methods are also available. In this paper, we focus on sentiment classification for aspects that have already been extracted; therefore, we list studies related to sentiment classification only. Various approaches have been proposed for aspect-based sentiment classification, mainly traditional machine learning and deep learning methods.

Early works on aspect-based sentiment classification relied on dictionaries to identify the sentiment of individual words; the sentiment was then assigned to the aspect by aggregating the sentiment of the surrounding words. Later approaches are based on supervised and unsupervised machine learning methods. Schouten and Frasincar (Citation2015) presented a detailed review of different approaches for aspect detection and sentiment classification. In supervised approaches, lexicon information is extracted from training data and used to train the classifier. In unsupervised approaches, the aspect is used to find potential sentiment phrases, and the phrases expressing a positive or negative sentiment are retained.

Recently, Deep Neural Network (DNN) methods have become popular because they do not need feature engineering. In DNN-based methods, the input text is represented as continuous low-dimensional vectors called word embeddings (Bengio et al., Citation2003), which are fed to the first hidden layer of the DNN. These word embeddings are trained on a huge text corpus. GloVe (Pennington et al., Citation2014) and Word2vec (Mikolov et al., Citation2013) are examples of pretrained word embeddings.

It has been observed that the attention mechanism improves the performance of most DNN-based models, where attention weights are calculated from the correlation between aspect and context. The vanishing gradient problem of Recurrent Neural Networks (RNNs) is solved by using Long Short-Term Memory (LSTM) networks (Sutskever et al., Citation2014). TD-LSTM (D. Tang et al., Citation2015) and ATAE-LSTM (Y. Wang, Huang, et al., Citation2016) are examples of LSTM-based models. Earlier aspect-based sentiment classification works treated the aspect as independent information; interactive learning of aspect words and context was introduced by Ma et al. (Citation2017). The multi-grained attention network MGAN (F. Fan et al., Citation2018) uses an attention mechanism to calculate the interaction between aspect and context at the word level.

Recently, the use of pre-trained language models like ELMo (Peters et al., Citation2018) and BERT (Devlin et al., Citation2018) has significantly improved the performance of aspect-based sentiment classification. BERT-SPC (Devlin et al., Citation2018) is the BERT text-pair classification model, where the original BERT model is adapted for aspect-based sentiment classification. AEN-BERT (Song et al., Citation2019) is an attention-based model that uses attentional encoders to model the interaction between context and target. LCF-BERT, a multi-head self-attention-based model proposed by Zeng et al. (Citation2019), focuses on the correlation between sentiment polarity and local context; additional layers give greater attention to the local context words.

2.2. Methods for handling class-imbalance

In the presence of imbalanced data, classifiers often over-classify the majority class because of its larger number of training samples, while samples of the minority class are misclassified more frequently; the classifier model becomes biased towards the majority class. Furthermore, evaluation metrics such as accuracy aim to minimize overall error, to which the minority class contributes little. Many solutions for handling imbalanced data have been proposed at the data level and the algorithm level (Krawczyk, Citation2016). At the data level, the dataset is balanced using resampling methods such as oversampling and undersampling, or by generating synthetic data. In oversampling, minority class samples are duplicated randomly; undersampling removes samples from the majority class in order to balance the class distribution. SMOTE and its variants (Chawla et al., Citation2002; Han et al., Citation2005) are well-known data-level approaches for dealing with class imbalance. In algorithm-level methods, the training data distribution is not altered; instead, the learning algorithm is modified to give the minority class more importance, for example by incorporating class weights or adjusting the decision threshold to reduce bias towards the majority class. Cost-sensitive learning and decision-threshold adjustment are algorithm-level methods for dealing with imbalanced data, while hybrid methods combine sampling and algorithmic methods. Machine learning techniques for handling class imbalance have been widely studied in the past; an overview is given in Ganganwar (Citation2012).

In recent years, deep learning techniques for handling class imbalance have been successfully applied in various domains. Johnson and Khoshgoftaar (Citation2019) describe different deep learning approaches for addressing class imbalance. At the data level, a new sampling method was introduced by Pouyanfar et al. (Citation2018) in which sampling rates are adjusted according to class-wise performance. Algorithm-level approaches include cost-sensitive learning, new loss functions, and threshold moving. S. Wang, Liu, et al. (Citation2016) proposed two new loss functions, ‘mean squared false error (MSFE)’ and ‘mean false error (MFE)’, which give more attention to errors coming from the minority class. Lin et al. (Citation2017) proposed focal loss, in which the cross-entropy (CE) loss is modified to reduce the effect of easily classified examples on the loss.

Data augmentation (DA) refers to techniques for increasing the diversity of training data without actually gathering more data. The most common application of DA is to prevent overfitting. In comparison to computer vision, data augmentation in NLP has received relatively little attention. One application of data augmentation is to handle class imbalance by increasing minority class samples. Synonym replacement, SeqMixUp, and random swap are examples of DA methods (Connor et al., Citation2021; Feng et al., Citation2021). Wei and Zou (Citation2019) introduced Easy Data Augmentation (EDA) to increase the effectiveness of text classification; EDA has four basic yet effective operations: random insertion, random deletion, random swap, and synonym replacement. The popular backtranslation method (Sennrich et al., Citation2016) translates a sequence into another language and then back into the original language. In this work, we address class imbalance by augmenting minority class samples using two augmentation methods, paraphrasing and backtranslation, and study their effectiveness against weighted oversampling and cost-sensitive learning.

3. Dataset description

In our study, we used the SemEval-2014 (Pontiki et al., Citation2014), SemEval-2015 (Pontiki et al., Citation2015), and SemEval-2016 (Pontiki et al., Citation2016) datasets from the restaurant domain. These are benchmark datasets for aspect-based sentiment classification. SemEval (Semantic Evaluation) is a series of NLP research workshops. Aspect-based sentiment classification is one of the SemEval competition tasks, where the objective is to find the aspects of provided target entities and extract the sentiment expressed towards each aspect. Each dataset contains review sentences, aspect terms and their polarities, and aspect categories and their polarities. An XML snippet of the SemEval-2014 restaurant dataset is shown in Figure 1.

Figure 1. XML snippet of the SemEval-2014 restaurant dataset.

The statistics of the SemEval restaurant review datasets are shown in Table 1. There are 2164 positive samples, 807 negative samples, and 637 neutral samples in the SemEval-2014 dataset. Table 1 clearly shows that the number of positive samples is higher than the number of negative and neutral samples, resulting in an imbalanced class distribution. The class distribution is plotted in Figure 2.

Figure 2. Class distribution in SemEval datasets.

Table 1. Statistics of the SemEval Restaurant review datasets.

The last column in Table 1 specifies the Imbalance Ratio, which represents the level of imbalance in the dataset. The Imbalance Ratio $\rho$ (Johnson & Khoshgoftaar, Citation2019) is defined in Equation (1): (1) $\rho = \frac{\max_i\{|C_i|\}}{\min_i\{|C_i|\}}$ where $|C_i|$ is the number of examples in class $i$, and $\min_i\{|C_i|\}$ and $\max_i\{|C_i|\}$ return the minimum and maximum class size over all classes, respectively; $\rho$ thus indicates the maximum between-class imbalance level. All of these datasets are imbalanced, with imbalance ratios ranging from 3.39 to 18. In all datasets, the number of positive samples significantly outnumbers that of negative and neutral samples.
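As an illustration, the imbalance ratio follows directly from the per-class counts; the minimal sketch below uses the SemEval-2014 counts from Table 1.

```python
# Minimal sketch: imbalance ratio (Equation 1) from per-class counts.
def imbalance_ratio(class_counts):
    sizes = class_counts.values()
    return max(sizes) / min(sizes)

# SemEval-2014 restaurant dataset counts from Table 1.
print(imbalance_ratio({"positive": 2164, "negative": 807, "neutral": 637}))
# 2164 / 637 ≈ 3.4
```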

4. Methodology

In this paper, we address the class imbalance problem in aspect-based sentiment classification datasets. We propose a multi-class balancing technique based on generating synthetic data. The architecture of the proposed system is shown in Figure 3.

Figure 3. Architecture of the proposed system.

In the first step, the imbalance level (Table 1) in the dataset is determined, and the necessary number of synthetic samples for each minority class is generated. For example, in the SemEval-2014 dataset, the positive class is the majority class, whereas the negative and neutral classes are minority classes. The imbalance ratio for this dataset is 3.39; as a result, three instances are generated for each negative instance. The number of minority instances needed to obtain a balanced dataset, given by the difference between the majority and minority class counts, is then randomly selected from the created synthetic data. The datasets used in this work contain three classes: positive, negative, and neutral. We calculated the imbalance level for each class (see Table 1). If the imbalance level equals one, no new samples are needed; we generated synthetic samples for minority classes whose imbalance level was greater than one.
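A minimal sketch of this step follows, using the per-class counts of Table 1; the function name and the use of a ceiling for the per-sample augmentation factor are our own illustrative choices, not taken from the paper.

```python
import math

# Sketch: how many synthetic samples each minority class needs to match
# the majority class, and how many augmentations per original sample.
def augmentation_plan(class_counts):
    majority = max(class_counts.values())
    plan = {}
    for label, count in class_counts.items():
        deficit = majority - count                    # samples to generate
        if deficit > 0:
            per_sample = math.ceil(majority / count)  # augmentations per original
            plan[label] = (deficit, per_sample)
    return plan

print(augmentation_plan({"positive": 2164, "negative": 807, "neutral": 637}))
# {'negative': (1357, 3), 'neutral': (1527, 4)}
```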

In our proposed method, we generate new samples of the negative and neutral classes using the paraphrasing and backtranslation methods to balance the datasets; these methods are described in detail in Section 4.1. After balancing the dataset, we generate BERT word embeddings for the review statement and the aspect term, and these embeddings are provided to the BERT-based model as a sentence pair. Details of the model are provided in Section 4.2.

To demonstrate the effectiveness of our method, we compared its performance to that of existing data-level and algorithm-level methods. Weighted oversampling is the data-level method used for comparison: word embeddings are generated for the text data, and during training, samples are assigned probabilities depending on their class weights and drawn accordingly for each batch. Because of the higher class weight, the likelihood of drawing minority class samples into a batch is higher, so the model is not biased towards the majority class. In the second method, cross-entropy loss with class weights, the loss function is modified so that the loss from the minority class is larger, giving more attention to learning the minority class. Details of each method are provided in Sections 5.3 and 5.4.

4.1. Proposed class-balancing approach using synthetic data

Data augmentation (DA) refers to techniques for increasing training data diversity without collecting additional data. Most approaches either add slightly altered copies of existing data or produce synthetic data, with the aim of preventing overfitting while training the models (Connor et al., Citation2021). Data augmentation is mainly used to solve the problem of the limited amount of annotated data. In our work, we have used data augmentation to address issues caused by class imbalance. We have generated synthetic minority class samples using two paraphrasing methods. Details of these methods are described in the following subsections.

4.1.1. Paraphrasing using PEGASUS transformer model

The process of rephrasing a text while maintaining its semantics is known as paraphrasing. In the literature, a variety of strategies have been employed to produce paraphrases while maintaining two key properties: semantic similarity and text diversity.

The first technique we use for balancing the dataset generates new minority class samples by paraphrasing the original minority class samples in the dataset. We use the PEGASUS transformer (Zhang et al., Citation2020) for paraphrasing. PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models) is a Transformer-based encoder-decoder model. Its pre-training task is quite similar to summarization: complete, important sentences are removed and masked from an input document, and the model must generate the missing sentences from the remaining ones. We used a PEGASUS model fine-tuned for paraphrase generation, available from Hugging Face.

We first calculate the maximum between-class imbalance level (IL) as per Equation (1). For each minority class statement, a number of paraphrased statements equal to the IL is generated in order to balance the dataset. Paraphrased texts that do not contain the aspect term of the input text are removed, as are duplicates. Below is an example of paraphrased text generated for a given input, with the number of return sequences set to 5; a sketch of this generation-and-filtering step follows the example.

Input text to paraphrase: ‘The food was lousy -- too sweet or too salty and the portions tiny.’

Paraphrased text:

  • ‘The food was not good, it was too sweet or salty and the portions were small.’

  • ‘The food was not good and the portions were small.’

  • ‘The food was bad, too sweet or salty, and the portions were small.’

  • ‘The food was not good, it was too sweet or salty, and the portions were tiny.’

  • ‘The food was too sweet or salty and the portions were small.’
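The sketch below shows how such paraphrases can be generated and filtered with the Hugging Face transformers library. The checkpoint name tuner007/pegasus_paraphrase is an assumption for illustration; the paper does not state which fine-tuned PEGASUS checkpoint was used.

```python
# Sketch of paraphrase generation and filtering; the checkpoint name is
# an assumption, not necessarily the one used in the paper.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"  # assumed paraphrase checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def paraphrase(text, aspect, num_return_sequences=5):
    batch = tokenizer([text], truncation=True, padding="longest",
                      return_tensors="pt")
    outputs = model.generate(**batch, max_length=60, num_beams=10,
                             num_return_sequences=num_return_sequences)
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Keep only paraphrases that retain the aspect term; drop duplicates.
    return list({c for c in candidates if aspect.lower() in c.lower()})

print(paraphrase("The food was lousy -- too sweet or too salty "
                 "and the portions tiny.", aspect="food"))
```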

4.1.2. Backtranslation using NMT model

Paraphrasing using backtranslation is very promising because backtranslation models can generate several diverse paraphrases while preserving the semantics of the text. M2M100 (A. Fan et al., Citation2021) is an encoder-decoder (sequence-to-sequence) model trained to translate between multiple languages. We systematically selected a set of 35 languages as the intermediate language set for backtranslation: we grouped all the languages supported by the M2M100 model into their corresponding language families and then picked the most spoken languages in the world from each family. Table 2 displays the selected languages and their corresponding language families.

Table 2. Languages selected for backtranslation.

In the first step, each review sample is translated into a target language selected randomly from the set of 35 languages using the M2M100 neural machine translation model (A. Fan et al., Citation2021). In the second step, it is translated back into the source language. Figure 4 demonstrates the process of backtranslation: the review statement is translated into Dutch, German, Arabic, and Gujarati and then back into English, so four reviews are generated from one review statement. The imbalance ratio of the dataset determines how many minority class samples should be generated.

Figure 4. Backtranslation example.
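A minimal sketch of one backtranslation round trip with M2M100 is shown below; the 418M-parameter checkpoint is an assumption, as the paper does not state the model size used.

```python
import random
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

checkpoint = "facebook/m2m100_418M"  # assumed checkpoint size
tokenizer = M2M100Tokenizer.from_pretrained(checkpoint)
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)

def translate(text, src, tgt):
    tokenizer.src_lang = src
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def backtranslate(text, languages):
    pivot = random.choice(languages)  # random intermediate language
    return translate(translate(text, "en", pivot), pivot, "en")

# Illustrative subset of the intermediate languages in Table 2.
print(backtranslate("The food was great but the service is dreadful.",
                    ["nl", "de", "ar", "gu"]))
```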

4.2. Aspect-based sentiment classification model

In recent times, the use of pre-trained language models like BERT (Devlin et al., Citation2018) has significantly improved the performance of aspect-based sentiment classification. BERT (Bidirectional Encoder Representations from Transformers) was proposed by Google Research in 2018. The Transformer is an encoder-decoder architecture that uses self-attention on the encoder side and attention on the decoder side; BERT uses only the encoder stack of the Transformer. BERT is pre-trained on two separate but related NLP tasks: masked language modeling and next sentence prediction.

In this work, the original pre-trained BERT model is fine-tuned by adding one additional layer for the aspect-based sentiment classification task. The input sequence is prepared by appending the aspect to the context, treating aspect and context as two segments. The model architecture is shown in Figure 5.

Figure 5. BERT model for aspect-based sentiment classification.

Consider a review text $S = \{w_1, \dots, w_n\}$ consisting of a sequence of $n$ tokens. The aspect set $A = \{a_1, \dots, a_x\}$ is a part of sentence $S$, where $x$ is the number of aspect terms; an aspect term may contain multiple words. A sentiment polarity $P \in \{\text{positive}, \text{neutral}, \text{negative}\}$ is associated with each aspect term in $A$. The objective of the aspect-based sentiment classification task is to predict the polarity of sentence $S$ with respect to each aspect term. BERT marks its input using two special tokens, [CLS] and [SEP]: [CLS] is added at the beginning of the input and [SEP] at the end, and in the case of sentence pairs, [SEP] is also inserted at the end of the first sentence to separate the two inputs. For instance, if the review text is ‘The food was great but the service is dreadful’ and the aspect terms are food and service, the input statement is given twice, once for each aspect term: the first input is ‘[CLS] The food was great but the service is dreadful [SEP] food [SEP]’ and the second is ‘[CLS] The food was great but the service is dreadful [SEP] service [SEP]’. For each token, BERT uses positional and segment embeddings in addition to token embeddings. Positional embeddings encode the order of tokens, while segment embeddings are useful when the model input contains sentence pairs: tokens from the first sentence have segment embedding 0 and tokens from the second sentence have segment embedding 1.
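This sentence-pair encoding can be produced with the Hugging Face tokenizer, as in the following sketch; the bert-base-uncased checkpoint and the use of BertForSequenceClassification as the classification head are assumptions for illustration.

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # positive / neutral / negative

review = "The food was great but the service is dreadful"
for aspect in ["food", "service"]:
    # Encoded as: [CLS] review [SEP] aspect [SEP]; token_type_ids mark
    # review tokens with segment 0 and aspect tokens with segment 1.
    inputs = tokenizer(review, aspect, return_tensors="pt")
    logits = model(**inputs).logits  # shape (1, 3): one score per polarity
```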

5. Experiments

We conducted four experiments to analyze the impact of data balancing on the performance of an aspect-based sentiment classification model. The objectives of our work are as follows:

  • To study the performance of the model using weighted oversampling, the data-level balancing method.

  • To analyze the impact of balancing the dataset using a modified cross-entropy loss function using class weights.

  • To investigate the impact of employing the proposed class balancing technique, in which synthetic data is generated using the PEGASUS transformer model.

  • To examine how the proposed backtranslation-based class-balancing strategy affects the model's performance, and to compare the outcomes with the three methods mentioned above.

The following subsections provide a description of hyperparameters, evaluation metrics and details of each experiment.

5.1. Hyper-parameters

The hyperparameters were determined by conducting a large number of comparative experiments. Some followed the widely accepted settings for this task, such as the 80-20 train-test split and 10 training epochs. The fine-tuning of BERT is highly sensitive to the learning rate; the BERT paper (Devlin et al., Citation2018) showed that a smaller learning rate maximizes its performance. In our experiments, we found that bigger batch sizes could decrease the model's regularization stability, leading to poorer results, so an optimal batch size of 16 was chosen. A large dropout slows the model's convergence; after comprehensive consideration, the dropout of the model was set to 0.1 and the learning rate to $2 \times 10^{-5}$.
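The following skeleton assembles these settings into a fine-tuning loop. It is a sketch only: train_dataset is a hypothetical, already-encoded 80% split of (review, aspect, label) examples, and the AdamW optimizer is our assumption (the standard choice for BERT fine-tuning, not stated in the paper).

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification

# Sketch: hyperparameters as reported above; AdamW is an assumed choice.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3,
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# train_dataset: hypothetical 80% split of encoded examples with labels.
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
for epoch in range(10):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the three classes
        loss.backward()
        optimizer.step()
```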

5.2. Evaluation metrics

Accuracy and F1-score are commonly used metrics for evaluating the performance of classification models.

5.2.1. Accuracy

Accuracy is defined in Equation (2): (2) $\text{Accuracy} = \frac{\text{correct predictions}}{\text{total predictions}}$ However, this evaluation measure is not appropriate for imbalanced classification: a model that predicts only the majority class correctly still achieves high accuracy. In an imbalanced dataset, minority class examples are important, but their misclassification has little impact on accuracy.

5.2.2. F1-Score

The F1-score, also known as F-measure, is a useful evaluation metric for imbalanced datasets. It combines precision (P) and recall (R) as in Equation (3): (3) $F1 = \frac{2PR}{P + R}$ where precision and recall are calculated as in Equations (4) and (5), respectively: (4) $P = \frac{TP}{TP + FP}$ (5) $R = \frac{TP}{TP + FN}$ True Positives (TP) is the number of samples correctly predicted as ‘positive’, False Positives (FP) is the number of samples wrongly predicted as ‘positive’, and False Negatives (FN) is the number of samples wrongly predicted as ‘negative’. Because there are three classes, the F1-score is calculated for each class in a one-vs-rest fashion, and the macro-average is computed.
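Both metrics are available in scikit-learn; a minimal sketch with toy labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only.
y_true = ["positive", "positive", "negative", "neutral", "positive"]
y_pred = ["positive", "positive", "positive", "neutral", "positive"]

print(accuracy_score(y_true, y_pred))             # Equation (2)
# Per-class F1 (one-vs-rest) averaged with equal weight per class.
print(f1_score(y_true, y_pred, average="macro"))  # Equation (3), macro-averaged
```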

5.3. Experiment 1: weighted oversampling

In this method, each sample is assigned a probability of being sampled, defined by the weight given to its class. As shown in Equation (6), class weights are calculated as the reciprocal of the number of items in each class, so a higher weight is assigned to the minority class: (6) $\text{weight}_i = \frac{1}{\text{number of samples of class } i}$ The probabilities are computed by normalizing the weight vector, and each class's probability is assigned to all of the samples in that class. Because of the higher probability, a minority class sample is more likely to be drawn when forming a batch. The class distribution of samples across 15 batches (each of size 16) before and after oversampling is shown in Figure 6. Before oversampling, each batch contains more positive samples than negative ones, so the model is trained with a higher number of positive samples in each batch. After weighted oversampling, as Figure 6 shows, the probability of drawing a minority class sample is higher; as a result, the model is also trained on difficult-to-learn minority class examples.

Figure 6. Class distribution before and after weighted oversampling.
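This scheme corresponds to PyTorch's WeightedRandomSampler; a minimal sketch, using the SemEval-2014 class counts for illustration:

```python
from collections import Counter
import torch
from torch.utils.data import WeightedRandomSampler

# Per-sample labels; here the SemEval-2014 class counts from Table 1.
labels = ["positive"] * 2164 + ["negative"] * 807 + ["neutral"] * 637
class_counts = Counter(labels)

# Equation (6): each sample's weight is the reciprocal of its class size;
# the sampler normalizes these weights into draw probabilities.
weights = torch.tensor([1.0 / class_counts[y] for y in labels])
sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)
# Passing the sampler to a DataLoader, e.g.
# DataLoader(dataset, batch_size=16, sampler=sampler),
# then yields roughly class-balanced batches.
```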

In this oversampling method, we do not generate new minority class samples; instead, we assign a higher probability to minority class samples so that they are drawn more frequently than majority class samples. The results after using weighted oversampling are shown in Table 3. After applying class-weight-based probabilities to each sample, the F1-score improves for all datasets. Figure 7 displays the F1-scores obtained for each dataset before and after balancing with weighted oversampling.

Figure 7. Performance of the AbSC model using weighted oversampling.

Table 3. Experimental results.

5.4. Experiment 2: cross-entropy loss with class weights

Cross-entropy loss (CE loss) is a commonly used loss function for optimizing classification models. Equation (7) describes the cross-entropy loss, where $p_i$ is the softmax probability and $t_i$ is the truth label for the $i$th class: (7) $\text{CE Loss} = -\sum_{i=1}^{n} t_i \log(p_i)$, for $n$ classes. In the standard cross-entropy loss, the loss calculated from all class samples is given equal weight. In this method, the loss function is modified by including class weights: to handle the imbalanced class distribution in aspect-based sentiment classification datasets, we multiply the cross-entropy loss by class weights so that more attention is given to the loss coming from the minority class. The class weights are calculated according to Equation (8): (8) $\text{Class weight}_i = \frac{\text{total no. of samples}}{\text{no. of samples of class } i \times \text{no. of classes}}$ Cross-entropy loss with class weights is then calculated as in Equation (9), where $n$ is the number of classes in the dataset: (9) $\text{CE Loss with CW} = -\sum_{i=1}^{n} \text{Class weight}_i \, t_i \log(p_i)$ The minority class has a larger class weight than the majority class, so the loss from a class with fewer samples is higher. Table 3 shows the F1-scores for all three datasets before and after balancing using cross-entropy loss with class weights, and Figure 8 shows the corresponding graph.

Figure 8. Performance of the AbSC model using CE loss with class weights.
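In PyTorch, Equations (8) and (9) correspond to passing a weight vector to the built-in cross-entropy loss; a minimal sketch with the SemEval-2014 counts:

```python
import torch
import torch.nn as nn

# Class counts in a fixed order: negative, neutral, positive (Table 1).
counts = torch.tensor([807.0, 637.0, 2164.0])
class_weights = counts.sum() / (counts * len(counts))  # Equation (8)

# PyTorch's weighted cross-entropy implements Equation (9).
criterion = nn.CrossEntropyLoss(weight=class_weights)
logits = torch.randn(16, 3)            # dummy model outputs for a batch
targets = torch.randint(0, 3, (16,))   # dummy true class indices
loss = criterion(logits, targets)      # minority-class errors weigh more
```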

5.5. Experiment 3: class balancing using PEGASUS paraphrasing model

In this experiment, synthetic data is generated by paraphrasing the minority class samples of the dataset using the PEGASUS model fine-tuned for paraphrase generation. To create a balanced dataset, we first computed the difference between the majority and minority class counts, i.e. the number of minority class samples to be produced, then calculated how many times each minority sample needed to be augmented, and generated paraphrased statements for each minority class sample accordingly. This class-balancing approach improved the performance of the model for all the datasets (Table 3). The F1-scores for all the datasets before and after balancing using paraphrasing are shown in Figure 9.

Figure 9. Performance of the AbSC model using the paraphrasing class-balancing approach.

5.6. Experiment 4: class balancing using backtranslation

This is a data-level method for handling class imbalance in which we generate new minority class samples using backtranslation. We first determined the difference between the majority and minority class counts, i.e. the required number of minority class samples for a balanced dataset, and then figured out how many times each minority sample has to be augmented. Each minority class sample is translated into a randomly selected language and then back into English using a machine translation model.

For the random intermediate language, we consider the collection of 35 languages specified in Table 2, following the language-selection strategy of Corbeil and Ghadivel (Citation2020). Using the backtranslation class-balancing approach, the F1-score improves significantly across all datasets (Table 3 and Figure 10).

Figure 10. Performance of the AbSC model using the backtranslation class-balancing approach.

6. Results & discussion

In this research work, we explored the use of synthetic data for addressing class imbalance in the aspect-based sentiment classification task. We proposed two methods for generating synthetic data: paraphrasing using the PEGASUS transformer model, and backtranslation using the M2M100 neural machine translation model. The proposed methods are compared with data-level and algorithm-level class-balancing methods. Among the methods used for comparison, cross-entropy loss with class weights (Section 5.4) is an algorithm-level method in which the loss function is modified so that the loss contribution from minority class samples is larger. Weighted oversampling (Section 5.3) is a data-level method in which each sample is assigned a probability based on the number of samples per class; minority class samples receive a higher probability, so they are more likely to be drawn when forming a batch. The performance of supervised methods depends on the size of the training data, and as Table 1 shows, all the SemEval datasets are small. In our proposed class-balancing technique, new minority class samples are generated synthetically using paraphrasing and backtranslation; these methods outperform the other two methods for handling class imbalance. Using synthetically created data addresses both the small training data size and the uneven class distribution.

The macro-F1 scores achieved with all the balancing methods are shown in Table 3 and Figure 11. As can be seen in Table 3, results improve after applying the class-balancing methods. Compared to SemEval-2014, the improvement in overall F1-score is more significant on the SemEval-2015 and 2016 datasets due to their higher maximum between-class imbalance levels. The backtranslation method achieved the highest macro-F1 score among all the class-balancing methods.

Figure 11. Performance of the AbSC model using all class-balancing approaches.

Since we concentrate on enhancing minority class results through class-balancing approaches, we are particularly interested in the class-wise performance. The F1-score for each class is displayed in Table 4, and Figures 12–14 show the performance of all the class-balancing approaches for the negative, neutral, and positive classes, respectively. On the SemEval-2016 dataset, the proposed paraphrasing and backtranslation approaches perform equally well on the negative and neutral classes, while for SemEval-2014 and 2015, backtranslation performs better in most cases. With cross-entropy loss with class weights and weighted oversampling, the majority (positive) class F1-score deteriorates in order to improve minority class performance; this is not the case with the proposed paraphrasing and backtranslation methods.

Figure 12. Performance (F1-score) of the AbSC model for the negative class.

Figure 13. Performance (F1-score) of the AbSC model for the neutral class.

Figure 14. Performance (F1-score) of the AbSC model for the positive class.

Table 4. Class-wise Performance (F1-Score) of all class-balancing approaches.

The applicability of the paraphrasing approach is limited by the availability of paraphrasing models for the given language. However, because several neural machine translation models support a wide range of languages, the backtranslation technique may prove more broadly useful. The methods we described are applicable to datasets of any language and offer the dual benefits of balancing the dataset and increasing its diversity, which is particularly useful for low-resource language data. The quality of the generated data in both methods depends on the models employed, and the choice of intermediate languages is also crucial in backtranslation. We aim to extend this research by selecting only similar languages for backtranslation and studying the improvement in performance.

7. Conclusion

The majority of the existing aspect-based sentiment classification methods have used SemEval-2014, 2015, and 2016 datasets as benchmark datasets. We observed that the class distribution in these datasets is not balanced; the positive polarity class samples significantly outnumber the negative and neutral class samples. The imbalance ratio in SemEval datasets ranges from 3.39 to 18.

In this paper, we addressed the class imbalance problem in the SemEval restaurant-domain datasets used for aspect-based sentiment classification. Through backtranslation and paraphrasing, synthetic data is created for the minority classes. We also experimented with two further class-balancing methods: weighted oversampling and a modified cross-entropy loss. We studied the impact of the class-balancing methods on the performance of a BERT-based model for aspect-based sentiment classification. Experimental studies show that all four imbalanced-data handling methods boost the performance of the model. Our proposed class-balancing approach using backtranslation achieved the highest F1-scores of 76.09%, 71.97%, and 74.69% on the SemEval-2014, SemEval-2015, and SemEval-2016 restaurant-domain datasets, respectively, in comparison with the other methods.

The techniques we presented can be used regardless of the language of the dataset and have a twofold advantage: they balance the dataset and increase its diversity, which is especially valuable for low-resource language data. However, the quality of the generated data depends on the paraphrasing and machine translation models, and generating data via backtranslation is more time-consuming. We only considered exact matches when filtering the back-translated data; in the future, we plan to evaluate the generated samples for grammatical correctness and sentiment preservation. In this paper, we generated synthetic samples for all the minority class samples; one future direction is choosing the subset of minority class samples for paraphrasing that yields the highest accuracy. It will also be interesting to investigate how the choice of intermediate languages for backtranslation affects performance. A hybrid class-balancing strategy, which first uses paraphrased synthetic data to reduce high imbalance before employing a modified loss function, may be beneficial for extremely imbalanced datasets.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137–1155.
  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
  • Connor, S., Khoshgoftaar, T. M., & Borko, F. (2021). Text data augmentation for deep learning. Journal of Big Data, 8(1), 1–34.
  • Corbeil, J.-P., & Ghadivel, H. A. (2020). Bet: A backtranslation approach for easy data augmentation in transformer-based paraphrase identification context. arXiv preprint arXiv:2009.12452.
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S., Grave, E., Auli, M., & Joulin, A. (2021). Beyond english-centric multilingual machine translation. The Journal of Machine Learning Research, 22(1), 4839–4886.
  • Fan, F., Feng, Y., & Zhao, D. (2018). Multi-grained attention network for aspect-level sentiment classification. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 3433–3442). Association for Computational Linguistics.
  • Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., & Hovy, E. (2021). A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075.
  • Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42–47.
  • Ghiassi, M., & Lee, S. (2018). A domain transferable lexicon set for twitter sentiment analysis using a supervised machine learning approach. Expert Systems with Applications, 106, 197–216. https://doi.org/10.1016/j.eswa.2018.04.006
  • Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-smote: A new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing (pp. 878–887). Springer.
  • Johnson, J. M., & Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6(1), 27. https://doi.org/10.1186/s40537-019-0192-5
  • Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221–232. https://doi.org/10.1007/s13748-016-0094-0
  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988). IEEE.
  • Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of Natural Language Processing, 2(2010), 627–666.
  • Ma, D., Li, S., Zhang, X., & Wang, H. (2017). Interactive attention networks for aspect-level sentiment classification. arXiv preprint arXiv:1709.00893.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119). Curran Associates Inc.
  • Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2), 1–135.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics.
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Pontiki, M., Galanis, D., Papageorgiou, H., Androutsopoulos, I., Manandhar, S., AL-Smadi, M., Al-Ayyoub, M., Zhao, Y., Qin, B., De Clercq, O., Hoste, V., Apidianaki, M., Tannier, X., Loukachevitch, N., Kotelnikov, E., Bel, N., Jiménez-Zafra, S. M., & Eryiğit, G. (2016). SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016) (pp. 19–30). Association for Computational Linguistics.
  • Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., & Androutsopoulos, I. (2015). SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015) (pp. 486–495). Association for Computational Linguistics.
  • Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., & Manandhar, S. (2014). SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014) (pp. 27–35). Association for Computational Linguistics.
  • Pouyanfar, S., Tao, Y., Mohan, A., Tian, H., Kaseb, A. S., Gauen, K., Dailey, R., Aghajanzadeh, S., Lu, Y.-H., Chen, S.-C., & Shyu, M. L. (2018). Dynamic sampling in convolutional neural networks for imbalanced data classification. In 2018 IEEE conference on multimedia information processing and retrieval (MIPR) (pp. 112–117). IEEE.
  • Schouten, K., & Frasincar, F. (2015). Survey on aspect-level sentiment analysis. IEEE Transactions on Knowledge and Data Engineering, 28(3), 813–830. https://doi.org/10.1109/TKDE.2015.2485209
  • Sennrich, R., Haddow, B., & Birch, A. (2016). Improving neural machine translation models with monolingual data. In Proceedings of the 54th annual meeting of the association for computational linguistics (Volume 1: Long papers) (pp. 86–96). Association for Computational Linguistics.
  • Song, Y., Wang, J., Jiang, T., Liu, Z., & Rao, Y. (2019). Attentional encoder network for targeted sentiment classification. arXiv preprint arXiv:1902.09314.
  • Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112). Neural Information Processing Systems Foundation, Inc. (NeurIPS).
  • Tang, D., Qin, B., Feng, X., & Liu, T. (2015). Effective lstms for target-dependent sentiment classification. arXiv preprint arXiv:1512.01100.
  • Tang, F., Fu, L., Yao, B., & Xu, W. (2019). Aspect based fine-grained sentiment analysis for online reviews. Information Sciences, 488, 190–204. https://doi.org/10.1016/j.ins.2019.02.064
  • Tripathy, A., Anand, A., & Rath, S. K. (2017). Document-level sentiment classification using hybrid machine learning approach. Knowledge and Information Systems, 53(3), 805–831. https://doi.org/10.1007/s10115-017-1055-z
  • Wang, S., Liu, W., Wu, J., Cao, L., Meng, Q., & Kennedy, P. J. (2016). Training deep neural networks on imbalanced data sets. In 2016 international joint conference on neural networks (IJCNN) (pp. 4368–4374). IEEE.
  • Wang, Y., Huang, M., Zhu, X., & Zhao, L. (2016). Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 606–615). Association for Computational Linguistics.
  • Wei, J., & Zou, K. (2019). Eda: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 6382–6388). Association for Computational Linguistics.
  • Zeng, B., Yang, H., Xu, R., Zhou, W., & Han, X. (2019). Lcf: A local context focus mechanism for aspect-based sentiment classification. Applied Sciences, 9(16), 3389. https://doi.org/10.3390/app9163389
  • Zhang, J., Zhao, Y., Saleh, M., & Liu, P. (2020). Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International conference on machine learning (pp. 11328–11339). PMLR.