Research Article

Multi-branch feature learning based speech emotion recognition using SCAR-NET

Article: 2189217 | Received 25 Dec 2022, Accepted 04 Mar 2023, Published online: 27 Apr 2023

Abstract

Speech emotion recognition (SER) is an active research area in affective computing. Recognizing emotions from speech signals helps to assess human behaviour, which has promising applications in human-computer interaction. The performance of deep learning-based SER methods relies heavily on feature learning. In this paper, we propose SCAR-NET, an improved convolutional neural network, to extract emotional features from speech signals and perform classification. This work includes two main parts: first, we extract spectral, temporal, and spectral-temporal correlation features through three parallel paths; then, split-convolve-aggregate residual blocks are designed for multi-branch deep feature learning. The features are refined by global average pooling (GAP) and passed through a softmax classifier to generate predictions for different emotions. We also conduct a series of experiments to evaluate the robustness and effectiveness of SCAR-NET, which achieves 96.45%, 83.13%, and 89.93% accuracy on the speech emotion datasets EMO-DB, SAVEE, and RAVDESS, respectively. These results show that SCAR-NET outperforms existing methods.

1. Introduction

Emotions are vehicles for personal feelings and feedback. The understanding of emotions plays a crucial role in human-human interaction and is likewise an effective means to improve human-computer interaction. Therefore, the area of affective computing has gained great attention and made significant developments in the last decade (Zeng et al., Citation2009). The ultimate goal of affective computing is the automatic understanding and recognition of human emotions, and there are still many difficulties in achieving this. In fact, people often express their emotions through various mediums, such as body movements, facial expressions, speech, and physiological changes. In human-computer interaction, computers can capture these key mediums to recognise the emotional state of the users. Noticing this, researchers applied techniques from psychology, signal processing, and deep learning to affective computing, allowing computers to learn mediums such as body movements (Noroozi et al., Citation2021; Pławiak et al., Citation2016), facial expressions (Kulkarni et al., Citation2018; Shojaeilangari et al., Citation2016), speech signals (Kamińska & Pelikant, Citation2012; Kamińska et al., Citation2017; Noroozi et al., Citation2017), and physiological signals (Greco et al., Citation2016; Jenke et al., Citation2014), ultimately enabling emotion recognition. In earlier studies, facial expressions received a lot of attention because of their rich emotional expressiveness: about 95% of the literature is based on facial expressions for affective computing (De Gelder, Citation2009). However, because speech signals are easy to acquire and carry little private information, affective computing based on speech signals is becoming the mainstream of current research.

Affective computing based on speech signals is also known as speech emotion recognition (SER). Computers with SER systems can grasp the emotional state of the users in real time and make targeted responses. This can facilitate the development of many fields such as education, medical care, and services, which is of great social importance. SER applies speech signal processing techniques to extract low-level descriptors (LLDs), such as frequency, amplitude, and pitch, from the collected speech signals, and then finds the mapping relationship between these features and human emotions. There are already various SER methods, including traditional machine learning methods and deep learning methods. However, the acoustic features are largely influenced by the accent and intonation of different speakers, which makes identifying accurate emotional features a challenging task for these methods. As a result, the existing SER methods have limited accuracy in predicting emotions and tend to confound similar emotions.

We hope SER methods can be free from the influence of individual factors and learn emotional features with generalizability. To this end, we apply asymmetric convolution and multi-branch convolution to propose a novel method for SER called SCAR-NET. The main contributions of our method can be summarised as follows:

  • We provide a block for multi-dimensional feature extraction called parallel path, which is an improvement of the asymmetric convolution block (Szegedy et al., Citation2016). The parallel paths extract spectral features, temporal features, and spectral-temporal correlation features through three independent convolutional neural networks (CNNs) with different-sized convolutional kernels.

  • We design a multi-branch convolution structure to replace the single-channel one in the residual block (He et al., Citation2016), and propose a split-convolve-aggregate residual (SCAR) block. The SCAR blocks implement deep feature learning and help to obtain more general features.

  • We select only 40 Mel frequency cepstral coefficients (MFCCs) as feature input to lessen the computational load of the network, and substitute the fully connected layers of the CNN with GAP to reduce the overall parameter quantity.

We evaluate the proposed method on the speech emotion datasets EMO-DB (Burkhardt et al., Citation2005), SAVEE (W. Wang, Citation2010), and RAVDESS (Livingstone & Russo, Citation2018), and we obtain 96.45%, 83.13%, and 89.93% classification accuracy. Our method provides better recognition performances than state-of-the-art methods, which indicates its application value.

2. Literature review

Emotions have the specificity of crossing language, culture, and ethnicity, giving SER great social significance. The earliest SER research emerged in the mid-1980s. Since then, a large number of researchers have devoted themselves to this field, and many effective solutions have been proposed. The original SER methods mainly relied on acoustic statistical features. The researchers manually extracted low-level features from the speech signal, such as zero-crossing rate, energy, formants, and MFCCs to implement emotion classification. Later researchers brought traditional machine learning algorithms to this field, among which the most widely used are hidden Markov models (HMMs) (Nwe et al., Citation2003), Gaussian mixture models (GMMs) (Daniel et al., Citation2006), support vector machines (SVMs) (Chavhan et al., Citation2010), and so on.

With the development of SER, traditional machine learning methods cannot meet the demands of the continuously growing amounts of data and computation (X. Wang, Yang, et al., Citation2022). Deep learning-based methods have gradually replaced traditional machine learning methods (Ren, Meng, et al., Citation2020). Deep neural networks (DNNs) can extract complex features from speech signals through the process of learning, thereby increasing the accuracy of recognition. With the unique advantage of ignoring task-irrelevant information (Gu et al., Citation2018) and processing complex unstructured signals, CNNs have made great achievements in the area of SER (Kwon, Citation2019). Khamparia et al. (Citation2019) utilised CNNs to learn high-level hidden features from spectrograms to recognise emotions. Spectrograms are two-dimensional plots of a speech signal's frequency content with respect to time. Therefore, it is effective to obtain the features in the spectrograms in a similar way to how 2D-CNNs process images. Yenigalla et al. (Citation2018) also took spectrograms as input and added multiple parallel paths into CNNs, which helped to achieve a better recognition rate.

The application of CNNs increased research interest in SER. Some researchers further noticed that speech signals are time-series signals, which means that speech signals have a strong context dependency. Learning this dependency is of great help in identifying emotions. So recurrent neural networks (RNNs) began to be applied to learn context information in speech signals. Many researchers combined CNNs and RNNs into SER networks. Chen et al. (Citation2018) proposed a CNN-RNN based on a 3D attention mechanism, which can better preserve contextual emotion-related information. Ma et al. (Citation2018) applied a similar structure to make the network capable of handling longer speech signals, where the CNN was used to extract spatial features while the RNN was used to handle the temporal dependencies. Long short-term memory (LSTM), an improved RNN that solves the long-term dependency problem, was also widely used (X. Wang, Ren, et al., Citation2022). Trigeorgis et al. (Citation2016) fused CNNs with LSTM and proposed an end-to-end SER method that automatically learns the best feature description for speech signals directly from the raw time series rather than from handcrafted features. Kwon (Citation2020) utilised hierarchical blocks of the convolutional LSTM to extract spatio-temporal features in the hierarchical correlated form of speech signals.

From the literature, we recognise the limitations of the various existing SER methods. Traditional machine learning algorithms require manual feature extraction, which is difficult to adapt to tasks with large amounts of data. CNN-based deep learning methods are weak in processing time-series signals such as speech signals, and need to be combined with RNNs to make up for this defect (Ren, Dong, et al., Citation2020). However, the ability of CNN-RNNs to capture emotional features is still less than ideal. In addition, advanced SER methods often need to obtain complex acoustic feature sets from speech signals as input, which makes the computational load of the network and the time cost of training high. For example, the INTERSPEECH 2009 Emotion Challenge (IS09) feature set, a commonly used feature set in SER (Mirsamadi et al., Citation2017), contains 384 features obtained by applying statistical functions to LLDs.

In response to the above limitations, we propose an improved method for SER named SCAR-NET. As an enhanced CNN, SCAR-NET excels at capturing spatial features. The design of the parallel paths improves its ability to extract temporal features. The SCAR blocks enable the network to learn more sparse and diverse features. The use of GAP reduces the parameter scale of the network. Moreover, the input of SCAR-NET is only 40 MFCCs, which lowers the computational load and training cost. The experimental evaluations corroborate that our method achieves better results than state-of-the-art SER methods.

3. Proposed method

In this section, we explain the proposed method in detail. We propose a SER network named SCAR-NET, which consists of four main parts: feature input, parallel paths, SCAR module, and classifier. Figure 1 illustrates the overall structure of SCAR-NET, and further details of each part are described below.

Figure 1. The overall structure of SCAR-NET.


3.1. Feature input

The initial input of the network is waveform audio files. However, the task-irrelevant information contained in speech signals will greatly increase the computational complexity of the network. To address this issue, feature selection is required. We perform a series of signal processing operations as follows to extract the desired features from speech signals.

  1. Segmentation and Normalization: Divide the speech signals into equal-length segments of 3 s. For segments with insufficient length, zero padding is used to make up the difference. Then normalise them between −1 and 1.

  2. Frame Blocking and Windowing: Split the segments into 64 ms frames, with 16 ms overlaps to ensure their continuity. Then use a Hamming window to avoid the risk of spectral leakage.

  3. Fast Fourier Transform: Perform a 1024-point fast Fourier transform (FFT) on each frame. Convert it from a time-domain signal to a frequency-domain signal for the subsequent frequency analysis.

  4. Mel Scale Filter Bank: Use a Mel scale filter bank in the range of 40 Hz to 7600 Hz to remove the redundant part of the frequency-domain signals.

  5. Inverse Discrete Cosine Transform: Perform an inverse discrete cosine transform (DCT) on the filtered signals, and then the MFCCs of each frame are calculated.

The first 40 MFCCs are selected as features, which are input to the subsequent parts of the network for training.
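
To make the preprocessing above concrete, the following is a minimal sketch of steps 1–5; the use of librosa, the 16 kHz target sampling rate, and the helper name extract_mfcc are assumptions for illustration, not necessarily the authors' toolchain.

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=16000, segment_s=3.0):
    """Sketch of the feature-input pipeline: 3 s segments, 64 ms Hamming frames
    with 16 ms overlap, 1024-point FFT, 40 Hz-7600 Hz Mel filter bank, 40 MFCCs."""
    y, sr = librosa.load(path, sr=sr)
    # 1. Segmentation and normalisation: fix the length to 3 s (zero-pad short clips),
    #    then scale the waveform to [-1, 1].
    target_len = int(segment_s * sr)
    y = librosa.util.fix_length(y, size=target_len)
    y = y / (np.max(np.abs(y)) + 1e-8)
    # 2-5. Frame blocking, windowing, FFT, Mel filtering, and cepstral coefficients.
    win = int(0.064 * sr)            # 64 ms window (1024 samples at 16 kHz)
    hop = win - int(0.016 * sr)      # hop = frame length minus the 16 ms overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=1024,
                                win_length=win, hop_length=hop,
                                window='hamming', fmin=40, fmax=7600)
    return mfcc.T                    # (frames, 40) feature matrix fed to the network
```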

3.2. Parallel paths

The second part of the network extracts high-level feature information from the MFCCs. Figure 2 shows the MFCC spectrum obtained from the first part. The spectrum contains information in three dimensions: spectral, temporal, and spectral-temporal correlation. For this multi-dimensional feature input, we process each dimension separately and then integrate the results. Therefore, three parallel paths are designed to achieve multi-dimensional feature extraction. The second box in Figure 1 illustrates the structure of the parallel paths.

Figure 2. The MFCC spectrum and the receptive fields of the parallel paths.


The three paths are very similar in structure. They all perform feature extraction through convolution, batch normalisation, an activation function, and average pooling; the only difference is the size of the convolution kernels. Considering that the spectral dimension and the temporal dimension are equally important for SER, a square convolution kernel would be the natural choice. Asymmetric convolution blocks use n×1 and 1×n kernels as a replacement for an n×n one to conserve computational effort. We therefore draw on the asymmetric convolution block and use convolution kernels of sizes 9×1 and 1×9 to extract spectral features and temporal features. However, the spectral-temporal correlation dimension is just as critical for SER, so we add a 3×3 convolution kernel to the asymmetric convolution block to handle these features. This design brings two benefits to the network:

  • Through the three independent parallel paths, multi-dimensional features can be effectively extracted. Moreover, the design of asymmetric convolution reduces the size of parameters on each path, thereby curtailing the risk of overfitting and improving the final classification accuracy of the network.

  • The parallel paths effectively decrease the parameter quantity. In Figure 2, the receptive fields of the 9×1, 1×9, and 3×3 convolution kernels are represented by orange, blue, and yellow rectangles. After combining the three, the actual receptive field is equivalent to a 9×9 one, which is represented by a red dotted rectangle. The receptive field is of equal length and width to ensure that the network is not biased towards a single dimension. There is a positive relationship between classification accuracy and the size of receptive fields (Araujo et al., Citation2019). Therefore, compared with a single path using a 9×9 convolution kernel, the parallel paths can cut down the number of parameters in this part of the network while maintaining classification accuracy. Theoretically, the parameter quantity can be reduced to (9×1 + 1×9 + 3×3)/(9×9) = 1/3 of the original size.

After the feature extraction with the parallel paths, the obtained high-level feature information of three dimensions is concatenated and then fed into the third part of the network.
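
A minimal Keras sketch of the parallel paths is given below. The filter count per path, the ReLU activation, the 2×2 pooling, and the unspecified number of time frames are assumptions used only to illustrate the 9×1, 1×9, and 3×3 convolutions followed by concatenation; which axis corresponds to the spectral dimension depends on the layout of the MFCC input.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_path(x, kernel_size, filters=16):
    """One path: convolution -> batch normalisation -> activation -> average pooling."""
    x = layers.Conv2D(filters, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    return layers.AveragePooling2D(pool_size=(2, 2))(x)

def parallel_paths(mfcc_input):
    """Spectral (9x1), temporal (1x9), and spectral-temporal (3x3) feature extraction."""
    spectral = conv_path(mfcc_input, (9, 1))
    temporal = conv_path(mfcc_input, (1, 9))
    correlation = conv_path(mfcc_input, (3, 3))
    # Concatenate the three feature maps along the channel axis.
    return layers.Concatenate(axis=-1)([spectral, temporal, correlation])

# Input: (time frames, 40 MFCCs, 1 channel); the frame count depends on the segment length.
inputs = tf.keras.Input(shape=(None, 40, 1))
features = parallel_paths(inputs)
```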

3.3. SCAR module

The SCAR module is the main part of the network. The third box in Figure 1 shows its structure. The module begins with a transition layer that includes convolution, batch normalisation, an activation function, and average pooling. The transition layer processes the high-level feature information from the parallel paths into a suitable input for the following blocks. This information then undergoes two consecutive SCAR blocks for feature learning.

In the process of feature learning, there is a problem called degradation: after a network reaches a certain depth, continuing to deepen it causes the training loss to increase rather than decrease. The left side of Figure 3 shows a residual block, which is designed to solve the degradation problem. For the same input x, a common CNN needs to fit the output H(x), while a residual block only needs to fit the residual F(x) = H(x) − x. When the network reaches a certain depth, there is always some layer whose output is close to the optimal solution. As the next layer is expected to be even closer to that solution, it needs to update its weights. Since the input x is already close to the desired output H(x), the residual block only needs to update a small part of its weights to achieve the fitting. Consequently, the degradation problem of deep networks is avoided.

Figure 3. The residual block and the SCAR block.


The SCAR block is an improvement on the residual block. The right side of Figure 3 illustrates its specification. We replace the single-path convolution of the residual block with 8 parallel, identical topological structures, so that the entire block forms a split-convolve-aggregate (SCA) architecture. The high-level feature information is split into 8 feature subspaces after being input to the SCAR block, and feature learning is then realised through a series of convolutional layers that gradually increase in depth. After that, the outputs are aggregated and delivered onwards. The SCAR block inherits the residual block's ability to avoid network degradation and also benefits from the SCA structure. When splitting, the original feature space is divided into 8 subspaces; each path learns the features in its own subspace, making the learned features relatively sparse and alleviating the problem of overfitting. When aggregating, the features of each subspace are merged, enhancing the width of the network. This allows the network to obtain feature information from different subspaces and comprehend features from multiple perspectives. As a result, the final classification accuracy is improved.
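
The sketch below, continuing the Keras example above, shows one possible reading of the split-convolve-aggregate residual block: the channel dimension is split into 8 subspaces, each branch is convolved independently, and the aggregated result is added back to the input through the shortcut connection. The branch depth, kernel sizes, and filter counts are assumptions; Figure 3 gives the authors' exact specification.

```python
from tensorflow.keras import layers

def scar_block(x, branches=8):
    """Illustrative split-convolve-aggregate residual block (see Figure 3 for the exact layout)."""
    in_channels = int(x.shape[-1])
    sub = in_channels // branches            # width of each feature subspace
    outputs = []
    for i in range(branches):
        # Split: take the i-th channel subspace.
        branch = layers.Lambda(lambda t, i=i: t[..., i * sub:(i + 1) * sub])(x)
        # Convolve: each branch learns features within its own subspace.
        branch = layers.Conv2D(sub, (3, 3), padding='same')(branch)
        branch = layers.BatchNormalization()(branch)
        branch = layers.Activation('relu')(branch)
        outputs.append(branch)
    # Aggregate: merge the subspaces and add the shortcut connection.
    aggregated = layers.Concatenate(axis=-1)(outputs)
    aggregated = layers.Conv2D(in_channels, (1, 1), padding='same')(aggregated)
    aggregated = layers.BatchNormalization()(aggregated)
    return layers.Activation('relu')(layers.Add()([x, aggregated]))

# Two consecutive SCAR blocks, as in the proposed network; the transition layer
# described above is omitted here for brevity.
scar_features = scar_block(scar_block(features))
```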

At the tail of the SCAR module, we use GAP as a replacement for the fully connected layers of a typical CNN. Figure 4 explains how GAP works (Lin et al., Citation2013). It calculates the mean of each feature map output by the final layer of the network and outputs one data node per map. In the proposed network, the last layer outputs 64 feature maps, so GAP outputs a 64-dimensional feature vector and sends it into the classifier for the classification calculation. Compared to fully connected layers, GAP itself has no parameters, which reduces the parameter count and effectively avoids overfitting.

Figure 4. The working principle of GAP.


The last part of the network is the classifier. It maps the nonlinear input space into linearly separable subspaces, each corresponding to a specific emotion class. As shown in the last box in Figure 1, the classifier includes a Dropout layer and a softmax layer. The former ignores a portion of neurons with a certain probability to reduce overfitting, and the latter converts the network output into probabilistic form.
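
Continuing the same Keras sketch, the classification head described above could look as follows; the class count of 7 matches EMO-DB and SAVEE (8 would be used for RAVDESS), and the final feature-map count here follows the sketch rather than the 64 maps of the full network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def classifier_head(x, num_classes=7, dropout_rate=0.3):
    """GAP -> Dropout -> softmax: maps the final feature maps to emotion probabilities."""
    x = layers.GlobalAveragePooling2D()(x)   # one mean value per feature map, no trainable weights
    x = layers.Dropout(dropout_rate)(x)      # randomly ignores neurons to reduce overfitting
    return layers.Dense(num_classes, activation='softmax')(x)

outputs = classifier_head(scar_features, num_classes=7)
model = tf.keras.Model(inputs, outputs)
```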

4. Experimental evaluations and analysis

In this section, we train and evaluate the proposed network on public speech emotion datasets, and the experimental results will be compared with recent papers.

4.1. Datasets

To demonstrate the robustness and effectiveness of the proposed network, we select three public speech emotion datasets for evaluation: EMO-DB, SAVEE, and RAVDESS.

EMO-DB. The Berlin Emotional Speech Database is a commonly used dataset in the SER domain. The recordings were made by 10 professional actors and actresses (5 male and 5 female) in a professional recording studio. The corpus consists of 5 long and 5 short sentences selected under the principles of neutral semantics and everyday spoken language. The actors were asked to express 7 emotions in these sentences: anger, fear, happiness, neutral, sadness, disgust, and boredom. After auditory discrimination experiments, the dataset retains 535 audio files of 3–5 seconds in length, recorded at a 16 kHz sampling rate.

SAVEE. Surrey Audio-Visual Expressed Emotion is a dataset consisting of recordings of 4 male actors from the University of Surrey. The dataset contains 480 audio files expressing 7 emotions (surprise, anger, happiness, neutral, sadness, disgust, and fear). The audio sampling rate is 44.1 kHz, and the corpus consists of British English sentences selected from the standard TIMIT corpus.

RAVDESS. The Ryerson Audio-Visual Database of Emotional Speech and Song is a newly introduced emotion corpus that is widely used in the SER methods. It contains 24 professional actors and actresses (12 males and 12 females) expressing 8 emotions in a neutral North American accent, including angry, calm, disgust, fearful, happy, neutral, sad, and surprise. Each emotion, except neutral, is produced at both normal and strong intensities. The dataset includes 1440 audio files with a sample rate of 48 kHz and an average length of 3.5 s.

Table 1 illustrates the emotion classes of the three datasets and the proportion of each emotion.

Table 1. The detailed descriptions of EMO-DB, SAVEE, and RAVDESS datasets.

4.2. Experimental setup

Experimental Environment. The proposed network is implemented in a Python 3.7 virtual environment, using version 2.3.0 of the TensorFlow deep learning framework. Both training and evaluation are run on a single Nvidia GeForce RTX 2080 Ti GPU.

Hyperparameters. We train the network model for 300 epochs with a batch size of 32. The initial learning rate is 10^(−4). According to the decay strategy, starting from epoch 50 the learning rate decays by a factor of e^(−0.3) every 20 epochs. The Dropout ratio in the classifier is set to 0.3.
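
One way to read this training setup is sketched below with a Keras callback, continuing the model built in the Section 3 sketches. The optimiser, the loss function, and the x_train/y_train arrays are assumptions standing in for the authors' actual training code.

```python
import math
import tensorflow as tf

def lr_schedule(epoch, lr):
    """Initial learning rate 1e-4; from epoch 50 onward, multiply by e^(-0.3) every 20 epochs."""
    if epoch >= 50 and (epoch - 50) % 20 == 0:
        return lr * math.exp(-0.3)
    return lr

# `model` is the Keras model assembled above; x_train / y_train stand in for the
# real MFCC features and integer emotion labels of one training fold.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=300, batch_size=32,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```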

Evaluation Strategy. Our experimental evaluations apply a leave-one-speaker-out (LOSO) (Schuller et al., Citation2009) cross-validation strategy. LOSO is a special K-fold cross-validation strategy, where K is the number of speakers in the dataset. LOSO divides the dataset into K groups so that each group corresponds to one speaker. Each time, we select the audio data of one speaker as the test set, while the audio data of the remaining K−1 speakers are used as the training set. After repeating the experiment K times, we take the mean value as the final experimental result. LOSO makes full use of the data, so the experimental results better represent the classification accuracy of the network. Among the three datasets, EMO-DB has 10 speakers and SAVEE has 4, so their experimental evaluations are based on 10-fold and 4-fold cross-validation, respectively. RAVDESS, with 24 speakers, is not suitable for LOSO; we divide its speakers into 8 groups and perform 8-fold cross-validation.
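
A sketch of this speaker-wise splitting with scikit-learn is shown below; the array names are hypothetical placeholders for the extracted features, labels, and per-utterance speaker IDs.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GroupKFold

def speaker_folds(X, y, speakers, n_groups=None):
    """Yield (train_idx, test_idx) pairs: strict LOSO when n_groups is None,
    otherwise the speakers are grouped into n_groups speaker-disjoint folds."""
    splitter = LeaveOneGroupOut() if n_groups is None else GroupKFold(n_splits=n_groups)
    yield from splitter.split(X, y, groups=speakers)

# EMO-DB (10 speakers) and SAVEE (4 speakers): strict LOSO.
# RAVDESS: 24 speakers folded into 8 speaker groups.
# fold_acc = [train_and_evaluate(tr, te) for tr, te in speaker_folds(X, y, speakers, n_groups=8)]
# final_accuracy = np.mean(fold_acc)
```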

Evaluation Metrics. As the number of audio files for each emotion class in the datasets is not balanced, a single evaluation metric reflects the performance of the network only partially. We therefore choose a variety of evaluation metrics, including precision, recall, and F1_score, as well as their macro and weighted averages. Accuracy and confusion matrices are also used as evaluation metrics.
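
These metrics map directly onto standard scikit-learn utilities, as the sketch below illustrates; y_true and y_pred are hypothetical stand-ins for the actual and predicted labels of one evaluation fold.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels from one evaluation fold (placeholders for the real model outputs).
y_true = ['anger', 'anger', 'happiness', 'sadness', 'neutral', 'sadness']
y_pred = ['anger', 'happiness', 'happiness', 'sadness', 'neutral', 'sadness']

# Per-class precision, recall, and F1_score, plus their macro and weighted averages.
print(classification_report(y_true, y_pred, digits=4))
# Overall accuracy.
print(accuracy_score(y_true, y_pred))
# Row-normalised confusion matrix: the diagonal gives the recall of each emotion class.
print(confusion_matrix(y_true, y_pred, normalize='true'))
```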

4.3. Experimental evaluations

Experimental evaluation is an important part of the deep learning workflow and a key step in verifying the effectiveness of the network. We evaluate the practical performance of the proposed network on EMO-DB, SAVEE, and RAVDESS using multiple evaluation metrics, demonstrating that our method provides better classification accuracy. The evaluation results are presented in the form of a classification report and three confusion matrices.

Table 2 is the classification report generated after evaluating SCAR-NET on the datasets. It lists the precision, recall, and F1_score of the network model for each emotion class, as well as the macro average, weighted average, and accuracy for the overall evaluation of each dataset. This report gives a comprehensive picture of the recognition strength for each emotion class and the overall performance of the model.

Table 2. The classification report of SCAR-NET evaluated on datasets.

In addition to the classification report, we also compare the actual emotion classification labels provided by the datasets with the predicted labels from the network model and draw the confusion matrix. The confusion matrix shows the proportion of audio data that is correctly or incorrectly classified by the model. The horizontal axis represents the predicted labels, and the vertical axis represents the actual labels. The intersection of each emotion classification with others represents the confusion ratio between them. The diagonal values are the recall for each emotion classification.

Figures 5–7 show the confusion matrices of SCAR-NET for EMO-DB, SAVEE, and RAVDESS. It can be seen that our method achieves excellent evaluation results on EMO-DB: except for happiness, the classification recall is above 95% for all emotions. The evaluation results on SAVEE and RAVDESS are lower than those on EMO-DB, but they are still better than the SER methods proposed in recent years, as explained in detail in the analysis of the experimental results.

Figure 5. The confusion matrix of SCAR-NET evaluated on EMO-DB.


Figure 6. The confusion matrix of SCAR-NET evaluated on SAVEE.


Figure 7. The confusion matrix of SCAR-NET evaluated on RAVDESS.


4.4. Experimental results and analysis

For the purpose of determining the optimal structure of the network, we conduct a series of ablation experiments and comparative experiments on EMO-DB. After discussing and analysing the experimental results, we also compare our method with the state-of-the-art SER methods to demonstrate its effective improvement in evaluation results.

Effect of Parallel Paths. We evaluate the improvement brought by the parallel paths through an ablation experiment. Table 3 shows the experimental results. After replacing the single path (9×9 kernel) having the same-sized receptive field with the parallel paths (9×1, 1×9, and 3×3 kernels), the average precision, recall, and F1_score of the experimental evaluations are improved. Furthermore, the number of parameters in this part is reduced from 2624 to 960, and the final classification accuracy is improved by 1.31%, which proves that multi-dimensional feature extraction using the parallel paths can effectively improve classification accuracy while reducing the parameter quantity. This is because the parallel paths decrease the number of parameters on each path, thereby suppressing overfitting.

Table 3. The effect of parallel paths on experimental evaluation.

Impact of SCA Structures. To determine the impact of the SCA structure on the network, controlled experiments are required. First, we set a network with 1 branch as the control group, representing a network without the SCA structure. Then we set networks with 2, 4, 6, 8, and 10 branches as comparison groups. The experimental results are shown in Table 4. Compared with the control group, the networks with the SCA structure obtain higher evaluation metrics, which proves that the SCA structure plays a positive role in classification accuracy. As the number of branches increases to 8, the evaluation metrics gradually improve to their highest values. When the number continues to increase, not only does training take more time, but the evaluation metrics also drop. This may be caused by overfitting, triggered by the large parameter quantity of excessive branches. Consequently, we apply the SCA structure with 8 branches in the network to ensure optimal classification accuracy.

Table 4. The impact of SCA structure with diverse branch numbers on experimental evaluation.

Impact of SCAR Blocks. There is a direct relation between the number of SCAR blocks and the depth of the network. Properly deepening the network is beneficial to feature learning, but excessive depth causes a decrease in classification accuracy. Accordingly, we design controlled experiments to evaluate networks containing 1 to 4 SCAR blocks. Table 5 shows the experimental results. It can be seen that the network containing 2 SCAR blocks has the highest evaluation metrics; afterward, the classification accuracy goes down as the number of blocks increases. This indicates that 2 SCAR blocks bring the network to an appropriate depth, and further increasing the depth results in overfitting and performance degradation. We therefore use 2 SCAR blocks in the network for feature learning.

Table 5. The impact of diverse SCAR blocks numbers on experimental evaluation.

Comparison with Recent Methods. After determining the optimal structure of the network, we choose 6 recent papers on each of the three datasets, summarise their methods, and compare them with ours. The comparative results can be found in Table 6. Compared with the state-of-the-art methods, our method improves the classification accuracy by 1.45%, 1.13%, and 2.09% on EMO-DB, SAVEE, and RAVDESS, respectively. Compared to EMO-DB (535 files) and SAVEE (480 files), RAVDESS (1440 files) has a larger data size and obtains a greater improvement, which may indicate that larger datasets are more suitable for our deep network. The evaluations above prove the robustness and effectiveness of the proposed network. Furthermore, the optimal structure determined on EMO-DB also achieves ideal results on SAVEE and RAVDESS, proving its excellent generalisability.

Table 6. Comparative analysis of SCAR-NET with recent SER methods.

5. Conclusion and future direction

In this paper, we propose a more effective SER network named SCAR-NET in view of the defects of current methods. SCAR-NET computes MFCCs from the input audio and uses parallel paths with three convolution kernels of different sizes to extract multi-dimensional features. Feature learning is achieved by residual blocks with the split-convolve-aggregate structure. In addition, we discuss the impact of the number of SCA branches and SCAR blocks on network performance through experiments and determine the optimal structure of the network. Based on the above work, our method achieves 96.45%, 83.13%, and 89.93% classification accuracy on the speech emotion datasets EMO-DB, SAVEE, and RAVDESS, proving its robustness and effectiveness. The limitation of our method is that the SCAR module improves network performance by using a number of repetitive structures, which raises the overall parameter quantity of the network.

In the future, we will further tune and improve SCAR-NET in pursuit of better performance and a lower parameter quantity, and evaluate its capability on more realistic corpora. We will also try to apply SCAR-NET to depression recognition. Depression, as an affective disorder, changes the emotional response of patients, so extracting emotional features from speech with SER methods can be of great help in recognising depression.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Basic Public Welfare Research Project of Zhejiang Province [grant number LGG22F020014] and the National Natural Science Foundation of China [grant number 62072410].

References

  • Aftab, A., Morsali, A., Ghaemmaghami, S., & Champagne, B. (2022). Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition. In ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6912–6916). IEEE.
  • Alaparthi, V. S., Pasam, T. R., Inagandla, D. A., Prakash, J., & Singh, P. K. (2022). ScSer: Supervised contrastive learning for speech emotion recognition using transformers. In 2022 15th international conference on human system interaction (HSI) (pp. 1–7). IEEE.
  • Anvarjon, T., & Kwon, S. (2020). Deep-net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors, 20(18), 5212. https://doi.org/10.3390/s20185212
  • Araujo, A., Norris, W., & Sim, J. (2019). Computing receptive fields of convolutional neural networks. Distill, 4(11), e21. https://doi.org/10.23915/distill
  • Assunção, G., Menezes, P., & Perdigão, F. (2020). Speaker awareness for speech emotion recognition. International Journal of Online and Biomedical Engineering, 16(4), 15–22. https://doi.org/10.3991/ijoe.v16i04.11870
  • Avots, E., Sapiński, T., Bachmann, M., & Kamińska, D. (2019). Audiovisual emotion recognition in wild. Machine Vision and Applications, 30(5), 975–985. https://doi.org/10.1007/s00138-018-0960-9
  • Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In Interspeech (Vol 5, pp. 1517–1520). ISCA.
  • Chavhan, Y., Dhore, M., & Yesaware, P. (2010). Speech emotion recognition using support vector machine. International Journal of Computer Applications, 1(20), 6–9. https://doi.org/10.5120/ijca
  • Chen, M., He, X., Yang, J., & Zhang, H. (2018). 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, 25(10), 1440–1444. https://doi.org/10.1109/LSP.97
  • Daniel, N., Kjell, E., & Kornel, L. (2006). Emotion recognition in spontaneous speech using GMMs. In Proceedings of the 9th isca international conference on spoken language processing. IEEE.
  • De Gelder, B. (2009). Why bodies? Twelve reasons for including bodily expressions in affective neuroscience. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), 3475–3484. https://doi.org/10.1098/rstb.2009.0190
  • Greco, A., Valenza, G., Citi, L., & Scilingo, E. P. (2016). Arousal and valence recognition of affective sounds based on electrodermal activity. IEEE Sensors Journal, 17(3), 716–725. https://doi.org/10.1109/JSEN.2016.2623677
  • Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., & Chen, T. (2018). Recent advances in convolutional neural networks. Pattern Recognition, 77(2018), 354–377. https://doi.org/10.1016/j.patcog.2017.10.013
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 770–778). IEEE.
  • Jenke, R., Peer, A., & Buss, M. (2014). Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing, 5(3), 327–339. https://doi.org/10.1109/TAFFC.2014.2339834
  • Kamińska, D., & Pelikant, A. (2012). Recognition of human emotion from a speech signal based on Plutchik's model. International Journal of Electronics and Telecommunications, 58(2), 165–170. https://doi.org/10.2478/v10177-012-0024-4
  • Kamińska, D., Sapiński, T., & Anbarjafari, G. (2017). Efficiency of chosen speech descriptors in relation to emotion recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2017(1), 1–9. https://doi.org/10.1186/s13636-016-0099-4
  • Khamparia, A., Gupta, D., Nguyen, N. G., Khanna, A., Pandey, B., & Tiwari, P. (2019). Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access, 7(2019), 7717–7727. https://doi.org/10.1109/ACCESS.2018.2888882
  • Kulkarni, K., Corneanu, C. A., Ofodile, I., Escalera, S., Baro, X., Hyniewska, S., Allik, J., & Anbarjafari, G. (2018). Automatic recognition of facial displays of unfelt emotions. IEEE Transactions on Affective Computing, 12(2), 377–390. https://doi.org/10.1109/TAFFC.2018.2874996
  • Kwon, S. (2019). A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20(1), 183. https://doi.org/10.3390/s20010183
  • Kwon, S. (2020). CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network. Mathematics, 8(12), 2133. https://doi.org/10.3390/math8122133
  • Kwon, S. (2021). Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network. International Journal of Intelligent Systems, 36(9), 5116–5135. https://doi.org/10.1002/int.v36.9
  • Kwon, S., Mustaqeem (2021). MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Systems with Applications, 167(2021), Article 114177. https://doi.org/10.1016/j.eswa.2020.114177
  • Li, G. M., Liu, N., & Zhang, J. A. (2022). Speech emotion recognition based on modified reliefF. Sensors, 22(21), 8152. https://doi.org/10.3390/s22218152
  • Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
  • Livingstone, S. R., & Russo, F. A. (2018). The ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PloS One, 13(5), e0196391. https://doi.org/10.1371/journal.pone.0196391
  • Luna-Jiménez, C., Kleinlein, R., Griol, D., Callejas, Z., Montero, J. M., & Fernández-Martínez, F. (2021). A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset. Applied Sciences, 12(1), 327. https://doi.org/10.3390/app12010327
  • Ma, X., Wu, Z., Jia, J., Xu, M., Meng, H., & Cai, L. (2018). Emotion recognition from variable-length speech segments using deep learning on spectrograms. In Interspeech (pp. 3683–3687). ISCA.
  • Mirsamadi, S., Barsoum, E., & Zhang, C. (2017). Automatic speech emotion recognition using recurrent neural networks with local attention. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2227–2231). IEEE.
  • Mustaqeem,  & Kwon, S. (2021). 1D-CNN: Speech emotion recognition system using a stacked network with dilated CNN features. CMC-Computers Materials & Continua, 67(3), 4039–4059. https://doi.org/10.32604/cmc.2021.015070
  • Nguyen, D., Sridharan, S., Nguyen, D. T., Denman, S., Tran, S. N., Zeng, R., & Fookes, C. (2020). Joint deep cross-domain transfer learning for emotion recognition. arXiv preprint arXiv:2003.11136.
  • Noroozi, F., Corneanu, C. A., & Anbarjafari, G. (2021). Survey on emotional body gesture recognition. IEEE Transactions on Affective Computing, 12(2), 505–523. https://doi.org/10.1109/TAFFC.2018.2874986
  • Noroozi, F., Sapiński, T., Kamińska, D., & Anbarjafari, G. (2017). Vocal-based emotion recognition using random forests and decision tree. International Journal of Speech Technology, 20(2), 239–246. https://doi.org/10.1007/s10772-017-9396-2
  • Nwe, T. L., Foo, S. W., & De Silva, L. C. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41(4), 603–623. https://doi.org/10.1016/S0167-6393(03)00099-2
  • Pławiak, P., Sośnicki, T., & Rzecki, K. (2016). Hand body language gesture recognition based on signals from specialized glove and machine learning algorithms. IEEE Transactions on Industrial Informatics, 12(3), 1104-1113. https://doi.org/10.1109/TII.2016.2550528
  • Ren, L., Dong, J., Wang, X., Meng, Z., Zhao, L., & Deen, M. J. (2020). A data-driven auto-cnn-lstm prediction model for lithium-ion battery remaining useful life. IEEE Transactions on Industrial Informatics, 17(5), 3478–3487. https://doi.org/10.1109/TII.9424
  • Ren, L., Meng, Z., Wang, X., Zhang, L., & Yang, L. T. (2020). A data-driven approach of product quality prediction for complex production systems. IEEE Transactions on Industrial Informatics, 17(9), 6457–6465. https://doi.org/10.1109/TII.2020.3001054
  • Sajjad, M., & Kwon, S. (2020). Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access, 8(2020), 79861–79875. https://doi.org/10.1109/Access.6287639
  • Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., & Wendemuth, A. (2009). Acoustic emotion recognition: A benchmark comparison of performances. In 2009 IEEE workshop on automatic speech recognition & understanding (pp. 552–557). IEEE.
  • Shinde, A. S., Patil, V. V., Khadse, K. R., Jadhav, N., Joglekar, S., & Hatwalne, M. (2022). ML based speech emotion recognition framework for music therapy suggestion system. In 2022 6th international conference on computing, communication, control and automation (ICCUBEA) (pp. 1–5). IEEE.
  • Shojaeilangari, S., Yau, W. Y., & Teoh, E. K. (2016). Pose-invariant descriptor for facial emotion recognition. Machine Vision and Applications, 27(7), 1063–1070. https://doi.org/10.1007/s00138-016-0794-2
  • Sowmya, G., Naresh, K., Sri, J. D., Sai, K. P., & Indira, D. V. (2022). Speech2Emotion: Intensifying emotion detection using MLP through RAVDESS dataset. In 2022 international conference on electronics and renewable systems (ICEARS) (pp. 1–3). IEEE.
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 2818–2826). IEEE.
  • Thakare, C., Chaurasia, N. K., Rathod, D., Joshi, G., & Gudadhe, S. (2019). Comparative analysis of emotion recognition system. International Research Journal of Engineering and Technology, 6(12), 380–384.
  • Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016). Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5200–5204). IEEE.
  • Wang, W. (2010). Machine audition: Principles, algorithms and systems: principles, algorithms and systems. IGI Global.
  • Wang, X., Ren, L., Yuan, R., Yang, L. T., & Deen, M. J. (2022). Qtt-dlstm: A cloud-edge-aided distributed lstm for cyber-physical-social big data. IEEE Transactions on Neural Networks and Learning Systems, 1–13. https://doi.org/10.1109/TNNLS.2022.3140238
  • Wang, X., Yang, L. T., Ren, L., Wang, Y., & Deen, M. J. (2022). A tensor-based computing and optimization model for intelligent edge services. IEEE Network, 36(1), 40–44. https://doi.org/10.1109/MNET.011.1800508
  • Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., & Vepa, J. (2018). Speech emotion recognition using spectrogram & phoneme embedding. In Interspeech (Vol. 2018, pp. 3688–3692). ISCA.
  • Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2009). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39–58. https://doi.org/10.1109/TPAMI.2008.52