Research Article

Multi-branch feature learning based speech emotion recognition using SCAR-NET

Article: 2189217 | Received 25 Dec 2022, Accepted 04 Mar 2023, Published online: 27 Apr 2023
