Note

Knowledge enhancement for speech emotion recognition via multi-level acoustic feature

Article: 2312103 | Received 17 Oct 2023, Accepted 26 Jan 2024, Published online: 01 Feb 2024

References

  • Badshah, A. M., Ahmad, J., Rahim, N., & Baik, S. W. (2017). Speech emotion recognition from spectrograms with deep convolutional neural network. In International conference on platform technology and service (PlatCon) (pp. 1–5).
  • Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
  • Bakhshi, A., Wong, A. S. W., & Chalup, S. K. (2020). End-to-end speech emotion recognition based on time and frequency information using deep neural networks. In Proceedings of the 24th European conference on artificial intelligence (ECAI) (pp. 969–975).
  • Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. https://doi.org/10.1007/s10579-008-9076-6
  • Cai, X., Yuan, J., Zheng, R., Huang, L., & Church, K. (2021). Speech emotion recognition with multi-task learning. In Proceedings of the 22nd annual conference of the international speech communication association (Interspeech) (pp. 4508–4512).
  • Chaurasiya, H. (2020). Time-frequency representations: Spectrogram, cochleogram and correlogram. Procedia Computer Science, 167, 1901–1910. https://doi.org/10.1016/j.procs.2020.03.209
  • Chen, L., Ge, L., & Zhou, W. (2023). IIM: An information interaction mechanism for aspect-based sentiment analysis. Connection Science, 35(1), 2283390. https://doi.org/10.1080/09540091.2023.2283390
  • Chen, L., & Rudnicky, A. I. (2021). Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition. CoRR, abs/2110.06309.
  • Cummins, N., Amiriparian, S., Hagerer, G., Batliner, A., Steidl, S., & Schuller, B. W. (2017). An image-based deep spectrum feature representation for the recognition of emotional speech. In Proceedings of the 25th ACM international conference on multimedia (pp. 478–484).
  • Guo, L., Wang, L., Xu, C., Dang, J., Chng, E. S., & Li, H. (2021). Representation learning with spectro-temporal-channel attention for speech emotion recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6304–6308).
  • Han, J., Zhang, Z., Ren, Z., & Schuller, B. W. (2021). EmoBed: Strengthening monomodal emotion recognition via training with crossmodal emotion embeddings. IEEE Transactions on Affective Computing, 12(3), 553–564. https://doi.org/10.1109/TAFFC.2019.2928297
  • Kim, T., Cho, S., Choi, S., Park, S., & Lee, S. (2020). Emotional voice conversion using multitask learning with text-to-speech. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 7774–7778).
  • Kumbhar, H. S., & Bhandari, S. U. (2019). Speech emotion recognition using MFCC features and LSTM network. In Proceedings of the 5th international conference on computing, communication, control and automation (ICCUBEA) (pp. 1–3).
  • Latif, S., Rana, R., Khalifa, S., Jurdak, R., & Epps, J. (2019). Direct modelling of speech emotion from raw speech. In Proceedings of the 20th annual conference of the international speech communication association (Interspeech) (pp. 3920–3924).
  • Li, Y., Hu, S. L., Wang, J., & Huang, Z. H. (2020). An introduction to the computational complexity of matrix multiplication. Journal of the Operations Research Society of China, 8(1), 29–43. https://doi.org/10.1007/s40305-019-00280-x
  • Mao, K., Wang, Y., Ren, L., Zhang, J., Qiu, J., & Dai, G. (2023). Multi-branch feature learning based speech emotion recognition using SCAR-NET. Connection Science, 35(1), 2189217. https://doi.org/10.1080/09540091.2023.2189217
  • McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in science conference (SciPy) (pp. 18–24).
  • Rao, K. S., Koolagudi, S. G., & Vempada, R. R. (2013). Emotion recognition from speech using global and local prosodic features. International Journal of Speech Technology, 16(2), 143–160. https://doi.org/10.1007/s10772-012-9172-2
  • Ren, Z., Kong, Q., Qian, K., Plumbley, M. D., & Schuller, B. W. (2018). Attention-based convolutional neural networks for acoustic scene classification. In Proceedings of the workshop on detection and classification of acoustic scenes and events (DCASE) (pp. 39–43).
  • Sheikhan, M., Abbasnezhad Arabi, M., & Gharavian, D. (2015). Structure and weights optimisation of a modified Elman network emotion classifier using hybrid computational intelligence algorithms: A comparative study. Connection Science, 27(4), 340–357. https://doi.org/10.1080/09540091.2015.1080224
  • Tarantino, L., Garner, P. N., & Lazaridis, A. (2019). Self-attention for speech emotion recognition. In Proceedings of the 20th annual conference of the international speech communication association (Interspeech) (pp. 2578–2582).
  • Tsai, Y. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L., & Salakhutdinov, R. (2019). Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th conference of the association for computational linguistics (ACL) (pp. 6558–6569).
  • Wani, T. M., Gunawan, T. S., Qadri, S. A. A., Mansor, H., Kartiwi, M., & Ismail, N. (2020). Speech emotion recognition using convolution neural networks and deep stride convolutional neural networks. In Proceedings of the 6th international conference on wireless and telematics (ICWT) (pp. 1–6).
  • Wu, W., Zhang, C., & Woodland, P. C. (2021). Emotion recognition by fusing time synchronous and time asynchronous representations. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6269–6273).
  • Ye, Z., Jing, Y., Wang, Q., Li, P., Liu, Z., Yan, M., Zhang, Y., & Gao, D. (2023). Emotion recognition based on convolutional gated recurrent units with attention. Connection Science, 35(1), 2289833. https://doi.org/10.1080/09540091.2023.2289833
  • Zadeh, A., Liang, P. P., Poria, S., Cambria, E., & Morency, L. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th annual meeting of the association for computational linguistics (ACL) (pp. 2236–2246).
  • Zadeh, A., Zellers, R., Pincus, E., & Morency, L. (2016). Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 31(6), 82–88. https://doi.org/10.1109/MIS.2016.94
  • Zhang, S., Zhao, X., & Tian, Q. (2022). Spontaneous speech emotion recognition using multiscale deep convolutional LSTM. IEEE Transactions on Affective Computing, 13(2), 680–688. https://doi.org/10.1109/TAFFC.2019.2947464
  • Zhao, H., Chen, H., Xiao, Y., & Zhang, Z. (2023). Privacy-enhanced federated learning against attribute inference attack for speech emotion recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1–5).
  • Zhao, H., Xiao, Y., & Zhang, Z. (2020). Robust semisupervised generative adversarial networks for speech emotion recognition via distribution smoothness. IEEE Access, 8, 106889–106900.
  • Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582–589. https://doi.org/10.1007/BF02943243
  • Zou, H., Si, Y., Chen, C., Rajan, D., & Chng, E. S. (2022). Speech emotion recognition with co-attention based multi-level acoustic information. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 7367–7371).