Bi-Modal Bi-Task Emotion Recognition Based on Transformer Architecture

Article: 2356992 | Received 02 Jan 2024, Accepted 13 May 2024, Published online: 21 May 2024

References

  • Atmaja, B. T., and M. Akagi. 2020. Multitask learning and multistage fusion for dimensional audiovisual emotion recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4482–86. doi:10.1109/ICASSP40776.2020.9052916
  • Bendjoudi, I., F. Vanderhaegen, D. Hamad, and F. Dornaika. 2021. Multi-label, multi-task CNN approach for context-based emotion recognition. Information Fusion 76:422–28. doi:10.1016/j.inffus.2020.11.007
  • Cai, Y., X. Li, and J. Li. 2023. Emotion recognition using different sensors, emotion models, methods and datasets: A comprehensive review. Sensors 23 (5):2455. doi:10.3390/s23052455
  • Chang, X., and W. Skarbek. 2021. Multi-modal residual perceptron network for audio–video emotion recognition. Sensors 21 (16):5452. doi:10.3390/s21165452
  • Chen, W., X. Xing, P. Chen, and X. Xu. 2024. Vesper: A compact and effective pretrained model for speech emotion recognition. IEEE Transactions on Affective Computing. doi:10.1109/TAFFC.2024.3369726
  • Datta, S., and S. Chakrabarti. 2022. Integrated two variant deep learners for aspect-based sentiment analysis: An improved meta-heuristic-based model. Cybernetics and Systems 1–37. doi:10.1080/01969722.2022.2145657
  • Feng, J., S. Cai, K. Li, Y. Chen, Q. Cai, and H. Zhao. 2023. Fusing syntax and semantics-based graph convolutional network for aspect-based sentiment analysis. International Journal of Data Warehousing and Mining 19 (1):1–15. doi:10.4018/IJDWM.319803
  • Feng, H., S. Ueno, and T. Kawahara. 2020. End-to-end speech emotion recognition combined with acoustic-to-word ASR model. In Interspeech 2020, 501–05. doi:10.21437/Interspeech.2020-1180
  • Goshvarpour, A., and A. Goshvarpour. 2023. Novel high-dimensional phase space features for EEG emotion recognition. Signal, Image and Video Processing 17 (2):417–25. doi:10.1007/s11760-022-02248-6
  • Huang, W., S. Cai, H. Li, and Q. Cai. 2023. Structure graph refined information propagate network for aspect-based sentiment analysis. International Journal of Data Warehousing and Mining 19 (1):1–20. doi:10.4018/IJDWM.327363
  • Kakuba, S., A. Poulose, and D. S. Han. 2022a. Attention-based multi-learning approach for speech emotion recognition with dilated convolution. IEEE Access 10:122302–13. doi:10.1109/ACCESS.2022.3223705
  • Kakuba, S., A. Poulose, and D. S. Han. 2022b. Deep learning-based speech emotion recognition using multi-level fusion of concurrent features. IEEE Access 10:125538–51. doi:10.1109/ACCESS.2022.3225684
  • Kakuba, S., A. Poulose, and D. S. Han. 2023. Deep learning approaches for bimodal speech emotion recognition: Advancements, challenges, and a multi-learning model. IEEE Access 11:113769–89. doi:10.1109/ACCESS.2023.3325037
  • Kumar, Y., and M. Mahajan. 2019. Machine learning based speech emotions recognition system. International Journal of Scientific and Technology Research 8 (7):722–29.
  • Latif, S., R. K. Rana, S. Khalifa, R. Jurdak, and B. Schuller. 2022. Multitask learning from augmented auxiliary data for improving speech emotion recognition. IEEE Transactions on Affective Computing 14 (4). doi:10.1109/TAFFC.2022.3221749
  • Le, H., G. Lee, S. Kim, S. Kim, and H. Yang. 2023. Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access 11:14742–51. doi:10.1109/ACCESS.2023.3244390
  • Lian, Z., B. Liu, and J. Tao. 2021. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29:985–1000. doi:10.1109/TASLP.2021.3049898
  • Lian, Z., B. Liu, and J. Tao. 2022. SMIN: Semi-supervised multi-modal interaction network for conversational emotion recognition. IEEE Transactions on Affective Computing 14:2415–29. doi:10.1109/TAFFC.2022.3141237
  • Li, H., W. Ding, Z. Wu, and Z. Liu. 2020. Learning fine-grained cross modality excitement for speech emotion recognition. arXiv preprint arXiv:2010.12733.
  • Lu, Z., L. Cao, Y. Zhang, C. Chiu, and J. Fan. 2020. Speech sentiment analysis via pre-trained features from end-to-end ASR models. In IEEE ICASSP, Barcelona, Spain, 7149–53. doi:10.1109/ICASSP40776.2020.9052937
  • Ma, H., J. Wang, H. Lin, B. Zhang, Y. Zhang, and B. Xu. 2023. A transformer-based model with self-distillation for multimodal emotion recognition in conversations. arXiv preprint arXiv:2310.20494.
  • Meng, W., and N. Yolwas. 2023. A study of speech recognition for Kazakh based on unsupervised pre-training. Sensors 23 (2):870. doi:10.3390/s23020870
  • Min, B., H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth. 2023. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys 56 (2):1–40. doi:10.1145/3605943
  • Morais, E. D., R. Hoory, W. Zhu, I. Gat, M. Damasceno, and H. Aronowitz. 2022. Speech emotion recognition using self-supervised features. arXiv preprint arXiv:2202.03896.
  • Padi, S., S. O. Sadjadi, D. Manocha, and R. D. Sriram. 2022. Multimodal emotion recognition using transfer learning from speaker recognition and BERT-based models. arXiv preprint arXiv:2202.08974.
  • Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32.
  • Ribeiro, A. H., and T. B. Schön. 2023. Overparameterized linear regression under adversarial attacks. IEEE Transactions on Signal Processing 71:601–14. doi:10.1109/TSP.2023.3246228
  • Sajjad, M., and S. Kwon. 2020. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 8:79861–75. doi:10.1109/ACCESS.2020.2990405
  • Samant, S. S., V. Singh, A. Chauhan, and J. Dasarahalli Narasimaiah. 2022. An optimized crossover framework for social media sentiment analysis. Cybernetics and Systems 1–29. doi:10.1080/01969722.2022.2146849
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Sarvakar, K., R. Senkamalavalli, S. Raghavendra, J. S. Kumar, R. Manjunath, and S. Jaiswal. 2023. Facial emotion recognition using convolutional neural networks. Materials Today: Proceedings 80:3560–64. doi:10.1016/j.matpr.2021.07.297
  • Schneider, S., A. Baevski, R. Collobert, and M. Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
  • Sebastian, J., and P. Pierucci. 2019. Fusion techniques for utterance level emotion recognition combining speech and transcripts. In Interspeech 2019, 51–55. doi:10.21437/Interspeech.2019-3201
  • Sharafi, M., M. Yazdchi, R. Rasti, and F. Nasimi. 2022. A novel spatio-temporal convolutional neural framework for multimodal emotion recognition. Biomedical Signal Processing and Control 78:103970. doi:10.1016/j.bspc.2022.103970
  • Singh, P., R. Srivastava, K. P. Rana, and V. Kumar. 2021. A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229:107316. doi:10.1016/j.knosys.2021.107316
  • Tsai, Y. H., S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6558–69. doi:10.18653/v1/p19-1656
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30.
  • Wang, H., X. Li, Z. Ren, M. Wang, and C. Ma. 2023. Multimodal sentiment analysis representations learning via contrastive learning with condense attention fusion. Sensors 23 (5):2679. doi:10.3390/s23052679
  • Wang, J., M. Xue, R. Culhane, E. Diao, J. Ding, and V. Tarokh 2020. Speech emotion recognition with dual-sequence LSTM architecture. In IEEE ICASSP, Barcelona, Spain 6474–78. doi:10.1109/ICASSP40776.2020.9054629
  • Wen, H., S. You, and Y. Fu. 2021. Cross-modal dynamic convolution for multi-modal emotion recognition. Journal of Visual Communication and Image Representation 78:103178. doi:10.1016/j.jvcir.2021.103178
  • Wu, X., S. Lv, L. Zang, J. Han, and S. Hu. 2019. Conditional BERT contextual augmentation. In International Conference on Computational Science. Cham: Springer. doi:10.1007/978-3-030-22747-0_7
  • Xie, B., M. Sidulova, and C. H. Park. 2021. Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors 21 (14):4913. doi:10.3390/s21144913
  • Xu, M., F. Zhang, and W. Zhang. 2021. Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset. IEEE Access 9:74539–49. doi:10.1109/ACCESS.2021.3067460
  • Yang, E., J. W. Pan, X. M. Wang, H. B. Yu, L. Shen, X. H. Chen, L. Xiao, J. Jiang, and G. B. Guo. 2023. AdaTask: A task-aware adaptive learning rate approach to multi-task learning. Proceedings of the AAAI Conference on Artificial Intelligence 37 (9):10745–53. doi:10.1609/aaai.v37i9.26275
  • Zeng, Y., H. Mao, D. Peng, and Z. Yi. 2019. Spectrogram based multi-task audio classification. Multimedia Tools and Applications 78 (3):3705–22. doi:10.1007/s11042-017-5539-3