ABSTRACT
Speech emotion recognition (SER) has received significant attention recently and is a critical aspect of human–computer interaction. Although many different strategies have been developed for SER, the performance of recent methods is not yet at the expected level. In present-day communication, non-verbal vocalisation together with the vocal sound plays an essential role in emotional expression. A multimodal database is therefore considered, for which a hybridised audio-visual emotion identification model is proposed. In this research, an optimised deep neural network (NN) is used to recognise emotions from multimodal input data. To improve emotion recognition performance, learner memorising optimisation fine-tunes the hyperparameters of the deep NN classifier. A hybrid texture feature descriptor is proposed to improve the classification outcomes, as well as the accuracy, obtained from audio-video signals. Learner memorising optimisation enables the deep NN classifier to obtain the ideal parameters by exploring the search space with high accuracy. The learner memorising optimisation-based deep NN was evaluated on the eNTERFACE'05 database; the proposed method obtained 97.490% accuracy, 98% sensitivity, and 97.490% specificity for a K-fold value of 10, and 96.928% accuracy, 98.80% sensitivity, and 96.928% specificity for 90% training data.
Disclosure statement
No potential conflict of interest was reported by the author(s).