Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Citations

Web of Science: 2
Scopus: 5
Abstract

Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships. Speech emotion recognition therefore plays a crucial role in natural human-computer interaction. The process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech emotion datasets is complex and costly, which leads to small data volumes and uneven distributions across emotion classes. This scarcity and imbalance reduce the accuracy and reliability of emotion recognition. To address these issues, this study introduces a more robust and adaptive model that employs the Ranking Magnitude Method (RMM), which is based on sparse learning. We use Root Mean Square (RMS) energy and Zero Crossing Rate (ZCR) as temporal features to measure the overall volume and noise intensity of the speech. Mel-Frequency Cepstral Coefficients (MFCCs) are used to extract critical speech features, and these features are fed into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN) model. For spatial features, we analyze each utterance using its log-Mel spectrogram and process the resulting patterns with a 2D Convolutional Neural Network Squeeze and Excitation Network (CNN-SEN) model. The core of our method is a Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalance by selectively retraining underperforming nodes. This dynamic adjustment of learning priorities enhances the robustness and accuracy of emotion recognition. With this approach, our model outperforms state-of-the-art methods on several datasets, achieving accuracies of 97.18%, 97.92%, 99.31%, and 96.89% on the EMOVO, RAVDESS, SAVEE, and EMO-DB datasets, respectively.
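The two temporal features named in the abstract, RMS energy and ZCR, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the frame length of 2048 samples, and the hop of 512 samples are assumptions chosen only for illustration, since the abstract does not state the framing parameters.

```python
import numpy as np

def frame_signal(signal, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames, one frame per row.

    frame_length and hop_length are illustrative assumptions,
    not values taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_length:
        # Zero-pad short signals so that at least one full frame exists.
        signal = np.pad(signal, (0, frame_length - len(signal)))
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Index matrix: row i holds the sample indices of frame i.
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return signal[idx]

def rms_energy(frames):
    """Root mean square energy of each frame (a loudness proxy)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
```

For a pure tone of amplitude A, the RMS energy of each frame is close to A/√2, and the ZCR is close to twice the tone frequency divided by the sampling rate. Noisy or unvoiced speech typically shows a higher ZCR than voiced speech, which is why ZCR is often used as a rough indicator of noisiness.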

Keywords

Emotion recognition; Speech recognition; Hidden Markov models; Feature extraction; Brain modeling; Accuracy; Convolutional neural networks; Data models; Time-domain analysis; Deep learning; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning; FEATURES; DATABASES; ATTENTION; NETWORK
Title
Speech Emotion Recognition via Sparse Learning-Based Fusion Model
Authors
Min, Dong-Jin; Kim, Deok-Hwan
DOI
10.1109/ACCESS.2024.3506565
Publication Date
2024
Type
Article
Journal
IEEE Access, vol. 12
Pages
177219-177235