Moving From Traditional Machine Learning to Deep Learning for Speech Based Emotion Detection on Different Datasets

Authors

  • Priyanka Joshi, School of Engineering & Technology, Shri Guru Ram Rai University, Uttarakhand, India
  • Prof. Dr. Sonika Kandari, School of Engineering & Technology, Shri Guru Ram Rai University, Uttarakhand, India

DOI:

https://doi.org/10.21467/proceedings.7.5.1

Keywords:

deep learning, machine learning, speech emotion recognition

Abstract

Speech Emotion Recognition (SER) is the task of predicting human emotions from speech. SER technology has applications in several fields: in medicine, where knowing a patient's emotional state can support better diagnosis; in education and entertainment; and in human-computer interaction (HCI). Research on SER has been ongoing for the last two decades, and researchers have proposed many different approaches to the task. Earlier work relied on traditional machine learning models, but with the advancement of neural networks, deep learning approaches are proving to be the most effective for predicting emotions from speech signals. Motivated by this development and by the constant requirement for precise, near-real-time SER in human-computer interaction, we compare existing SER approaches and databases in order to arrive at workable solutions and a more solid understanding of this open-ended problem. This paper gives an overview of the SER framework, from machine learning models to deep learning models, together with the datasets available for SER.

Published

2025-09-23

How to Cite

[1]
P. Joshi and S. Kandari, “Moving From Traditional Machine Learning to Deep Learning for Speech Based Emotion Detection on Different Datasets”, AIJR Proc., vol. 7, no. 5, pp. 1–7, Sep. 2025, doi: 10.21467/proceedings.7.5.1.