Poster Session 3, 1:15 PM - 2:00 PM: Room 163 [C5]

Emotion Detection from Speech Audio Using Deep Learning Architectures

Presenter: Tanmay Sonawane

Faculty Sponsor: Alfa Heryudono

School: UMass Dartmouth

Research Area: Computer Science

ABSTRACT

Human speech carries rich paralinguistic information, particularly emotion, which provides valuable insight into a speaker's psychological state, intent, and behavioral response. This project investigates how modern deep learning architectures can detect emotion from speech audio using time-frequency representations stored as numerical arrays. Centered on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the methodology involves systematic preprocessing of raw audio files into normalized Mel-based NumPy (.npy) representations, followed by multimodal learning that jointly processes raw waveforms and spectrogram arrays. A pretrained ResNet-18 architecture, implemented in PyTorch, serves as the primary convolutional backbone. Model performance is evaluated using accuracy, F1-score, precision, recall, and confusion matrices; the baseline model achieves approximately 70% accuracy on a five-class emotion mapping.
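
As an illustration of the evaluation metrics named above, the following is a minimal sketch in plain Python of how accuracy, per-class precision, recall, and F1-score can be derived from a confusion matrix. It is not the project's evaluation code. The label set in LABELS and all function names are assumptions made for the example, since the abstract does not list the five emotion classes.

```python
# Hypothetical five-class label set; the abstract does not name the classes.
LABELS = ["neutral", "happy", "sad", "angry", "fearful"]


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true classes and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_metrics(matrix, labels):
    """Precision, recall, and F1 for each class, guarding against empty classes."""
    n = len(labels)
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # predicted i, true class differs
        fn = sum(matrix[i]) - tp                       # true i, predicted class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics):
    """Unweighted mean of per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


if __name__ == "__main__":
    # Toy labels for demonstration only; these are not results from the project.
    y_true = ["neutral", "happy", "sad", "angry", "fearful", "happy"]
    y_pred = ["neutral", "happy", "sad", "angry", "sad", "sad"]
    cm = confusion_matrix(y_true, y_pred, LABELS)
    scores = per_class_metrics(cm, LABELS)
    print("accuracy:", round(accuracy(cm), 3))
    print("macro F1:", round(macro_f1(scores), 3))
```

Macro averaging is used here because it weights every emotion class equally. Whether the project reports macro, micro, or weighted averages is not stated in the abstract.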

To further assess robustness and generalization, this work will be extended to the Toronto Emotional Speech Set (TESS), enabling cross-dataset evaluation and combined-training strategies. In addition to the PyTorch-based models, equivalent architectures will be implemented and tested in TensorFlow to provide a comparative analysis of deep learning frameworks for speech emotion recognition. Differences in training dynamics, performance, and deployment considerations across the two frameworks will be systematically examined. The intended application of this research is emergency service call analysis, where real-time emotion detection can assist dispatchers by identifying heightened stress, fear, or distress in callers. By benchmarking models across datasets and frameworks, this project aims to support the development of reliable, emotion-aware systems for safety-critical, interactive, and assistive technologies.

