Presenter: Tanmay Sonawane
Faculty Sponsor: Alfa Heryudono
School: UMass Dartmouth
Research Area: Computer Science
ABSTRACT
Human speech carries rich paralinguistic information, particularly emotion, which offers insight into a speaker's psychological state, intent, and behavioral response. This project investigates how modern deep learning architectures can detect emotion from speech audio using time-frequency representations stored as numerical arrays. Centered on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the methodology systematically preprocesses raw audio files into normalized Mel-based NumPy (.npy) representations, followed by multimodal learning that jointly processes raw waveforms and spectrogram arrays. A pretrained ResNet-18, implemented in PyTorch, serves as the primary convolutional backbone. Model performance is evaluated using accuracy, precision, recall, F1-score, and confusion matrices; the baseline reaches approximately 70% accuracy on a five-class emotion mapping.
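The evaluation metrics named above can all be derived from a single confusion matrix. The following is a minimal sketch in plain Python, not the project's evaluation code: the function names, the macro-averaging choice, and the toy label lists are assumptions made for illustration, and the numbers it produces are unrelated to the reported results.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def summarize(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    matrix = confusion_matrix(y_true, y_pred, n_classes)
    scores = per_class_scores(matrix)
    correct = sum(matrix[c][c] for c in range(n_classes))
    return {
        "accuracy": correct / len(y_true),
        "macro_precision": sum(s[0] for s in scores) / n_classes,
        "macro_recall": sum(s[1] for s in scores) / n_classes,
        "macro_f1": sum(s[2] for s in scores) / n_classes,
        "confusion_matrix": matrix,
    }


# Toy labels for five classes (made up for illustration, not project data).
y_true = [0, 0, 1, 1, 2, 2, 3, 4, 4, 4]
y_pred = [0, 1, 1, 1, 2, 0, 3, 4, 4, 4]
report = summarize(y_true, y_pred, n_classes=5)
print(report["accuracy"], round(report["macro_f1"], 3))
```

Macro averaging weights every emotion class equally, which is one reasonable choice when class counts differ; a weighted average would be an equally valid alternative.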
To further assess robustness and generalization, this work will be extended to the Toronto Emotional Speech Set (TESS), enabling cross-dataset evaluation and combined-training strategies. In addition to the PyTorch-based models, equivalent architectures will be implemented and tested in TensorFlow to provide a comparative analysis of deep learning frameworks for speech emotion recognition. Differences in training dynamics, performance, and deployment considerations across the two frameworks will be systematically examined. The intended application of this research is emergency service call analysis, where real-time emotion detection can assist dispatchers by identifying heightened stress, fear, or distress in callers. By benchmarking models across datasets and frameworks, this project aims to support the development of reliable, emotion-aware systems for safety-critical, interactive, and assistive technologies.
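Cross-dataset evaluation requires both corpora to share one label space. The sketch below shows one way that harmonization might look, assuming the standard RAVDESS filename scheme (third hyphen-separated field is the emotion code) and the common TESS naming pattern (emotion as the last underscore-separated token). The five classes in SHARED_CLASSES are a hypothetical choice; the abstract does not state which five emotions the project uses.

```python
from pathlib import Path

# RAVDESS emotion codes, taken from the third field of the filename.
RAVDESS_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

# TESS filename suffixes (assumed naming), mapped onto the RAVDESS emotion names.
TESS_SUFFIXES = {
    "neutral": "neutral", "happy": "happy", "sad": "sad", "angry": "angry",
    "fear": "fearful", "disgust": "disgust", "ps": "surprised",
}

# Hypothetical five-class target set; the project's actual mapping may differ.
SHARED_CLASSES = ["neutral", "happy", "sad", "angry", "fearful"]


def ravdess_emotion(filename):
    """Read the emotion from a name such as '03-01-06-01-02-01-12.wav'."""
    fields = Path(filename).stem.split("-")
    return RAVDESS_CODES[fields[2]]


def tess_emotion(filename):
    """Read the emotion from a name such as 'OAF_back_angry.wav'."""
    suffix = Path(filename).stem.rsplit("_", 1)[-1].lower()
    return TESS_SUFFIXES[suffix]


def shared_label(emotion):
    """Return the class index, or None if the emotion is outside the shared set."""
    if emotion in SHARED_CLASSES:
        return SHARED_CLASSES.index(emotion)
    return None


# Both files below resolve to comparable labels; 'ps' falls outside the shared set.
print(shared_label(ravdess_emotion("03-01-06-01-02-01-12.wav")))
print(shared_label(tess_emotion("OAF_back_angry.wav")))
print(shared_label(tess_emotion("YAF_date_ps.wav")))
```

Returning None for emotions outside the shared set lets a data loader skip those clips instead of forcing them into a mismatched class.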