[2602.13259] Learning Physiology-Informed Vocal Spectrotemporal Representations for Speech Emotion Recognition
Summary
This paper presents PhysioSER, an approach to speech emotion recognition that integrates physiological insights into vocal representations to improve model interpretability and efficiency.
Why It Matters
Understanding and recognizing emotions in speech is crucial for applications like humanoid robotics and psychological diagnostics. PhysioSER addresses limitations in current models by incorporating physiological features, potentially improving performance and safety in emotional interactions.
Key Takeaways
- PhysioSER integrates vocal amplitude and phase dynamics for better emotion recognition.
- The model is designed to be interpretable and efficient, addressing limitations of existing deep models.
- Extensive evaluations demonstrate its effectiveness across multiple datasets and languages.
- Real-time deployment on humanoid robots validates its practical application.
- The approach offers a compact, plug-and-play design suitable for various SER tasks.
Computer Science > Sound
arXiv:2602.13259 (cs)
[Submitted on 3 Feb 2026]
Title: Learning Physiology-Informed Vocal Spectrotemporal Representations for Speech Emotion Recognition
Authors: Xu Zhang, Longbing Cao, Runze Yang, Zhangkai Wu
Abstract: Speech emotion recognition (SER) is essential for humanoid robot tasks such as social robotic interaction and robotic psychological diagnosis, where interpretable and efficient models are critical for safety and performance. Existing deep models trained on large datasets remain largely uninterpretable, often insufficiently modeling the underlying emotional acoustic signals and failing to capture and analyze the core physiology of emotional vocal behaviors. Physiological research on human voices shows that the dynamics of vocal amplitude and phase correlate with emotions through the vocal tract filter and the glottal source. However, most existing deep models involve only amplitude and fail to couple the physiological features of, and interactions between, amplitude and phase. Here, we propose PhysioSER, a physiology-informed vocal spectrotemporal representation learning method, to address these issues with a compact, plug-and-play design. PhysioSER constructs amplitude and phase views informed by voice anatomy and physiology (VAP) to complement SSL models for SER. This VAP-informed f...
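The paper's actual VAP-informed view construction is not reproduced in this summary. As a minimal sketch of what complementary amplitude and phase views of a speech signal look like, assuming a plain short-time Fourier transform (STFT) decomposition (the function name and parameters below are illustrative, not from the paper):

```python
import numpy as np

def amplitude_phase_views(signal, frame_len=512, hop=128):
    """Split a waveform into complementary amplitude and phase
    spectrogram views via a windowed STFT (illustrative sketch)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)  # complex spectrum per frame
    amplitude = np.abs(spec)            # magnitude: vocal-tract-filter energy
    phase = np.angle(spec)              # phase: glottal-source timing cues
    return amplitude, phase

# Toy example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
amp, ph = amplitude_phase_views(np.sin(2 * np.pi * 440 * t))
print(amp.shape, ph.shape)  # both are (n_frames, frame_len // 2 + 1)
```

In this decomposition the two views are exactly complementary: the complex spectrum, and hence the waveform, can be reconstructed from amplitude and phase together, which is why a model that uses amplitude alone discards the glottal-source information carried in phase.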