[2603.23057] Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2603.23057 (eess)
[Submitted on 24 Mar 2026]

Title: Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition
Authors: Saurabh Kataria, Xiao Hu

Abstract: Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.

Subjects: Audio and...
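
The abstract names three ingredients: a prompt ensemble, prompt amplification (repeating audio and text queries), and late fusion of zero-shot estimates with a specialist model. The sketch below illustrates one plausible reading of each using NumPy only. It is not the authors' implementation. The label set, prompt templates, temperature, fusion weight, and every function name are hypothetical, and amplification is assumed here to mean literal repetition of the prompt string and of the waveform. The embeddings would come from a dual-encoder ALM, which is not modelled.

```python
import numpy as np

# Hypothetical label set and prompt templates (not taken from the paper).
EMOTIONS = ["angry", "happy", "neutral", "sad"]
TEMPLATES = ["this person sounds {}", "a {} voice"]


def amplify_text(prompt: str, k: int) -> str:
    """Repeat a text query k times (assumed form of prompt amplification)."""
    return " ".join([prompt] * k)


def amplify_audio(wave: np.ndarray, k: int) -> np.ndarray:
    """Repeat a mono waveform k times back to back (assumed form)."""
    return np.tile(wave, k)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def zero_shot_posterior(audio_emb, text_embs, temperature=0.07):
    """Prompt-ensemble zero-shot estimate.

    audio_emb: (d,) embedding of the (possibly amplified) audio.
    text_embs: (n_templates, n_classes, d) embeddings of the prompts.
    The temperature value is an assumption.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    sims = t @ a                                  # cosine similarities, (n_templates, n_classes)
    per_prompt = softmax(sims / temperature, axis=-1)
    return per_prompt.mean(axis=0)                # average over templates


def late_fuse(p_zero_shot, p_specialist, weight=0.5):
    """Weighted average of two class posteriors (one simple late-fusion rule)."""
    fused = weight * p_zero_shot + (1.0 - weight) * p_specialist
    return fused / fused.sum()


# Usage with random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
prompts = [[amplify_text(t.format(e), 2) for e in EMOTIONS] for t in TEMPLATES]
audio = amplify_audio(rng.standard_normal(16000), 2)
audio_emb = rng.standard_normal(8)                          # stand-in for ALM audio embedding
text_embs = rng.standard_normal((len(TEMPLATES), len(EMOTIONS), 8))
p_zs = zero_shot_posterior(audio_emb, text_embs)
p_fm = softmax(rng.standard_normal(len(EMOTIONS)))          # stand-in for specialist FM posterior
p_fused = late_fuse(p_zs, p_fm, weight=0.3)
```

A weighted average of posteriors is only one choice of fusion rule; averaging log-probabilities or logits would fit the same interface, and the weight would normally be tuned on held-out data.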