[2602.12714] ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning
Summary
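The paper introduces ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that treats speech emotion recognition as a multi-turn inquiry rather than a single-pass prediction. A Speech LLM acts as an agent that gathers semantic and acoustic evidence through dedicated probing tools before deciding, shifting the task from consensus learning to ambiguity-driven reasoning.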
Why It Matters
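This research targets two limitations of current approaches. Speech LLMs can reason about emotion at a high level but often produce ungrounded, text-biased judgments, while self-supervised speech encoders such as WavLM provide strong acoustic representations with limited interpretability. ADEPT pairs the reasoning of the former with tool-based acoustic and semantic evidence in a structured inquiry, aiming for emotion assessments that are both more accurate and easier to verify. This is particularly relevant as emotion recognition becomes more widely integrated into AI applications.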
Key Takeaways
- ADEPT redefines emotion recognition as a multi-turn inquiry process.
- The framework improves accuracy for both primary and minor emotions.
- It integrates acoustic and semantic probing tools for evidence-based reasoning.
- Minority annotations are treated as valuable signals rather than noise.
- Group Relative Policy Optimization enhances prediction quality by coupling tool usage with evidence.
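To make the last takeaway more concrete, the core step of Group Relative Policy Optimization (GRPO) is to score each sampled rollout against the other rollouts drawn for the same input. The sketch below shows only that normalization step. The reward values are made up for illustration, the use of the population standard deviation is one possible choice, and nothing here reflects the paper's actual reward design, which the summary does not describe.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    A rollout that scores above the group mean gets a positive advantage,
    one below the mean gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled for the same utterance.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed; rollouts are only ranked relative to their siblings.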
Computer Science > Machine Learning
arXiv:2602.12714 (cs)
[Submitted on 13 Feb 2026]
Title: ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning
Authors: Esther Sun, Bo-Hao Su, Abinay Reddy Naini, Shinji Watanabe, Carlos Busso
Abstract: Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complex...
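The three-stage pipeline named in the abstract (candidate generation, evidence collection, adjudication) can be illustrated with a toy loop. This is a minimal sketch under heavy assumptions, not the authors' implementation: every name (`InquiryState`, `semantic_probe`, `acoustic_probe`, `run_inquiry`), the keyword cues, the scores, and the pruning threshold are hypothetical, and the tools run in a fixed order here, whereas ADEPT is described as invoking them adaptively.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: utterance record + current candidates
# -> support score in [0, 1] for each candidate emotion.
Tool = Callable[[dict, List[str]], Dict[str, float]]


@dataclass
class InquiryState:
    candidates: List[str]  # evolving candidate emotion set
    evidence: List[Tuple[str, Dict[str, float]]] = field(default_factory=list)


def semantic_probe(utt, candidates):
    # Stand-in for a semantic tool: keyword cues in the transcript.
    cues = {"angry": ["hate", "unfair"], "sad": ["miss", "lost"], "happy": ["great"]}
    text = utt["transcript"].lower()
    return {c: float(any(w in text for w in cues.get(c, []))) for c in candidates}


def acoustic_probe(utt, candidates):
    # Stand-in for an acoustic tool: made-up precomputed per-emotion scores.
    return {c: utt["acoustic_scores"].get(c, 0.0) for c in candidates}


def mean_support(state):
    # Average support per remaining candidate over all evidence so far.
    n = len(state.evidence)
    return {c: sum(s[c] for _, s in state.evidence) / n for c in state.candidates}


def run_inquiry(utt, tools: Dict[str, Tool], initial, drop_below=0.2):
    state = InquiryState(candidates=list(initial))  # 1. candidate generation
    for name, tool in tools.items():                # 2. evidence collection
        state.evidence.append((name, tool(utt, state.candidates)))
        support = mean_support(state)
        kept = [c for c in state.candidates if support[c] >= drop_below]
        state.candidates = kept or state.candidates  # never prune to empty
    support = mean_support(state)                   # 3. adjudication
    ranked = sorted(state.candidates, key=lambda c: -support[c])
    return {"primary": ranked[0], "minor": ranked[1:], "trace": state.evidence}


utterance = {
    "transcript": "I really miss her, it is so unfair",
    "acoustic_scores": {"sad": 0.7, "angry": 0.4, "happy": 0.05},
}
result = run_inquiry(
    utterance,
    {"semantic": semantic_probe, "acoustic": acoustic_probe},
    initial=["sad", "angry", "happy", "neutral"],
)
```

Returning a ranked list with a `minor` field, rather than a single label, mirrors the idea that secondary emotions are kept as signal instead of being discarded, and the `trace` field keeps the evidence behind the decision inspectable.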