[2602.13685] AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning
Summary
AuTAgent introduces a reinforcement learning framework that enhances audio reasoning by selectively integrating external tools, improving the reasoning accuracy of large audio language models.
Why It Matters
This research addresses the limitations of large audio language models (LALMs) in complex reasoning tasks. By proposing a framework that intelligently selects tools based on context, it enhances the performance of audio models, which is crucial for applications in AI-driven audio analysis and processing.
Key Takeaways
- AuTAgent improves audio reasoning by learning when to invoke external tools.
- The framework uses a novel Differential Reward mechanism for sparse feedback.
- Experimental results show significant accuracy improvements across benchmarks.
- AuTAgent demonstrates strong transferability across a variety of audio tasks.
- The research highlights the importance of external tools in enhancing model performance.
arXiv:2602.13685 (cs) — Computer Science > Sound. Submitted on 14 Feb 2026.
Title: AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning
Authors: Siqian Tong, Xuan Li, Yiwei Wang, Baolong Bi, Yujun Cai, Shenghua Liu, Yuchen He, Chengpeng Hao
Abstract: Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to assess context-dependent utility. To address this, we propose AuTAgent (Audio Tool Agent), a reinforcement learning framework that learns when and which tools to invoke. By employing a sparse-feedback training strategy with a novel Differential Reward mechanism, the agent learns to filter out irrelevant tools and invokes external assistance only when it yields a net performance gain over the base model. Experimental results confirm that AuTAgent complements the representation bottleneck of LALMs by providing verifiable acoustic evidence. It improves accuracy by 4.20% / 6.20% and 9.80% / 8.00% for open-source and closed-source backbones on the MMAU Test-mini and the MMAR benchmarks, respectively.
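The abstract describes the Differential Reward only at a high level: the agent is rewarded when tool use yields a net gain over the tool-free base model. A minimal sketch of that idea is below; the function name, the scoring scheme, and the `tool_cost` penalty are illustrative assumptions, not the paper's exact formulation.

```python
def differential_reward(base_correct: bool, tool_correct: bool,
                        tool_cost: float = 0.1) -> float:
    """Reward tool invocation only when it beats the tool-free baseline.

    A hypothetical reward shape: the difference between the tool-augmented
    outcome and the base model's outcome, minus a small fixed cost for
    invoking a tool at all (so 'no net gain' is discouraged).
    """
    base_score = 1.0 if base_correct else 0.0
    tool_score = 1.0 if tool_correct else 0.0
    # Positive only when tools flip a wrong answer to a right one;
    # negative when tools hurt, or add cost without any gain.
    return (tool_score - base_score) - tool_cost


# Three cases a policy trained on this signal would distinguish:
print(differential_reward(base_correct=False, tool_correct=True))   # tool helped
print(differential_reward(base_correct=True, tool_correct=True))    # no net gain
print(differential_reward(base_correct=True, tool_correct=False))   # tool hurt
```

Under this shape, the agent maximizes reward by invoking tools only on inputs where the base model would otherwise fail, which matches the selective-invocation behavior the abstract describes.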