[2510.07978] VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
Summary
The paper introduces VoiceAgentBench, a benchmark for evaluating voice assistants' capabilities in agentic tasks, measuring their performance and exposing their limitations across English and six Indic languages.
Why It Matters
As voice assistants become integral to daily tasks, understanding their effectiveness in complex scenarios is crucial. This research addresses gaps in current evaluation methods, providing a framework for assessing their performance in real-world applications, particularly in multilingual contexts.
Key Takeaways
- VoiceAgentBench evaluates voice assistants in realistic spoken settings.
- ASR-LLM pipelines outperform end-to-end SpeechLMs in agentic tasks.
- Performance varies significantly across languages, with challenges in Indic languages.
- Sequential workflows and safety evaluations reveal persistent limitations.
- The benchmark is publicly available, promoting further research and development.
Paper Details
Computer Science > Artificial Intelligence, arXiv:2510.07978 (cs)
Submitted on 9 Oct 2025 (v1); last revised 13 Feb 2026 (this version, v3)
Title: VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
Authors: Dhruv Jain, Harshit Shukla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark for evaluating SpeechLMs in realistic spoken agentic settings, comprising 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages. To ensure speaker diversity, we further simulate speaker variability using a novel sampling strategy that selects audios for TTS voice conversion based on speaker embeddings to maximize acoustic diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Across...
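The abstract describes selecting audios for TTS voice conversion based on speaker embeddings so as to maximize acoustic diversity. The paper's exact selection algorithm is not given here; one common way to realize that goal is greedy farthest-point sampling over the embedding space, sketched below with hypothetical names (`select_diverse_speakers` and the cosine-distance criterion are assumptions, not the authors' stated method):

```python
import numpy as np

def select_diverse_speakers(embeddings, k, seed_index=0):
    """Greedy farthest-point selection over speaker embeddings.

    Picks k audios whose embeddings are maximally spread out under
    cosine distance, so a TTS voice-conversion set built from them
    covers a wide range of voices. A sketch, not the paper's method.
    """
    X = np.asarray(embeddings, dtype=float)
    # Normalise rows so dot products become cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    chosen = [seed_index]
    # Cosine distance from every point to its nearest chosen point.
    min_dist = 1.0 - X @ X[seed_index]
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # farthest from current set
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - X @ X[nxt])
    return chosen
```

With embeddings that form a few tight voice clusters, the greedy rule picks one representative per cluster before revisiting any cluster, which is exactly the diversity behavior the benchmark's sampling strategy aims for.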
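The evaluation reportedly measures tool selection accuracy, structural consistency, and correctness of tool invocations. The paper's scoring code is not reproduced here; a minimal sketch of how such a three-way check might look, assuming a hypothetical JSON tool-call format with `tool` and `arguments` fields:

```python
import json

def score_tool_call(predicted: str, gold: dict) -> dict:
    """Score one predicted tool call against a gold reference.

    Returns three binary sub-scores (names are illustrative):
      - structure: the prediction parses as JSON with a 'tool' name
                   and an 'arguments' object
      - tool:      the predicted tool name matches the gold one
      - arguments: every gold argument appears with the right value
    """
    scores = {"structure": False, "tool": False, "arguments": False}
    try:
        call = json.loads(predicted)
    except json.JSONDecodeError:
        return scores  # malformed output fails all checks
    if not (isinstance(call, dict) and "tool" in call
            and isinstance(call.get("arguments"), dict)):
        return scores
    scores["structure"] = True
    scores["tool"] = call["tool"] == gold["tool"]
    scores["arguments"] = scores["tool"] and all(
        call["arguments"].get(k) == v for k, v in gold["arguments"].items()
    )
    return scores
```

Separating the three sub-scores mirrors the benchmark's distinction between picking the right tool and filling its arguments correctly: a model can name the right tool yet still fail the invocation.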