[2603.18048] DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models
Computer Science > Artificial Intelligence

arXiv:2603.18048 (cs)

[Submitted on 17 Mar 2026 (v1), last revised 20 Mar 2026 (this version, v2)]

Title: DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Authors: Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Yutong Zhang, Ziteng Wang, Ruofan Liao, Weisheng Xu, Sichen Liu

Abstract: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To study this question systematically, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. We then design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of tex...
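
The abstract does not spell out the diagnostic metrics, but one natural instance is a text-reliance rate over conflict stimuli: the fraction of items on which the model's answer follows the conflicting textual cue rather than the acoustic ground truth. Below is a minimal Python sketch of such a metric, assuming a hypothetical per-item record with audio_label, text_label, and prediction fields; these names and the function text_reliance_rate are illustrative, not the paper's actual API or exact metric.

    from dataclasses import dataclass

    @dataclass
    class ConflictStimulus:
        # Hypothetical record for one conflict item (illustrative fields).
        audio_label: str   # ground truth carried by the acoustic signal (e.g., prosody = "angry")
        text_label: str    # conflicting cue carried by the text (content or misleading prompt)
        prediction: str    # the model's answer on this item

    def text_reliance_rate(items: list[ConflictStimulus]) -> float:
        """Fraction of conflict items where the model follows the textual cue
        instead of the acoustic ground truth: 0.0 means fully acoustically
        faithful, 1.0 means fully text-driven. A sketch of one plausible
        diagnostic, not the paper's exact formulation."""
        conflicts = [s for s in items if s.audio_label != s.text_label]
        if not conflicts:
            return 0.0
        followed_text = sum(1 for s in conflicts if s.prediction == s.text_label)
        return followed_text / len(conflicts)

In the paper's multi-level setup, one would compute this rate separately for each level of textual influence (semantic conflict in the content, misleading prompt, and their combination) and compare the values to separate content-driven bias from prompt-induced sycophancy.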