[2602.11909] Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
Computer Science > Sound

arXiv:2602.11909 (cs)

[Submitted on 12 Feb 2026 (v1), last revised 28 Feb 2026 (this version, v2)]

Title: Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

Authors: Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou

Abstract: The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning...
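The audio-interleaved reasoning loop described in the abstract (the model localizes a salient segment, re-listens to it, and continues reasoning with the re-encoded segment in context) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: all names (`ToyLALM`, `ReListen`, `interleaved_reason`), the 16 kHz sample rate, and the step budget are assumptions.

```python
# Hypothetical sketch of audio-interleaved reasoning: instead of reasoning
# over a single one-time audio encoding, the model may emit "re-listen"
# actions targeting salient segments, which are re-encoded and fed back.
# All class and function names here are illustrative, not the paper's API.

from dataclasses import dataclass

SAMPLE_RATE = 16_000  # assumed sample rate in Hz

@dataclass
class ReListen:
    start: float  # segment start, seconds
    end: float    # segment end, seconds

@dataclass
class Answer:
    text: str

class ToyLALM:
    """Stand-in model: requests one re-listen, then answers."""
    def __init__(self):
        self.heard = []  # segments re-listened to so far

    def step(self, question, audio, context):
        if not self.heard:
            # Stage 1 behavior: localize a salient span to re-listen to.
            return ReListen(start=2.0, end=4.0)
        # Stage 2 behavior: answer, grounded in the re-heard segment(s).
        return Answer(text=f"answer based on {len(self.heard)} re-listen(s)")

def interleaved_reason(model, question, audio, max_steps=4):
    """Run the reasoning loop until the model answers or the budget runs out."""
    context = []
    for _ in range(max_steps):
        action = model.step(question, audio, context)
        if isinstance(action, Answer):
            return action.text
        # Re-encode only the requested segment and append it to the context.
        segment = audio[int(action.start * SAMPLE_RATE):
                        int(action.end * SAMPLE_RATE)]
        model.heard.append(segment)
        context.append(("audio_segment", action.start, action.end))
    return "no answer within step budget"

# Usage: a 5-second silent clip stands in for real audio samples.
audio = [0.0] * (SAMPLE_RATE * 5)
print(interleaved_reason(ToyLALM(), "what happened?", audio))
```

The key design point the sketch captures is that audio stays an active component of the loop: each re-listen adds a freshly encoded segment to the context rather than relying on the initial one-time encoding.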