[2602.15318] Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
Summary
The paper introduces Sparrow, a framework that makes speculative decoding practical for Video Large Language Models (Vid-LLMs) by anchoring the draft model's attention on text hidden states and offloading visual computation to the target model, improving computational efficiency without reprocessing raw video tokens.
Why It Matters
As video content becomes increasingly prevalent, fast Vid-LLM inference is crucial for real-time applications. Sparrow targets the specific failure modes, attention dilution and key-value cache explosion, that cause speculative decoding to collapse on long videos, so its speedups apply where acceleration is needed most.
Key Takeaways
- Sparrow framework improves speculative decoding in Vid-LLMs.
- Utilizes text-anchored window attention to enhance efficiency.
- Achieves an average speedup of 2.82x in processing long video sequences.
- Addresses performance degradation through visual state bridging.
- Introduces a multi-token prediction strategy for better training-inference alignment.
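The speculative-decoding loop the takeaways refer to can be illustrated with toy models. This is a minimal sketch of the generic draft-then-verify scheme, not Sparrow's actual implementation: `target_next`, `draft_next`, and `speculative_step` are hypothetical stand-ins, and verification is done sequentially here although real systems score all drafted tokens in one parallel target forward pass.

```python
import random

random.seed(0)

def target_next(ctx):
    # Toy deterministic "target model": next token = (sum of context) mod 10.
    return sum(ctx) % 10

def draft_next(ctx):
    # Toy "draft model": agrees with the target ~80% of the time.
    guess = target_next(ctx)
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    Greedy accept/reject: keep drafted tokens while they match the target's
    greedy choice; on the first mismatch, substitute the target's token and
    stop. If all k are accepted, append one bonus token from the target.
    """
    drafted, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        drafted.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafted:
        want = target_next(c)
        if t == want:              # draft token verified
            accepted.append(t)
            c.append(t)
        else:                      # rejected: take the target's correction
            accepted.append(want)
            break
    else:
        accepted.append(target_next(c))  # bonus token when all k accepted
    return accepted

out = speculative_step([1, 2, 3])
print(out)
```

By construction the output of each step is always a prefix of what greedy decoding with the target model alone would produce, which is why speculative decoding is lossless: the draft model only changes how many tokens are confirmed per target pass, never which tokens.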
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15318 (cs) [Submitted on 17 Feb 2026]
Title: Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
Authors: Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li
Abstract: Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Ad...
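The window attention the abstract mentions builds on a standard sliding-window causal mask. The sketch below shows only that generic mask in NumPy; it is an assumption-laden illustration and does not model Sparrow's text anchoring or its reuse of target-model hidden states in place of raw visual tokens.

```python
import numpy as np

def window_causal_mask(seq_len, window):
    """Boolean attention mask for sliding-window causal attention.

    Position i may attend to position j iff j <= i (causality) and
    i - j < window (locality). Each query therefore sees at most
    `window` keys, keeping attention cost linear in sequence length.
    """
    i = np.arange(seq_len)[:, None]  # query positions (column vector)
    j = np.arange(seq_len)[None, :]  # key positions (row vector)
    return (j <= i) & (i - j < window)

mask = window_causal_mask(6, 3)
print(mask.astype(int))
```

Restricting the draft model's receptive field this way is one standard remedy for the context-window mismatch the abstract describes, since the draft never has to index into the target's full, much longer key-value cache.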