[2602.21371] Interleaved Head Attention
Summary
The paper introduces Interleaved Head Attention (IHA), a modification of Multi-Head Attention (MHA) that enables cross-head interactions via learned pseudo-heads, improving multi-step reasoning in Large Language Models at modest parameter cost.
Why It Matters
As Large Language Models continue to evolve, addressing the limitations of traditional Multi-Head Attention is crucial for enhancing their reasoning abilities. IHA offers a promising solution by allowing better aggregation of information across attention heads, which is vital for complex tasks requiring multi-step reasoning.
Key Takeaways
- Interleaved Head Attention (IHA) allows for cross-head mixing, improving multi-step reasoning.
- IHA induces up to P^2 attention patterns per head with only modest parameter overhead, O(H^2 P), making it more expressive than traditional MHA at similar cost.
- On real-world evaluations, IHA improves performance on tasks such as Multi-Key retrieval and multi-step reasoning benchmarks.
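To get a feel for the "modest parameter overhead" claim, here is a rough count under the abstract's O(H^2 P) scaling, assuming separate mixing weights for queries, keys, and values. The concrete values of H, P, and d_model below are illustrative, not taken from the paper.

```python
# Illustrative sizes (assumptions, not the paper's configuration)
H, P, d_model = 16, 16, 2048

# Mixing weights: for each of the H heads, P pseudo-heads each combine
# all H original heads, separately for queries, keys, and values.
mixing_params = 3 * H * H * P  # O(H^2 P)

# Baseline attention projections for comparison: Wq, Wk, Wv, Wo.
baseline_params = 4 * d_model * d_model

print(mixing_params, baseline_params, mixing_params / baseline_params)
```

For these sizes the mixing weights add 12,288 parameters against roughly 16.8M in the baseline projections, i.e. well under 0.1% overhead.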
Computer Science > Machine Learning
arXiv:2602.21371 (cs)
[Submitted on 24 Feb 2026]
Title: Interleaved Head Attention
Authors: Sai Surya Duvvuri, Chanakya Ekbote, Rachit Bansal, Rishabh Tiwari, Devvrit Khatri, David Brandfonbrener, Paul Liang, Inderjit Dhillon, Manzil Zaheer
Abstract: Multi-Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: $H$ attention heads produce exactly $H$ independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi-step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token-to-token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross-head mixing by constructing $P$ pseudo-heads per head (typically $P=H$), where each pseudo query/key/value is a learned linear combination of all $H$ original queries, keys and values respectively. Interactions between pseudo-query and pseudo-key heads induce up to $P^2$ attention patterns per head with modest parameter overhead $\mathcal{O}(H^2P)$. We provide theory showing improved efficiency in terms of number of parameters on the synthetic Polynomial t...
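The mechanism described in the abstract can be sketched in a few lines: pseudo queries and keys are learned linear combinations of the original heads, and pairing every pseudo-query head with every pseudo-key head yields up to P^2 attention maps. This is a minimal NumPy sketch under assumed shapes, not the paper's implementation; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: heads, pseudo-heads, sequence length, head dim.
H, P, T, d = 4, 4, 8, 16

# Original per-head queries and keys: shape (H, T, d).
Q = rng.standard_normal((H, T, d))
K = rng.standard_normal((H, T, d))

# Learned mixing weights (random here): each pseudo-head is a linear
# combination of all H original heads.
Wq = rng.standard_normal((P, H))
Wk = rng.standard_normal((P, H))

# Pseudo queries/keys: shape (P, T, d).
Qp = np.einsum("ph,htd->ptd", Wq, Q)
Kp = np.einsum("ph,htd->ptd", Wk, K)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross interactions: pseudo-query head i attends with pseudo-key head j,
# giving P * P distinct attention patterns of shape (T, T).
scores = np.einsum("itd,jsd->ijts", Qp, Kp) / np.sqrt(d)  # (P, P, T, T)
attn = softmax(scores, axis=-1)

print(attn.shape)  # (4, 4, 8, 8): P^2 attention maps
```

The same mixing would apply to values; the paper's actual aggregation of the resulting P^2 patterns back into H head outputs is not shown here.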