[2602.01203] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
Computer Science > Computation and Language
arXiv:2602.01203 (cs)
[Submitted on 1 Feb 2026 (v1), last revised 3 May 2026 (this version, v2)]

Title: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
Authors: Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next, yet a comprehensive analysis of the relationship among these attention mechanisms is still lacking. In this work, we provide both theoretical and empirical evidence that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention,...
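
The abstract does not give the exact form of the auxiliary load balancing loss. The snippet below is a minimal sketch, assuming heads are treated like MoE experts and a head's "usage" is approximated by the attention mass it places on non-sink tokens (everything except the first position); the function name `head_load_balance_loss`, this usage definition, and the loss coefficient are illustrative assumptions, not the paper's method.

```python
# Minimal sketch (not the paper's formulation) of an auxiliary
# load-balancing loss over attention heads, analogous to MoE router
# load-balancing losses: push per-head usage toward a uniform distribution.
import torch


def head_load_balance_loss(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: [batch, num_heads, query_len, key_len] softmax attention weights."""
    # Attention mass each head assigns to the sink (first key position).
    sink_mass = attn_probs[..., 0]            # [B, H, Q]
    # "Active" mass per head = attention not absorbed by the sink.
    active_mass = 1.0 - sink_mass              # [B, H, Q]
    # Average over batch and query positions -> per-head usage, normalized
    # to a distribution over heads.
    usage = active_mass.mean(dim=(0, 2))       # [H]
    usage = usage / (usage.sum() + 1e-9)
    num_heads = usage.shape[0]
    # num_heads * sum(usage^2) is minimized (= 1) when usage is uniform,
    # so minimizing it spreads load across heads.
    return num_heads * (usage * usage).sum()


if __name__ == "__main__":
    # Usage sketch: add the auxiliary term to the LM loss with a small weight.
    attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
    aux = head_load_balance_loss(attn)
    # total_loss = lm_loss + 0.01 * aux   (lm_loss omitted in this sketch)
    print(aux.item())
```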