[2602.01203] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

arXiv - Machine Learning · 4 min read

About this article

Abstract page for arXiv paper 2602.01203: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Computer Science > Computation and Language

arXiv:2602.01203 (cs) [Submitted on 1 Feb 2026 (v1), last revised 3 May 2026 (this version, v2)]

Title: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Authors: Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, ...
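The abstract describes an auxiliary load-balancing loss applied to attention heads, analogous to the router balancing terms used in MoE layers. The paper itself is not reproduced here, so the snippet below is only a minimal, hypothetical PyTorch sketch of what such a head-level balancing term could look like. It assumes the per-token "routing" signal for each head is the attention mass it places away from the sink (first) token, and it reuses a Switch-Transformer-style f·P penalty; the authors' actual formulation may differ.

```python
import torch


def head_load_balancing_loss(attn_weights: torch.Tensor) -> torch.Tensor:
    """Hypothetical auxiliary load-balancing loss over attention heads.

    attn_weights: softmax attention maps of shape (batch, num_heads, q_len, k_len).
    Each head's non-sink attention mass (everything not placed on the first/sink
    key) is treated as a soft routing signal, and imbalance across heads is
    penalized with a Switch-Transformer-style term. Illustrative sketch only,
    not the paper's exact loss.
    """
    # Per-token, per-head mass NOT sent to the sink (first key position).
    non_sink_mass = 1.0 - attn_weights[..., 0]                      # (B, H, Q)

    # "Router probability" per head: normalize non-sink mass across heads.
    probs = non_sink_mass / non_sink_mass.sum(dim=1, keepdim=True).clamp_min(1e-9)

    num_heads = attn_weights.shape[1]
    # f_i: fraction of tokens whose largest non-sink mass lands on head i.
    assignment = torch.nn.functional.one_hot(probs.argmax(dim=1), num_heads).float()  # (B, Q, H)
    f = assignment.mean(dim=(0, 1))                                 # (H,)
    # P_i: mean soft "routing" probability for head i.
    p = probs.mean(dim=(0, 2))                                      # (H,)

    # Minimized (value 1.0) when load is perfectly uniform across heads.
    return num_heads * torch.sum(f * p)
```

In training, a term like this would typically be added to the language-modeling objective with a small coefficient, e.g. `loss = lm_loss + 0.01 * head_load_balancing_loss(attn)`, so that head usage is balanced without dominating the main loss.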

Originally published on May 05, 2026. Curated by AI News.

Related Articles

[2602.07238] Is there "Secret Sauce" in Large Language Model Development? · Llms · arXiv - Machine Learning · 3 min

[2601.01322] LinMU: Multimodal Understanding Made Linear · Llms · arXiv - Machine Learning · 4 min

[2512.05525] Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement · Llms · arXiv - Machine Learning · 4 min

[2511.21678] Agentic Learner with Grow-and-Refine Multimodal Semantic Memory · Llms · arXiv - Machine Learning · 4 min
