[2602.01203] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

arXiv - Machine Learning · 4 min read

About this article

Abstract page for arXiv paper 2602.01203: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Computer Science > Computation and Language

arXiv:2602.01203 (cs) [Submitted on 1 Feb 2026 (v1), last revised 3 May 2026 (this version, v2)]

Title: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Authors: Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, ...
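The abstract describes an auxiliary load-balancing loss applied to attention heads, analogous to the router balancing terms used in MoE layers. The paper itself is not reproduced here, so the snippet below is only a minimal, hypothetical PyTorch sketch of what such a head-level balancing term could look like. It assumes the per-token "routing" signal for each head is the attention mass it places away from the sink (first) token, and it reuses a Switch-Transformer-style f·P penalty; the authors' actual formulation may differ.

```python
import torch


def head_load_balancing_loss(attn_weights: torch.Tensor) -> torch.Tensor:
    """Hypothetical auxiliary load-balancing loss over attention heads.

    attn_weights: softmax attention maps of shape (batch, num_heads, q_len, k_len).
    Each head's non-sink attention mass (everything not placed on the first/sink
    key) is treated as a soft routing signal, and imbalance across heads is
    penalized with a Switch-Transformer-style term. Illustrative sketch only,
    not the paper's exact loss.
    """
    # Per-token, per-head mass NOT sent to the sink (first key position).
    non_sink_mass = 1.0 - attn_weights[..., 0]                      # (B, H, Q)

    # "Router probability" per head: normalize non-sink mass across heads.
    probs = non_sink_mass / non_sink_mass.sum(dim=1, keepdim=True).clamp_min(1e-9)

    num_heads = attn_weights.shape[1]
    # f_i: fraction of tokens whose largest non-sink mass lands on head i.
    assignment = torch.nn.functional.one_hot(probs.argmax(dim=1), num_heads).float()  # (B, Q, H)
    f = assignment.mean(dim=(0, 1))                                 # (H,)
    # P_i: mean soft "routing" probability for head i.
    p = probs.mean(dim=(0, 2))                                      # (H,)

    # Minimized (value 1.0) when load is perfectly uniform across heads.
    return num_heads * torch.sum(f * p)
```

In training, a term like this would typically be added to the language-modeling objective with a small coefficient, e.g. `loss = lm_loss + 0.01 * head_load_balancing_loss(attn)`, so that head usage is balanced without dominating the main loss.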

Originally published on May 05, 2026. Curated by AI News.

Related Articles

[2602.07238] Is there "Secret Sauce" in Large Language Model Development? · Llms · arXiv - Machine Learning · 3 min

[2601.01322] LinMU: Multimodal Understanding Made Linear · Llms · arXiv - Machine Learning · 4 min

[2512.05525] Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement · Llms · arXiv - Machine Learning · 4 min

[2511.21678] Agentic Learner with Grow-and-Refine Multimodal Semantic Memory · Llms · arXiv - Machine Learning · 4 min
