[2602.14159] Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization
Summary
This paper presents two novel regularization losses for enhancing the specialization of Sparse Mixture-of-Experts (MoE) models, improving routing efficiency without requiring architectural changes.
Why It Matters
Advances in MoE models are crucial for optimizing deep learning architectures, particularly for scaling Transformers. By addressing expert overlap and routing ambiguity, this research enables more efficient model training and inference, which is vital for applications in machine learning and AI.
Key Takeaways
- Introduces intra-layer and cross-layer regularization losses for MoE models.
- Enhances expert specialization and routing efficiency without modifying architectures.
- Demonstrates consistent task gains and lower-entropy routing through extensive experiments.
- Implemented as a drop-in module for Megatron-LM, facilitating easy integration.
- Contributes to faster inference via more stable expert pathways.
Computer Science > Machine Learning
arXiv:2602.14159 (cs)
[Submitted on 15 Feb 2026]
Title: Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization
Authors: Rizhen Hu, Yuan Cao, Boao Kong, Mou Sun, Kun Yuan
Abstract: Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap (redundant representations across experts) and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert archit...
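To make the two objectives concrete, here is a minimal, self-contained sketch of what the abstract describes: a loss that penalizes pairwise cosine similarity between experts' activations on the same token, and a loss that (negated, so it can be minimized) rewards the routing probability mass that adjacent layers place on the same Top-k experts. All function names, the pairwise-mean aggregation, and the joint-probability form are illustrative assumptions, not the paper's actual implementation, which operates on SwiGLU activations and router logits inside a Megatron-LM training loop.

```python
import math

def cosine(u, v):
    # Cosine similarity with a small epsilon for numerical safety.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def intra_layer_specialization_loss(expert_acts):
    """Mean pairwise cosine similarity between experts' activation
    vectors on one token; minimizing it pushes experts apart
    (illustrative stand-in for the paper's intra-layer loss)."""
    n = len(expert_acts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(expert_acts[i], expert_acts[j]) for i, j in pairs) / max(len(pairs), 1)

def topk(probs, k):
    # Indices of the k largest routing probabilities.
    return set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])

def cross_layer_coupling_loss(probs_a, probs_b, k):
    """Negative probability mass two adjacent layers jointly place on
    their shared Top-k experts; minimizing it encourages coherent
    expert pathways across depth (assumed form, not the paper's)."""
    shared = topk(probs_a, k) & topk(probs_b, k)
    return -sum(probs_a[e] * probs_b[e] for e in shared)

# Identical expert activations -> maximal overlap penalty (~1.0);
# orthogonal activations -> no penalty.
print(round(intra_layer_specialization_loss([[1.0, 0.0], [1.0, 0.0]]), 4))
print(intra_layer_specialization_loss([[1.0, 0.0], [0.0, 1.0]]))

# Adjacent routers that agree on their Top-2 experts couple strongly.
print(round(cross_layer_coupling_loss([0.5, 0.4, 0.1], [0.6, 0.3, 0.1], k=2), 4))
```

In practice both terms would be computed per token and batch-averaged, then added to the standard load-balancing loss with their own weights, consistent with the abstract's claim that they are orthogonal, drop-in regularizers.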