[2602.12556] SD-MoE: Spectral Decomposition for Effective Expert Specialization
Summary
The paper introduces SD-MoE (Spectral-Decoupled MoE), a method that improves expert specialization in Mixture-of-Experts architectures by decomposing expert parameters and gradients in the spectral space, improving model performance with minimal additional computation.
Why It Matters
As large language models (LLMs) scale, effective expert specialization becomes crucial for making full use of model capacity. This research identifies why specialization often fails in existing MoE architectures and offers a remedy that can be integrated into a range of systems.
Key Takeaways
- SD-MoE enhances expert specialization in Mixture-of-Experts models.
- The method addresses overlapping spectral components and gradient alignment issues.
- It incurs minimal additional computation while improving performance.
- SD-MoE can be integrated into existing architectures like Qwen and DeepSeek.
- The findings highlight the importance of spectral analysis in optimizing AI models.
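The first takeaway, that experts end up sharing dominant spectral components, can be checked numerically. The sketch below (an illustration, not the paper's code) measures the normalized overlap between the top-k singular subspaces of two expert weight matrices; values near 1 indicate that the experts' dominant directions coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dominant_subspace(W, k):
    # Top-k left singular vectors of a weight matrix.
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]

def spectral_overlap(W1, W2, k=8):
    # Normalized overlap between dominant subspaces, in [0, 1]:
    # 1 means identical top-k subspaces, ~k/n means unrelated ones.
    U1 = dominant_subspace(W1, k)
    U2 = dominant_subspace(W2, k)
    return np.linalg.norm(U1.T @ U2, "fro") ** 2 / k

# Two synthetic "experts" sharing a common low-rank component plus noise,
# mimicking the overlapping dominant components the paper reports.
shared = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 32))
e1 = shared + 0.1 * rng.normal(size=(64, 32))
e2 = shared + 0.1 * rng.normal(size=(64, 32))
e3 = rng.normal(size=(64, 32))  # an unrelated expert

print(spectral_overlap(e1, e2))  # high: dominant subspaces coincide
print(spectral_overlap(e1, e3))  # low: subspaces are nearly unrelated
```

The diagnostic uses only NumPy's SVD; on real MoE checkpoints one would apply it to the up/down projection matrices of each expert's FFN.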
Computer Science > Machine Learning
arXiv:2602.12556 (cs) [Submitted on 13 Feb 2026]
Title: SD-MoE: Spectral Decomposition for Effective Expert Specialization
Authors: Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang
Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting effective capacity and model performance. In this work, we analyze parameter and gradient spaces from a spectral perspective and uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by the ubiquitous low-rank structure of human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space. SD-MoE improves performance...
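The abstract says SD-MoE decomposes gradients in the spectral space; the paper's exact procedure is not given in this excerpt, but the idea of removing a shared dominant gradient subspace can be sketched as follows. Here `decouple_gradients` is a hypothetical helper, an assumption for illustration: it estimates the leading shared directions across expert gradients via SVD and projects each gradient away from them, so updates emphasize expert-specific directions.

```python
import numpy as np

rng = np.random.default_rng(1)

def decouple_gradients(grads, k=1):
    # Illustrative decoupling (not the paper's exact algorithm):
    # estimate the dominant shared subspace across flattened expert
    # gradients, then remove each gradient's component along it.
    G = np.stack([g.ravel() for g in grads])         # (num_experts, d)
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    shared = Vt[:k]                                  # (k, d) shared basis
    decoupled = G - (G @ shared.T) @ shared          # project out shared part
    return [d.reshape(grads[0].shape) for d in decoupled]

# Toy per-expert gradients for a 16x16 weight matrix.
grads = [rng.normal(size=(16, 16)) for _ in range(4)]
out = decouple_gradients(grads, k=1)
# Each decoupled gradient is orthogonal to the removed shared direction.
```

In a real training loop this projection would be applied between the backward pass and the optimizer step; the choice of k (how many shared directions to remove) would be a tunable hyperparameter.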