[2507.00390] MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
Summary
The paper introduces MoNE, a structured pruning method for Mixture-of-Experts (MoE) models that replaces redundant experts with lightweight novices, reducing memory costs while minimizing performance degradation.
Why It Matters
As large language models grow in complexity, efficient resource management becomes crucial. MoNE addresses the memory overhead associated with MoE models by proposing a method that maintains performance while reducing redundancy. This innovation is significant for researchers and practitioners aiming to optimize AI models without sacrificing accuracy.
Key Takeaways
- MoNE effectively replaces redundant experts in MoE models with lightweight novices.
- The method demonstrates minimal accuracy degradation while achieving significant memory savings.
- Extensive experiments show MoNE outperforms existing pruning methods robustly across model architectures, calibration data sources, and calibration sample sizes.
Computer Science > Machine Learning
arXiv:2507.00390 (cs)
[Submitted on 1 Jul 2025 (v1), last revised 22 Feb 2026 (this version, v2)]
Title: MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
Authors: Geng Zhang, Yuxuan Han, Yuxuan Lou, Yiqi Zhang, Wangbo Zhao, Yang You
Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory. While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices (unbiased estimations of their original outputs), minimizing performance degradation. Extensive experiments demonstrate that ...
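The redundancy criterion described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes calibration data in the form of per-token expert assignments and expert outputs, computes the two metrics named in the abstract (access frequency and output variance), and replaces low-frequency, low-variance experts with a "novice" modeled here as a constant function returning the expert's mean calibration output, an unbiased estimate of what the expert would have produced. The function names, thresholds, and data layout are all assumptions for illustration.

```python
import numpy as np

def redundancy_metrics(assignments, outputs, num_experts):
    """Per-expert access frequency and output variance from calibration data.

    assignments: (T,) int array, the expert each token was routed to.
    outputs: (T, D) float array, that expert's output for each token.
    """
    freq = np.bincount(assignments, minlength=num_experts) / len(assignments)
    var = np.zeros(num_experts)
    for e in range(num_experts):
        outs = outputs[assignments == e]
        if len(outs):
            var[e] = outs.var()
    return freq, var

def replace_with_novices(experts, assignments, outputs, freq_thresh, var_thresh):
    """Replace low-frequency, low-variance experts with 'novices':
    here, constant functions returning the expert's mean calibration
    output (an unbiased estimate of its output on similar inputs)."""
    freq, var = redundancy_metrics(assignments, outputs, len(experts))
    pruned = []
    for e, fn in enumerate(experts):
        if freq[e] < freq_thresh and var[e] < var_thresh:
            mean_out = outputs[assignments == e].mean(axis=0)
            pruned.append(lambda x, m=mean_out: m)  # hypothetical novice
        else:
            pruned.append(fn)  # keep the original expert
    return pruned
```

A rarely used expert with near-constant outputs costs almost nothing to approximate this way, which is why the paper targets exactly that combination of low access frequency and low output variance.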