[2602.15521] ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Summary
The paper presents ExpertWeaver, a framework that converts dense LLMs into sparse Mixture-of-Experts (MoE) models by exploiting Gated Linear Unit (GLU) activation patterns, achieving strong performance without extensive retraining.
Why It Matters
As LLMs grow in complexity, efficient scaling is crucial. ExpertWeaver addresses the challenge of converting dense models to MoE architectures, which can improve computational efficiency and model performance, making it relevant for researchers and practitioners in machine learning and AI.
Key Takeaways
- ExpertWeaver utilizes GLU activation patterns for efficient MoE conversion.
- The framework allows for training-free dynamic structural pruning and improved initialization.
- ExpertWeaver outperforms existing dense-to-MoE conversion methods.
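To make the MoE setting concrete, the sketch below shows standard top-k sparse expert routing, where each token activates only a few experts; this illustrates the general mechanism, not ExpertWeaver's specific router, and all dimensions are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (illustrative only).
d_model, n_experts, top_k, n_tokens = 16, 8, 2, 4

# A learned router projects each token onto per-expert logits.
W_router = rng.normal(size=(d_model, n_experts))
X = rng.normal(size=(n_tokens, d_model))

logits = X @ W_router                              # (n_tokens, n_experts)

# Sparse activation: keep only the top-k experts per token,
# then softmax over just those k scores.
idx = np.argsort(logits, axis=1)[:, -top_k:]       # chosen expert indices
scores = np.take_along_axis(logits, idx, axis=1)   # their logits
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(idx.shape)                # (4, 2): 2 experts chosen per token
print(weights.sum(axis=1))      # each row of mixing weights sums to 1
```

Because only `top_k` of `n_experts` expert FFNs run per token, compute per token stays roughly constant while total parameter count scales with `n_experts`.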
Computer Science > Computation and Language
arXiv:2602.15521 (cs)
[Submitted on 17 Feb 2026]
Title: ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Authors: Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng
Abstract: Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches, which use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neuron-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture...
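The abstract's core observation is that GLU gate activations expose which FFN neurons fire together, and that this co-activation structure can seed experts. The sketch below illustrates one plausible reading of that idea, under stated assumptions: random weights stand in for a pretrained layer, a SiLU-gated GLU is assumed, and a simple k-means over per-neuron activation profiles stands in for whatever grouping procedure the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
d_model, d_ff, n_tokens, n_experts = 16, 64, 256, 4

# Random weights standing in for a pretrained dense GLU FFN layer.
W_gate = rng.normal(size=(d_model, d_ff))
X = rng.normal(size=(n_tokens, d_model))

def silu(x):
    return x / (1.0 + np.exp(-x))

# GLU gate activations tell us which FFN neurons fire for which tokens.
gate_act = silu(X @ W_gate)             # (n_tokens, d_ff)
active = gate_act > 0.0                 # binary activation pattern

# Group neurons with similar activation profiles into candidate experts
# via plain k-means (a stand-in for the paper's grouping method).
profiles = active.T.astype(float)       # (d_ff, n_tokens) per-neuron profiles
centroids = profiles[rng.choice(d_ff, n_experts, replace=False)]
for _ in range(10):
    dists = ((profiles[:, None, :] - centroids[None]) ** 2).sum(-1)
    labels = dists.argmin(1)            # expert assignment per neuron
    for k in range(n_experts):
        if (labels == k).any():
            centroids[k] = profiles[labels == k].mean(0)

print([int((labels == k).sum()) for k in range(n_experts)])  # neurons per expert
```

Each resulting neuron group would then initialize one expert's FFN slice, so the converted MoE inherits the dense model's weights rather than being trained from scratch.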