[2602.15521] ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

arXiv - Machine Learning

Summary

The paper presents ExpertWeaver, a framework that converts dense LLMs into sparse Mixture-of-Experts (MoE) models by exploiting Gated Linear Unit (GLU) activation patterns, achieving strong performance without extensive additional training.

Why It Matters

As LLMs continue to grow, efficient scaling becomes crucial. ExpertWeaver addresses the challenge of converting pretrained dense models into MoE architectures, which can improve computational efficiency and model performance, making it relevant to researchers and practitioners in machine learning and AI.

Key Takeaways

  • ExpertWeaver exploits GLU activation patterns to guide efficient dense-to-MoE conversion.
  • The framework supports training-free dynamic structural pruning and improved expert initialization (see the sketch after this list).
  • ExpertWeaver outperforms existing dense-to-MoE conversion methods.
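
To make the "training-free" angle concrete, here is a minimal, hypothetical sketch of the general idea behind partitioning a dense GLU feed-forward layer into expert groups and activating only the top-scoring groups per token. The class name GroupedGLUFFN, the num_experts/top_k values, the gate-magnitude scoring heuristic, and the gate_proj/up_proj/down_proj layout are illustrative assumptions, not the paper's actual conversion procedure.

```python
# Hypothetical sketch: reuse a dense GLU FFN's pretrained weights unchanged,
# partition its neurons into contiguous expert groups, and activate only the
# top-k groups per token. Illustrative only; not ExpertWeaver's exact algorithm.
import torch
import torch.nn.functional as F

class GroupedGLUFFN(torch.nn.Module):
    def __init__(self, dense_ffn, num_experts=8, top_k=2):
        super().__init__()
        d_ff = dense_ffn.gate_proj.out_features
        assert d_ff % num_experts == 0
        self.group_size = d_ff // num_experts
        self.num_experts = num_experts
        self.top_k = top_k
        # Reuse the pretrained dense weights as-is (training-free).
        self.gate_proj = dense_ffn.gate_proj   # d_model -> d_ff
        self.up_proj = dense_ffn.up_proj       # d_model -> d_ff
        self.down_proj = dense_ffn.down_proj   # d_ff -> d_model

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))       # (..., d_ff) GLU gate activations
        up = self.up_proj(x)                   # (..., d_ff)
        h = gate * up                          # element-wise GLU product
        # Score each neuron group by the mean magnitude of its gate activations.
        scores = gate.abs().reshape(*gate.shape[:-1], self.num_experts,
                                    self.group_size).mean(dim=-1)
        topk = scores.topk(self.top_k, dim=-1).indices   # (..., top_k)
        # Keep only the selected groups; zero out the rest (sparse activation).
        mask = torch.zeros_like(scores)
        mask.scatter_(-1, topk, 1.0)
        mask = mask.repeat_interleave(self.group_size, dim=-1)
        return self.down_proj(h * mask)
```

Note that this sketch only illustrates the sparsity pattern: it still computes every group and then masks, whereas an efficient MoE implementation would compute only the selected groups.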

Computer Science > Computation and Language
arXiv:2602.15521 (cs) [Submitted on 17 Feb 2026]
Title: ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Authors: Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng

Abstract: Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling, which uses pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architectur...
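
The abstract's central observation is that GLU's element-wise gating already produces structured, neuron-wise activation patterns. The sketch below shows one way such patterns could be collected from a SwiGLU-style layer and summarized as a co-activation matrix; the firing threshold of 0.0, the randomly initialized projections, and the co-activation statistic are assumptions for illustration, not the paper's analysis.

```python
# Illustrative sketch: record per-token GLU gate activation patterns and
# measure how often pairs of neurons fire together. Random weights and the
# 0.0 firing threshold are assumptions; real analysis would use a pretrained
# layer and real token hidden states.
import torch
import torch.nn.functional as F

d_model, d_ff, n_tokens = 64, 256, 512
gate_proj = torch.nn.Linear(d_model, d_ff, bias=False)

x = torch.randn(n_tokens, d_model)          # token hidden states
gate = F.silu(gate_proj(x))                 # GLU gate activations, (n_tokens, d_ff)
pattern = (gate > 0.0).float()              # binary per-neuron firing pattern

# Co-activation frequency: fraction of tokens on which neurons i and j both fire.
coact = pattern.T @ pattern / n_tokens      # (d_ff, d_ff)
print(coact.shape, pattern.mean().item())   # matrix size, overall firing rate
```

If such a co-activation matrix showed a block structure, that would hint at the coarse-grained grouping of neurons the paper describes as an inherent MoE.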
