[2604.02715] FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving


arXiv - Machine Learning

About this article

arXiv:2604.02715 (cs)
Submitted on 3 Apr 2026

Title: FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

Authors: Qingxiu Liu, Cyril Y. He, Hanser Jiang, Zion Wang, Alan Zhao, Patrick P. C. Lee

Abstract: Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance.

In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints.

Experimental results demonstrate that FluxMoE achieves up to 3.0$\times$ throughput gains over vLLM in memory-intens...
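
The "materialize on demand, evict immediately after use" idea from the abstract can be illustrated with a toy sketch. This is not FluxMoE's implementation and uses no vLLM or GPU APIs: the class `ExpertPager`, the function `run_layer`, and the dictionary standing in for host memory are all hypothetical names chosen for illustration, and plain Python lists stand in for expert weights.

```python
from contextlib import contextmanager


class ExpertPager:
    """Toy model of expert paging (hypothetical, not FluxMoE's API).

    Weights live in a host-side store and occupy simulated "GPU" memory
    only while an expert is in use.
    """

    def __init__(self, host_store):
        self.host_store = host_store  # expert_id -> list of floats
        self.resident = {}            # experts currently materialized
        self.peak_resident = 0        # max experts resident at once

    @contextmanager
    def materialize(self, expert_id):
        # Copy weights in on demand (stand-in for a host-to-device copy).
        self.resident[expert_id] = list(self.host_store[expert_id])
        self.peak_resident = max(self.peak_resident, len(self.resident))
        try:
            yield self.resident[expert_id]
        finally:
            # Evict right after use so the space can serve other state.
            del self.resident[expert_id]


def run_layer(pager, token, routed_experts):
    """Apply each routed expert to a scalar token and sum the outputs."""
    out = 0.0
    for expert_id in routed_experts:
        with pager.materialize(expert_id) as weights:
            out += sum(w * token for w in weights)
    return out


pager = ExpertPager({0: [1.0, 2.0], 1: [0.5], 2: [3.0, 3.0]})
result = run_layer(pager, token=2.0, routed_experts=[0, 2])
```

In this sketch at most one expert is resident at a time and none remain resident after the layer finishes, which is the property the abstract credits with freeing GPU memory for the KV cache. A real system would presumably also need to overlap weight transfers with computation to hide their latency; the truncated abstract does not say how FluxMoE handles that, so it is not modeled here.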

Originally published on April 06, 2026. Curated by AI News.
