[2505.16056] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Computer Science > Machine Learning

arXiv:2505.16056 (cs)

[Submitted on 21 May 2025 (v1), last revised 28 Feb 2026 (this version, v4)]

Title: Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei

Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) by sparsely activating experts during inference. To deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading*, which caches a subset of experts in fast memory and leaves the rest in slow memory to run on CPU or be loaded on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure the local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the best achievable hit rate of an expert cache that exploits a window of future routing information under a cache size limit. We analyze 20 MoE LLMs with div...
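To make the SRP metric concrete, the sketch below computes a plausible version of it from a routing trace: for each fixed-length segment of tokens, it selects the best group of experts for that segment and measures what fraction of the segment's activations the group covers. This is a hypothetical formulation for illustration; the function name, the trace format (a list of per-token activated-expert id lists), and the exact averaging are assumptions, not the paper's definitions.

```python
from collections import Counter

def segment_routing_best_performance(trace, segment_len, group_size):
    """Average, over segments, of the fraction of expert activations
    covered by the single best group of `group_size` experts.

    trace       -- list of per-token lists of activated expert ids
    segment_len -- number of consecutive tokens per segment
    group_size  -- number of experts the fixed group may contain

    Hypothetical sketch: the paper's exact SRP definition may differ.
    """
    scores = []
    for start in range(0, len(trace), segment_len):
        segment = trace[start:start + segment_len]
        counts = Counter(e for token in segment for e in token)
        total = sum(counts.values())
        if total == 0:
            continue
        # Best fixed group = the group_size most frequently activated
        # experts within this segment.
        covered = sum(c for _, c in counts.most_common(group_size))
        scores.append(covered / total)
    return sum(scores) / len(scores) if scores else 0.0
```

Under this formulation, a model whose consecutive tokens reuse the same experts scores near 1.0, while a model that spreads activations across many experts per segment scores lower; SCH would analogously simulate a size-limited cache with lookahead over the same trace.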