[2507.06567] SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Computer Science > Machine Learning
arXiv:2507.06567 (cs)
[Submitted on 9 Jul 2025 (v1), last revised 1 Mar 2026 (this version, v3)]

Authors: Qian Chen, Xianhao Chen, Kaibin Huang

Abstract: Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model imposes a significant storage and memory burden on edge devices. To address this challenge, we consider a scenario in which experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem that optimizes expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to monotone submodular maximization with a knapsack constraint, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, rendering greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method that decomposes the original problem into a series of subproblems...
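The $K=1$ case described in the abstract can be illustrated with a small sketch: a cost-benefit greedy that fills a storage budget with the experts giving the best marginal-utility-per-byte, which is the standard heuristic for monotone submodular maximization under a knapsack constraint (the full $(1-1/e)$ guarantee additionally requires partial enumeration, omitted here). The expert names, sizes, and routing probabilities below are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only: cost-benefit greedy for caching experts on an
# edge server under a storage budget, assuming a monotone submodular utility.
def greedy_cache(expert_sizes, utility, budget):
    """expert_sizes: dict expert -> storage cost;
    utility: callable on a set of experts (monotone submodular);
    budget: total storage capacity of the edge server."""
    cached, used = set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for expert, size in expert_sizes.items():
            if expert in cached or used + size > budget:
                continue  # already cached, or would exceed the budget
            gain = utility(cached | {expert}) - utility(cached)
            ratio = gain / size  # marginal utility per unit of storage
            if ratio > best_ratio:
                best, best_ratio = expert, ratio
        if best is None:
            break  # no feasible expert improves the utility
        cached.add(best)
        used += expert_sizes[best]
    return cached

# Hypothetical per-expert routing probabilities and storage sizes.
demand = {"e1": 0.5, "e2": 0.3, "e3": 0.2}
sizes = {"e1": 2.0, "e2": 1.0, "e3": 1.0}
# A modular (hence submodular) utility: expected fraction of requests served.
util = lambda cached: sum(demand[e] for e in cached)

print(sorted(greedy_cache(sizes, util, budget=2.0)))  # → ['e2', 'e3']
```

With a budget of 2.0, the greedy prefers `e2` and `e3` (ratios 0.3 and 0.2 per unit) over the larger `e1` (ratio 0.25), caching the pair that maximizes served demand within the budget.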