[2508.06526] PiKV: KV Cache Management System for Mixture of Experts
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2508.06526 (cs)
[Submitted on 2 Aug 2025 (v1), last revised 28 Feb 2026 (this version, v2)]

Title: PiKV: KV Cache Management System for Mixture of Experts
Authors: Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang

Abstract: As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce \textbf{PiKV}, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages \textit{expert-sharded KV storage} to partition caches across GPUs, \textit{PiKV routing} to reduce token-to-KV access, and \textit{PiKV scheduling} to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates \textit{PiKV compression} modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source software library: \href{this https URL}{this https URL}. Experiment details are recorded at: \href{this https URL}{this https URL\_Results}. We also have PiKV in...
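To make the expert-sharded storage idea concrete, here is a minimal sketch (not PiKV's actual API; class and method names are hypothetical) of a KV cache partitioned by expert across shards, so each token's keys and values are written only to the shard owning its routed expert rather than to a dense, globally synchronized cache:

```python
# Hypothetical illustration of expert-sharded KV storage.
# Assumption: experts are statically placed on shards (e.g. GPUs) round-robin;
# PiKV's real placement and routing policies may differ.
from collections import defaultdict


class ExpertShardedKVCache:
    def __init__(self, num_experts: int, num_shards: int):
        self.num_experts = num_experts
        self.num_shards = num_shards
        # shard id -> expert id -> list of (key, value) entries
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def shard_of(self, expert_id: int) -> int:
        # Static round-robin placement of experts over shards.
        return expert_id % self.num_shards

    def append(self, expert_id: int, key, value) -> None:
        # The KV pair is stored only on the shard owning this expert,
        # avoiding replication across all shards.
        self.shards[self.shard_of(expert_id)][expert_id].append((key, value))

    def lookup(self, expert_id: int):
        # A query for one expert touches a single shard,
        # reducing token-to-KV access.
        return self.shards[self.shard_of(expert_id)][expert_id]


cache = ExpertShardedKVCache(num_experts=8, num_shards=2)
cache.append(expert_id=3, key="k0", value="v0")  # placed on shard 3 % 2 == 1
assert cache.lookup(3) == [("k0", "v0")]
assert cache.shards[0][3] == []  # shard 0 holds nothing for expert 3
```

In a real multi-GPU deployment each shard would live in a separate device's memory, and the routing decision would determine which device a token's KV entries are sent to; this sketch only captures the partitioning invariant.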