[2505.13109] FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Computer Science > Machine Learning
arXiv:2505.13109 (cs)
[Submitted on 19 May 2025 (v1), last revised 28 Feb 2026 (this version, v4)]

Title: FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Authors: Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao

Abstract: Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache, whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a training-free algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency, enabling ...
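The double-buffered streamed recall mentioned in the abstract follows the classic double-buffering pattern: while the current chunk of recalled KV entries is being consumed, the next chunk is fetched into an alternate buffer so transfer latency is hidden behind computation. The sketch below illustrates only this generic pattern with CPU threads; the function names and the thread-based overlap are illustrative assumptions, not the paper's CUDA-stream implementation.

```python
import threading

def double_buffered_recall(chunks, fetch, compute):
    """Process chunks while overlapping the fetch of chunk i+1
    with the computation on chunk i, using two alternating buffers.

    chunks  : list of chunk descriptors to recall
    fetch   : callable that loads one chunk (stands in for a CPU->GPU copy)
    compute : callable that consumes one fetched chunk
    """
    buffers = [None, None]
    results = []
    # Fetch the first chunk synchronously to prime the pipeline.
    buffers[0] = fetch(chunks[0])
    for i in range(len(chunks)):
        current = buffers[i % 2]
        prefetch_thread = None
        if i + 1 < len(chunks):
            # Start loading the next chunk into the *other* buffer
            # while the current chunk is being processed below.
            def _prefetch(idx=i + 1, slot=(i + 1) % 2):
                buffers[slot] = fetch(chunks[idx])
            prefetch_thread = threading.Thread(target=_prefetch)
            prefetch_thread.start()
        results.append(compute(current))
        if prefetch_thread is not None:
            prefetch_thread.join()  # next buffer is ready for the next step
    return results
```

In a real system the `fetch` step would be an asynchronous host-to-device copy on a dedicated stream and `compute` an attention kernel on another, but the buffer-alternation logic is the same.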