[2605.07234] Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
Computer Science > Computation and Language
arXiv:2605.07234 (cs)
[Submitted on 8 May 2026]

Title: Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
Authors: Tho Mai, Joo-Young Kim

Abstract: Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on the long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5...
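The abstract's central idea is to score cached tokens by their contribution to the layer's attention output (attention weights combined with projected value states, aggregated across heads) rather than by head-wise attention averages. The sketch below illustrates that kind of output-aware, layer-wise scoring under stated assumptions; the function name `laprox_style_scores`, the tensor shapes, and the specific aggregation (attention mass summed over queries, weighted by projected-value norms, summed over heads) are illustrative choices and not the paper's exact LaProx formulation.

```python
import torch

def laprox_style_scores(attn: torch.Tensor, v: torch.Tensor, w_o: torch.Tensor) -> torch.Tensor:
    """Hypothetical output-aware, layer-wise token importance score.

    attn: [H, Q, T] attention weights per head (Q queries, T cached tokens)
    v:    [H, T, Dh] cached value states per head
    w_o:  [H * Dh, D] output projection weight of the layer

    Returns one score per cached token, comparable across heads, measuring how
    much the token contributes to the layer's attention output after value and
    output projection (instead of averaging raw attention weights per head).
    """
    H, Q, T = attn.shape
    Dh = v.shape[-1]
    # Project each head's value states through its slice of the output projection,
    # so contributions from different heads live in the same output space.
    w_o_heads = w_o.view(H, Dh, -1)                      # [H, Dh, D]
    v_proj = torch.einsum("htd,hdo->hto", v, w_o_heads)  # [H, T, D]
    # Token contribution: attention mass (summed over queries) times the norm of
    # the projected value, then summed over heads -> a single layer-wise score.
    contrib = attn.sum(dim=1) * v_proj.norm(dim=-1)      # [H, T]
    return contrib.sum(dim=0)                            # [T]

if __name__ == "__main__":
    # Toy example: keep the top-k most important cached tokens in one layer.
    H, Q, T, Dh, D = 8, 4, 128, 64, 512
    attn = torch.softmax(torch.randn(H, Q, T), dim=-1)
    v = torch.randn(H, T, Dh)
    w_o = torch.randn(H * Dh, D)
    scores = laprox_style_scores(attn, v, w_o)
    keep = scores.topk(k=32).indices  # indices of KV entries to retain
```

Because the scores are computed in the shared output space of the layer, they can be compared across heads (and, with suitable normalization, across layers), which is what enables model-wide eviction decisions rather than independent per-head ones.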