[2512.03324] Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Computer Science > Machine Learning
arXiv:2512.03324 (cs)
[Submitted on 3 Dec 2025 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying

Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation ...
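The eviction mechanism described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `BudgetedKVCache`, the exponential decay schedule, and the scalar per-token scores are all illustrative assumptions standing in for the learned, per-layer, per-head retention gates.

```python
class BudgetedKVCache:
    """Illustrative sketch (assumed design, not TRIM-KV itself): each cached
    token carries a retention score assigned at creation time; scores decay
    each step, and when the cache exceeds its budget the lowest-scoring
    token is evicted, so the cache keeps the tokens deemed most useful."""

    def __init__(self, budget: int, decay: float = 0.9):
        self.budget = budget
        self.decay = decay
        self.entries = []  # list of (token, score) pairs in arrival order

    def add(self, token, retention_score: float):
        # decay the scores of all previously cached tokens
        self.entries = [(t, s * self.decay) for t, s in self.entries]
        # insert the new token with its creation-time score
        self.entries.append((token, retention_score))
        if len(self.entries) > self.budget:
            # evict the single lowest-scoring token to stay within budget
            victim = min(range(len(self.entries)),
                         key=lambda i: self.entries[i][1])
            self.entries.pop(victim)

    def tokens(self):
        return [t for t, _ in self.entries]


# Usage: with a budget of 2, the low-score token "a" is evicted first.
cache = BudgetedKVCache(budget=2)
cache.add("a", 0.1)
cache.add("b", 0.9)
cache.add("c", 0.5)
print(cache.tokens())  # -> ['b', 'c']
```

In the actual method, the retention score is predicted by a lightweight gate trained via distillation with a capacity loss; the sketch above only captures the budget-constrained, score-based eviction policy it induces.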