[2512.03324] Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Computer Science > Machine Learning
arXiv:2512.03324 (cs)
[Submitted on 3 Dec 2025 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying

Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation ...
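The eviction mechanism described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `BudgetedKVCache`, the exponential decay schedule, and the scalar per-token scores are all illustrative assumptions standing in for the learned, per-layer, per-head retention gates.

```python
class BudgetedKVCache:
    """Illustrative sketch (assumed design, not TRIM-KV itself): each cached
    token carries a retention score assigned at creation time; scores decay
    each step, and when the cache exceeds its budget the lowest-scoring
    token is evicted, so the cache keeps the tokens deemed most useful."""

    def __init__(self, budget: int, decay: float = 0.9):
        self.budget = budget
        self.decay = decay
        self.entries = []  # list of (token, score) pairs in arrival order

    def add(self, token, retention_score: float):
        # decay the scores of all previously cached tokens
        self.entries = [(t, s * self.decay) for t, s in self.entries]
        # insert the new token with its creation-time score
        self.entries.append((token, retention_score))
        if len(self.entries) > self.budget:
            # evict the single lowest-scoring token to stay within budget
            victim = min(range(len(self.entries)),
                         key=lambda i: self.entries[i][1])
            self.entries.pop(victim)

    def tokens(self):
        return [t for t, _ in self.entries]


# Usage: with a budget of 2, the low-score token "a" is evicted first.
cache = BudgetedKVCache(budget=2)
cache.add("a", 0.1)
cache.add("b", 0.9)
cache.add("c", 0.5)
print(cache.tokens())  # -> ['b', 'c']
```

In the actual method, the retention score is predicted by a lightweight gate trained via distillation with a capacity loss; the sketch above only captures the budget-constrained, score-based eviction policy it induces.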