[2603.27819] KVSculpt: KV Cache Compression as Distillation
Computer Science > Machine Learning

arXiv:2603.27819 (cs)

[Submitted on 29 Mar 2026]

Title: KVSculpt: KV Cache Compression as Distillation

Authors: Bo Jiang, Sian Jin

Abstract: KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squ...
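The alternating scheme the abstract describes (L-BFGS on free keys, closed-form least squares on values) can be sketched as follows. This is an illustrative toy on random data, not the paper's implementation: the dimensions, probe queries `Q`, the squared-error objective, and the number of alternation rounds are all assumptions, and a real system would work per layer and per head on actual cached states.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n, r, m = 8, 32, 8, 16          # head dim, original cache len, compressed budget, probe queries

K = rng.normal(size=(n, d))        # original keys
V = rng.normal(size=(n, d))        # original values
Q = rng.normal(size=(m, d))        # probe queries used to match attention behavior (assumed)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# attention output of the full cache -- the behavior we want to preserve
target = softmax(Q @ K.T / np.sqrt(d)) @ V

def key_loss(k_flat, Vc):
    """Squared error of the compressed cache's attention output, values fixed."""
    Kc = k_flat.reshape(r, d)
    out = softmax(Q @ Kc.T / np.sqrt(d)) @ Vc
    return np.sum((out - target) ** 2)

Kc = rng.normal(size=(r, d))       # compressed keys: unconstrained free parameters
Vc = np.zeros((r, d))

for _ in range(5):                 # alternate value / key updates every few steps
    # values: closed-form least squares given current attention weights
    A = softmax(Q @ Kc.T / np.sqrt(d))
    Vc, *_ = np.linalg.lstsq(A, target, rcond=None)
    # keys: a few L-BFGS iterations with values held fixed
    res = minimize(key_loss, Kc.ravel(), args=(Vc,),
                   method="L-BFGS-B", options={"maxiter": 10})
    Kc = res.x.reshape(r, d)

err = np.linalg.norm(softmax(Q @ Kc.T / np.sqrt(d)) @ Vc - target)
print(f"reconstruction error: {err:.4f}")
```

The value step is exactly solvable because, with keys fixed, the attention output is linear in the values; the key step is the non-convex part, which is why a quasi-Newton method like L-BFGS is used there.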
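The adaptive budget allocation could look something like the sketch below: run a cheap pilot compression at a uniform budget, record each component's (layer/head) error, then redistribute the total budget toward harder components. The proportional-to-error rule and the per-component floor are assumed heuristics for illustration; the paper's exact difficulty measure and allocation rule may differ.

```python
import numpy as np

def allocate_budget(pilot_errors, total_budget, floor=4):
    """Split a total KV budget across components (e.g. layer/head pairs)
    in proportion to their pilot-run compression error, guaranteeing a
    minimum floor per component. Heuristic sketch, not the paper's rule."""
    errors = np.asarray(pilot_errors, dtype=float)
    k = len(errors)
    spare = total_budget - floor * k          # budget left after the floors
    shares = errors / errors.sum()            # harder components get more
    alloc = floor + np.floor(spare * shares).astype(int)
    # hand any rounding remainder to the hardest components first
    remainder = total_budget - alloc.sum()
    order = np.argsort(-errors)
    alloc[order[:remainder]] += 1
    return alloc

# e.g. four components whose pilot compression errors differ 4x
budgets = allocate_budget([0.5, 2.0, 1.0, 0.5], total_budget=64)
print(budgets, budgets.sum())  # hardest component receives the largest slice
```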