[2602.20732] CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Summary
The paper presents CHESS, a KV-cache management system for long-context LLM inference that raises decoding throughput and lowers latency while preserving output quality.
Why It Matters
As long-context LLMs become increasingly prevalent, optimizing their inference is crucial for practical applications. CHESS addresses two key limitations of existing KV-cache pruning methods, context-agnostic token selection and irregular memory accesses, improving both output quality and wall-clock speed, which matters for developers and researchers deploying long-context models.
Key Takeaways
- CHESS introduces a context-aware, hierarchical selection policy for KV-cache management.
- It achieves low-latency inference with up to 4.56x higher throughput than prior pruning methods.
- The system utilizes only 1% of the KV cache while surpassing Full-KV quality.
- Coarse granularity selection reduces data movement, enhancing practical acceleration.
- Extensive evaluations demonstrate CHESS's superiority over existing baselines.
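To make the coarse-granularity idea concrete, here is a minimal sketch of block-level KV selection: cached keys are grouped into contiguous blocks, each block gets one relevance score against the current query, and only the top-scoring blocks (a small fraction, e.g. 1%) are retained for attention. This is an illustrative approximation, not CHESS's actual algorithm; the function name, block size, and max-score heuristic are assumptions for the sketch.

```python
import numpy as np

def select_kv_blocks(query, keys, block_size=64, budget_frac=0.01):
    """Score contiguous KV blocks against the query and keep the top few.

    Coarse (block-level) selection keeps or drops whole blocks, so the
    surviving cache stays contiguous and cheap to gather -- the system-side
    point behind reduced data movement. Illustrative sketch only.
    """
    n, d = keys.shape
    n_blocks = (n + block_size - 1) // block_size
    # One representative score per block: the max query-key dot product inside it.
    scores = np.array([
        np.max(keys[b * block_size:(b + 1) * block_size] @ query)
        for b in range(n_blocks)
    ])
    k = max(1, int(np.ceil(budget_frac * n_blocks)))
    # Top-k block indices, restored to context order for coherent attention.
    return np.sort(np.argsort(scores)[-k:])

# Toy usage: 8192 cached tokens in blocks of 64; keep ~1% of the blocks.
rng = np.random.default_rng(0)
keys = rng.standard_normal((8192, 64))
query = rng.standard_normal(64)
blocks = select_kv_blocks(query, keys)
print(blocks)
```

Selecting per block rather than per token trades a little precision for regular, contiguous memory access, which is what turns theoretical sparsity into real speedup.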
arXiv:2602.20732 (cs) [Submitted on 24 Feb 2026]
Authors: Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis
Abstract: Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by the KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-design KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System-wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56x higher throughput, and consistently outperforms ot...