[2602.20732] CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

arXiv - AI · 3 min read · Article

Summary

The paper presents CHESS, a KV-cache management system for long-context LLM inference that improves efficiency and throughput while matching or surpassing Full-KV quality.

Why It Matters

As long-context LLMs become increasingly prevalent, optimizing their inference processes is crucial for practical applications. CHESS addresses key limitations of existing methods, offering a significant improvement in performance and efficiency, which is vital for developers and researchers in AI and machine learning.

Key Takeaways

  • CHESS introduces a context-aware, hierarchical selection policy for KV-cache management.
  • It delivers low-latency, stable inference with up to 4.56× higher throughput than existing baselines.
  • The system utilizes only 1% of the KV cache while surpassing Full-KV quality.
  • Coarse-granularity selection reduces data movement, turning theoretical sparsity into practical acceleration (see the sketch after this list).
  • Extensive evaluations demonstrate CHESS's superiority over existing baselines.
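To make the hierarchical selection idea concrete, here is a minimal sketch of query-aware, chunk-level top-k KV selection in PyTorch. This is not the authors' implementation: the chunk size, the keep ratio, and the mean-pooled chunk summaries are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_kv_chunks(q, k_cache, v_cache, chunk_size=64, keep_ratio=0.01):
    """Query-aware, chunk-level top-k KV selection (illustrative sketch).

    q:       (d,)    query vector for the current decode step
    k_cache: (T, d)  cached keys
    v_cache: (T, d)  cached values
    Returns the keys/values of the highest-scoring chunks only.
    """
    T, d = k_cache.shape
    n_chunks = (T + chunk_size - 1) // chunk_size
    pad = n_chunks * chunk_size - T
    # Pad so the cache splits evenly into chunks.
    k_pad = F.pad(k_cache, (0, 0, 0, pad))
    # Coarse chunk summary: mean-pooled key per chunk (an assumption; the
    # paper's hierarchical policy may use richer semantic summaries).
    chunk_keys = k_pad.view(n_chunks, chunk_size, d).mean(dim=1)  # (n_chunks, d)
    # Context-aware scoring: recomputed against the query at every step.
    scores = chunk_keys @ q                                       # (n_chunks,)
    k_keep = max(1, int(n_chunks * keep_ratio))
    top = torch.topk(scores, k_keep).indices
    # Expand chunk ids to contiguous token ranges; whole-chunk selection
    # keeps memory access regular instead of a scattered per-token gather.
    token_idx = (top.unsqueeze(1) * chunk_size
                 + torch.arange(chunk_size)).flatten()
    token_idx = token_idx[token_idx < T]  # drop padded positions
    return k_cache[token_idx], v_cache[token_idx]
```

Selecting whole contiguous chunks, rather than scattered individual tokens, is what avoids the irregular accesses the abstract attributes to prior pruning methods.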

Computer Science > Artificial Intelligence

arXiv:2602.20732 (cs) [Submitted on 24 Feb 2026]

Title: CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Authors: Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis

Abstract: Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by the KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-designed KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. System-wise, coarse-granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency, stable inference with up to 4.56× higher throughput, and consistently outperforms other baselines.
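A back-of-envelope calculation shows why touching only 1% of the cache matters: decoding is memory-bandwidth bound, so the bytes read per step largely determine latency. The model configuration below (32 layers, 32 attention heads, head dimension 128, fp16) is an illustrative assumption, not a configuration from the paper.

```python
# Back-of-envelope KV-cache footprint (illustrative assumptions, not from the paper).
layers, heads, head_dim, bytes_fp16 = 32, 32, 128, 2
context = 128_000  # tokens

per_token = layers * 2 * heads * head_dim * bytes_fp16  # K and V across all layers
full_gb = context * per_token / 1e9
print(f"full KV cache:  {full_gb:.1f} GB read per decoded token")  # ~67.1 GB
print(f"at 1% selected: {full_gb * 0.01:.2f} GB read per token")   # ~0.67 GB
```

Reading tens of gigabytes per decoded token saturates memory bandwidth; cutting that by 100× is the headroom behind throughput gains of the kind the paper reports.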
