[2604.08584] CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
Computer Science > Machine Learning

arXiv:2604.08584 (cs)
[Submitted on 30 Mar 2026]

Title: CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
Authors: Chuxu Song, Zhencan Peng, Jiuqi Wei, Chuanhui Yang

Abstract: Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, making attention and the KV-cache the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity due to the inherent distribution shift between queries and keys. We propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention method optimized for high-throughput serving of reusable contexts. CSAttention adopts a storage-for-computation strategy tailored to the offline-prefill/online-decode setting: it front-loads computation into a one-time offline prefill phase that is amortized across multiple queries, while aggressively optimizing per-step decoding latency. Specifically, CSAttention constructs query-centric lookup tables during offline prefill, whose size remains fixed during decoding, so that online decoding replaces full-context scans with efficient table lookups and GPU-friendly score accumulation. Extensive experiments demonstrate that CSAt...
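The abstract does not give the exact table construction, but the general centroid-scoring idea it describes can be sketched as follows: an offline step clusters the cached keys into centroids, and each decode step scores the query against centroids only, then attends within the top-scoring clusters instead of scanning the full context. This is a minimal NumPy sketch under those assumptions; the function names, the plain k-means clustering, and the cluster-selection heuristic are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def build_centroid_index(keys, n_clusters=4, n_iters=10, seed=0):
    """Offline 'prefill' step: cluster cached keys into centroids.

    Uses plain k-means (an assumption; the paper's lookup-table
    construction may differ). Returns the centroids and each key's
    cluster assignment, both reusable across many queries.
    """
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each key to its nearest centroid.
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute centroids as cluster means (skip empty clusters).
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def centroid_scored_attention(query, keys, values, centroids, assign, top_c=1):
    """Online 'decode' step: score the query against centroids only,
    then run softmax attention restricted to the top-scoring clusters.
    """
    cluster_scores = centroids @ query            # one score per centroid
    selected = np.argsort(cluster_scores)[-top_c:]  # keep top_c clusters
    idx = np.nonzero(np.isin(assign, selected))[0]  # keys in those clusters
    scores = (keys[idx] @ query) / np.sqrt(keys.shape[1])
    w = np.exp(scores - scores.max())             # numerically stable softmax
    w /= w.sum()
    return w @ values[idx]
```

The decode step touches only `top_c` centroid scores plus the keys inside the selected clusters, which is where the sparsity saving comes from; the offline clustering cost is paid once per reusable context.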