[2602.13804] Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees
Summary
The paper presents Vashista Sparse Attention, a sparse attention mechanism for long-context decoding in large language models. It restricts each query to a small candidate set of tokens, aiming for constant-time attention with minimal quality loss.
Why It Matters
This research addresses the computational inefficiencies associated with attention mechanisms in large language models, particularly for long contexts. By offering a theoretical framework and practical implementation, it provides a pathway to enhance performance in resource-constrained environments, making it relevant for developers and researchers in AI and machine learning.
Key Takeaways
- Introduces Vashista Sparse Attention for efficient long-context processing.
- Proves an exponential guarantee: under a support gap \(\Delta\), the attention mass on inactive tokens decays as \(\exp(-\Omega(\Delta/\varepsilon))\).
- Provides a practical criterion for when sparse long-context decoding is safe, and a principled knob for trading accuracy against compute.
- Offers insights into deployment in privacy-sensitive environments.
- Shows minimal quality degradation with significant speed improvements.
Computer Science > Artificial Intelligence
arXiv:2602.13804 (cs)
[Submitted on 14 Feb 2026]
Title: Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees
Authors: Vashista Nobaub
Abstract: Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap \(\Delta\) certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as \(\exp(-\Omega(\Delta/\varepsilon))\), while the error on the active face scales linearly in the temperature/regularization parameter \(\varepsilon\). This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set ...
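The exponential decay described in the abstract can be illustrated numerically with a much simpler proxy. The sketch below is not the paper's face-stability theorem or its mechanism. It assumes plain softmax attention with temperature `eps` and treats `gap` as a raw score gap between the best-scoring token and every inactive token, whereas the paper's \(\Delta\) is a support gap certified by KKT multipliers. Under that assumption each inactive token receives at most `exp(-gap / eps)` of the mass, so the inactive total is at most `(n - k) * exp(-gap / eps)`. The function names, the synthetic scores, and the values of `n`, `k`, and `gap` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores, eps):
    """Softmax of scores / eps, computed stably by subtracting the max."""
    m = max(scores)
    weights = [math.exp((s - m) / eps) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def inactive_mass(scores, active, eps):
    """Total softmax mass on indices outside the `active` set."""
    probs = softmax(scores, eps)
    return sum(p for i, p in enumerate(probs) if i not in active)


def make_scores(n, k, gap, seed=0):
    """Synthetic scores (an assumption): k active tokens at 0.0,
    the remaining n - k tokens at or below -gap."""
    rng = random.Random(seed)
    return [0.0] * k + [-gap - rng.random() for _ in range(n - k)]


n, k, gap = 4096, 8, 1.0
scores = make_scores(n, k, gap)
active = set(range(k))

for eps in (0.5, 0.2, 0.1, 0.05):
    mass = inactive_mass(scores, active, eps)
    # Elementary softmax tail bound, capped at 1 because mass is a probability.
    bound = min(1.0, (n - k) * math.exp(-gap / eps))
    print(f"eps={eps:<5} inactive mass={mass:.3e}  bound={bound:.3e}")
```

As `eps` shrinks relative to `gap`, the mass outside the active set falls off exponentially even though the context holds thousands of tokens, which is the intuition behind treating sparse decoding as safe when the gap is large enough.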