[2602.22575] S2O: Early Stopping for Sparse Attention via Online Permutation
Summary
The paper presents S2O, an approach that enables early stopping in sparse attention by loading tokens through an online permutation, improving the efficiency of long-context inference in both token loading and computation.
Why It Matters
As language models increasingly handle long sequences, optimizing attention mechanisms is crucial for improving performance and reducing computational cost. S2O addresses this by raising the achievable sparsity beyond the ceiling that coarse-grained blocks impose, which can significantly improve the efficiency of long-context inference in large language models.
Key Takeaways
- S2O improves attention mechanism efficiency by using online permutation for token loading.
- The method allows for early stopping in computations, focusing on high-priority blocks.
- S2O achieves significant reductions in mean squared error and compute density while maintaining accuracy.
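The first two takeaways can be made concrete with a toy sketch. This is not the paper's implementation: the per-token importance proxy and block size below are assumptions chosen purely to illustrate how an index map lets a kernel gather non-contiguous, high-priority tokens into one dense block.

```python
import numpy as np

# Toy sketch of index-guided, non-contiguous loading (not the paper's
# implementation; the importance proxy |q . k_i| is an assumption).
rng = np.random.default_rng(0)
seq_len, d, block = 16, 4, 4

K = rng.standard_normal((seq_len, d))  # key cache in original token order
q = rng.standard_normal(d)             # single query

# Per-token importance proxy.
importance = np.abs(K @ q)

# "Online permutation": an index map from dense block slots back to the
# original (non-contiguous) token positions, highest importance first.
index_map = np.argsort(importance)[::-1]

# Loading the top block through the map gathers tokens that are scattered
# in the original order into one dense, high-priority block.
top_block = K[index_map[:block]]
```

The point of the permutation is that the attention kernel still reads one contiguous block of memory slots, while the index map decides which original token positions fill those slots.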
Abstract
Computer Science > Machine Learning, arXiv:2602.22575 (cs). Submitted on 26 Feb 2026.
Authors: Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang
Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current bloc...
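The early-stopping rule described in the abstract can be sketched as an online-softmax loop that visits key/value blocks from high to low estimated importance and halts once the remaining blocks cannot meaningfully change the output. Since the abstract is truncated, the exact stopping criterion and the threshold `tau` below are assumptions for illustration, not the paper's rule.

```python
import numpy as np

# Hedged sketch: importance-ordered block processing with early stopping.
rng = np.random.default_rng(1)
seq_len, d, block = 32, 8, 8
Q = rng.standard_normal(d)
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))
K[:block] += 3.0 * Q  # make one block dominant so stopping can fire

scores = K @ Q / np.sqrt(d)
n_blocks = seq_len // block
block_max = scores.reshape(n_blocks, block).max(axis=1)
order = np.argsort(block_max)[::-1]  # visit blocks high to low importance

num = np.zeros(d)  # running softmax numerator (sum of exp(score) * v)
den = 0.0          # running softmax denominator
m = -np.inf        # running max score, for numerical stability
tau = 1e-2         # stopping threshold (hypothetical hyperparameter)
processed = 0

for step, b in enumerate(order):
    s = scores[b * block:(b + 1) * block]
    new_m = max(m, s.max())
    scale = np.exp(m - new_m)  # rescale old accumulators; exp(-inf) == 0.0
    num = num * scale + np.exp(s - new_m) @ V[b * block:(b + 1) * block]
    den = den * scale + np.exp(s - new_m).sum()
    m = new_m
    processed = step + 1
    rest = order[step + 1:]
    if rest.size == 0:
        break
    # Upper bound on the softmax mass the remaining blocks could still add.
    bound = rest.size * block * np.exp(block_max[rest].max() - m)
    if bound / den < tau:
        break  # remaining blocks cannot change the output meaningfully

out = num / den
```

Because blocks are visited in decreasing importance, the upper bound on the unprocessed softmax mass shrinks quickly, so the loop can skip low-priority blocks while the result stays close to full attention.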