[2602.12675] SLA2: Sparse-Linear Attention with Learnable Routing and QAT
Summary
The paper presents SLA2, a Sparse-Linear Attention method that improves the efficiency of video generation by introducing a learnable routing mechanism and quantization-aware fine-tuning (QAT).
Why It Matters
SLA2 addresses limitations of the original Sparse-Linear Attention (SLA) method, improving computational efficiency while maintaining high-quality outputs in video diffusion tasks. This advance matters to researchers and practitioners in machine learning and AI who need to optimize model performance while reducing resource consumption.
Key Takeaways
- SLA2 introduces a learnable router that dynamically selects whether each attention computation uses the sparse or the linear branch.
- The model achieves 97% attention sparsity with an 18.6x speedup in attention computation.
- Quantization-aware fine-tuning is employed to minimize quantization errors.
- SLA2 enhances performance in video generation tasks compared to previous models.
- The proposed method offers a more direct formulation for combining sparse and linear attention.
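The direct sparse-linear combination in the last takeaway can be sketched as follows. This is an illustrative toy, not the paper's exact formulation: the masked-softmax sparse branch, the `elu(x)+1` feature map for the linear branch, and the scalar mixing ratio `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_linear_attention(Q, K, V, mask, alpha):
    """Toy sketch: mix a sparse branch (softmax attention restricted to a
    boolean mask, standing in for the router's kept blocks) with a linear
    branch (kernel feature-map attention) via a learnable ratio alpha.
    All details here are illustrative, not the paper's formulation."""
    d = Q.shape[-1]
    # Sparse branch: softmax attention over the positions the mask keeps.
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    sparse_out = softmax(scores, axis=-1) @ V
    # Linear branch: phi(Q) (phi(K)^T V), with phi = elu(x) + 1,
    # which avoids the n x n attention matrix (O(n d^2) cost).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    kv = phi(K).T @ V                       # (d, d)
    z = phi(Q) @ phi(K).sum(axis=0)         # (n,) normalizer
    linear_out = (phi(Q) @ kv) / z[:, None]
    # Learnable ratio combines the two branches directly.
    return alpha * sparse_out + (1.0 - alpha) * linear_out
```

In the full method, `alpha` and the routing decisions are learned during fine-tuning rather than fixed by hand as they are here.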
Computer Science > Machine Learning
arXiv:2602.12675 (cs) [Submitted on 13 Feb 2026]
Title: SLA2: Sparse-Linear Attention with Learnable Routing and QAT
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
Abstract: Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x att...
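The quantization-aware fine-tuning in point (III) rests on a standard QAT building block: "fake" quantization, which quantizes and immediately dequantizes a tensor so that the model sees low-bit rounding error during fine-tuning and learns to compensate for it. A generic per-tensor symmetric fake-quant sketch (not the paper's specific low-bit attention scheme):

```python
import numpy as np

def fake_quant(x, n_bits=4):
    """Symmetric per-tensor fake quantization: round x to an n_bits signed
    grid, then map back to floats. Downstream computation runs in float but
    carries the quantization error, which is what QAT trains against.
    Generic QAT sketch; the paper's exact low-bit scheme may differ."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    if scale == 0:
        return x  # all-zero tensor quantizes to itself
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```

During backpropagation such a rounding step is typically paired with a straight-through estimator so gradients flow through it as if it were the identity.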