[2603.02170] SageBwd: A Trainable Low-bit Attention

[2603.02170] SageBwd: A Trainable Low-bit Attention

arXiv - Machine Learning 3 min read

About this article

Abstract page for arXiv paper 2603.02170: SageBwd: A Trainable Low-bit Attention

Computer Science > Machine Learning arXiv:2603.02170 (cs) [Submitted on 2 Mar 2026] Title:SageBwd: A Trainable Low-bit Attention Authors:Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu View a PDF of the paper titled SageBwd: A Trainable Low-bit Attention, by Jintao Zhang and 7 other authors View PDF HTML (experimental) Abstract:Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd matches full-precision attention during pretraining. Through experiments and theoretical analysis, we reach a few important insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training. Subjects: Machine Learning (cs.LG); Artificial Intelligence (...

Originally published on March 03, 2026. Curated by AI News.

Related Articles

Machine Learning

[D] When to transition from simple heuristics to ML models (e.g., DensityFunction)?

Two questions: What are the recommendations around when to transition from a simple heuristic baseline to machine learning ML models for ...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] ICML 2026 Average Score

Hi all, I’m curious about the current review dynamics for ICML 2026, especially after the rebuttal phase. For those who are reviewers (or...

Reddit - Machine Learning · 1 min ·
Machine Learning

[R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)

We present VOID, a model for video object removal that aims to handle *physical interactions*, not just appearance. Most existing video i...

Reddit - Machine Learning · 1 min ·
Machine Learning

FLUX 2 Pro (2026) Sketch to Image

I sketched a cow and tested how different models interpret it into a realistic image for downstream 3D generation, turns out some models ...

Reddit - Artificial Intelligence · 1 min ·
More in Machine Learning: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime