[2603.02170] SageBwd: A Trainable Low-bit Attention
Computer Science > Machine Learning
arXiv:2603.02170 (cs)
[Submitted on 2 Mar 2026]

Title: SageBwd: A Trainable Low-bit Attention
Authors: Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu

Abstract: Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match full-precision attention during pre-training. Through experiments and theoretical analysis, we reach several important insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (…)
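The abstract highlights K-smoothing and INT8 quantization of the attention matrix multiplications. The PyTorch sketch below is only an illustration of those two ideas, not the authors' implementation: the function names, the block size of 128, and the symmetric per-block quantization scheme are assumptions made here for concreteness. It shows why subtracting the per-head token mean from K leaves softmax(QKᵀ) unchanged while shrinking the quantization range.

```python
# Minimal sketch (assumed, not the paper's code): K-smoothing plus symmetric
# per-block INT8 quantization of K for the score matmul S = Q K^T.
import torch

def smooth_k(k: torch.Tensor) -> torch.Tensor:
    """Subtract the per-head mean over tokens from K.

    S = Q (K - mean_K)^T differs from Q K^T only by a constant per row of S
    (q_i . mean_K), and softmax is invariant to per-row constant shifts, so
    smoothing changes the quantization range but not the attention output.
    """
    return k - k.mean(dim=-2, keepdim=True)  # k: (batch, heads, seq, dim)

def int8_quantize_per_block(x: torch.Tensor, block: int = 128):
    """Symmetric INT8 quantization with one scale per block of tokens (assumed granularity)."""
    b, h, s, d = x.shape
    xb = x.reshape(b, h, s // block, block, d)
    scale = xb.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((xb / scale).round(), -127, 127).to(torch.int8)
    return q.reshape(b, h, s, d), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    b, h, s, d = q.shape
    xb = q.reshape(b, h, s // block, block, d).float() * scale
    return xb.reshape(b, h, s, d)

# Toy check: smoothing K leaves full-precision attention probabilities unchanged
# (up to float rounding), while typically reducing INT8 quantization error when
# K has a large channel-wise mean.
torch.manual_seed(0)
q_ = torch.randn(1, 2, 256, 64)
k_ = torch.randn(1, 2, 256, 64) + 3.0  # pronounced mean to make smoothing matter
p_ref = torch.softmax(q_ @ k_.transpose(-1, -2) / 8.0, dim=-1)
p_smooth = torch.softmax(q_ @ smooth_k(k_).transpose(-1, -2) / 8.0, dim=-1)
print((p_ref - p_smooth).abs().max())  # tiny: softmax ignores the per-row shift

kq, ks = int8_quantize_per_block(k_)
kq_s, ks_s = int8_quantize_per_block(smooth_k(k_))
print((dequantize(kq, ks) - k_).abs().mean(),
      (dequantize(kq_s, ks_s) - smooth_k(k_)).abs().mean())  # smoothed K quantizes more tightly
```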