[2602.18851] Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
Summary
This paper presents a novel approach to stabilizing low-precision training in transformer models by deriving rank-aware spectral bounds on attention logits, yielding principled overflow guarantees during training.
Why It Matters
As low-precision training becomes increasingly important for training large models efficiently, understanding and mitigating overflow risk is crucial. This research provides a significant advancement in ensuring training stability without sacrificing performance, particularly for large-scale transformer architectures.
Key Takeaways
- Introduces rank-aware concentration inequalities for attention logits.
- Demonstrates 8-28x tighter concentration bounds than rank-agnostic methods.
- Presents geometry-aware scale factors for overflow guarantees.
- Compatible with existing transformer architectures and training methods.
- Achieves comparable accuracy on downstream tasks while eliminating overflows.
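One way to see where the "8-28x tighter" figure comes from is to compare the two tail exponents quoted in the abstract: rank-agnostic exp(-d·α²) versus rank-aware exp(-d²·α²/(γr)). Their ratio simplifies to d/(γr), independent of α. The sketch below evaluates that ratio for two architectures the paper mentions; note that γ is a typicality parameter (γ > 1) whose actual value is not given here, so γ = 2.0 is an assumed placeholder, not a number from the paper, and the exact factors the paper reports will differ.

```python
def exponent_ratio(d: int, r: int, gamma: float = 2.0) -> float:
    """Ratio of rank-aware to rank-agnostic tail exponents (alpha cancels out)."""
    rank_aware = d * d / (gamma * r)   # exponent of the rank-aware bound, per alpha^2
    rank_agnostic = d                  # exponent of the rank-agnostic bound, per alpha^2
    return rank_aware / rank_agnostic  # simplifies to d / (gamma * r)

# Two architectures named in the abstract (model dim d, rank r = head dim d_h):
print(exponent_ratio(d=1600, r=64))    # GPT-2 XL:    12.5
print(exponent_ratio(d=8192, r=128))   # Llama-2-70B: 32.0
```

Both values land in the same regime as the paper's 8-28x range; a larger γ shrinks the factor proportionally.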
Computer Science > Machine Learning — arXiv:2602.18851 (cs)
[Submitted on 21 Feb 2026]
Title: Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training
Authors: Seyed Morteza Emadi
Abstract: Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}\alpha^{2}/(\gamma r))$ rather than $\exp(-d\alpha^{2})$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fail...
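The abstract says per-layer scales are computed from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, i.e. without ever materializing the $d \times d$ product. Below is a minimal sketch of that idea. The `fp8_scale` helper is hypothetical: it assumes activation norms are bounded by some $R$, and uses the FP8 E4M3 maximum of 448; the paper's actual scale recipe and margin policy are not specified in this excerpt.

```python
import numpy as np

def spectral_norm_implicit(Wq: np.ndarray, Wk: np.ndarray,
                           iters: int = 100, seed: int = 0) -> float:
    """Estimate ||Wq @ Wk.T||_2 without forming the d x d matrix M = Wq @ Wk.T.

    Power iteration on M^T M, where every matvec factors through the thin
    d x d_h weights, so each step costs O(d * d_h) instead of O(d^2).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(Wq.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = Wq @ (Wk.T @ v)          # u = M v, computed implicitly
        w = Wk @ (Wq.T @ u)          # w = M^T M v, computed implicitly
        v = w / np.linalg.norm(w)
    return float(np.linalg.norm(Wq @ (Wk.T @ v)))  # ~ sigma_max(M)

def fp8_scale(sigma: float, d_h: int, R: float, fp8_max: float = 448.0) -> float:
    """Hypothetical per-layer scale: if ||x_i|| <= R, then |S_ij| <= sigma R^2 / sqrt(d_h);
    pick the scale that maps this worst-case logit to the FP8 E4M3 max."""
    logit_bound = sigma * R * R / np.sqrt(d_h)
    return fp8_max / logit_bound

# Toy per-head weights (d = 512 model dim, d_h = 64 head dim).
d, d_h = 512, 64
rng = np.random.default_rng(1)
Wq = rng.standard_normal((d, d_h)) / np.sqrt(d)
Wk = rng.standard_normal((d, d_h)) / np.sqrt(d)
sigma = spectral_norm_implicit(Wq, Wk)
print(sigma, fp8_scale(sigma, d_h, R=np.sqrt(d)))
```

Because the bound is computed from weights alone, the scale can be set before any activations are observed, which matches the abstract's "without observing activations" claim.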