[2602.18849] Exact Attention Sensitivity and the Geometry of Transformer Stability
Summary
This article summarizes a stability theory for transformers that explains, from first principles, why pre-LayerNorm architectures train stably, why DeepNorm scales by N^{-1/4}, and why learning-rate warmup is necessary.
Why It Matters
Understanding transformer stability is crucial for training large models efficiently and robustly. By identifying the architectural mechanisms behind stability, the theory offers principled guidance for design choices such as LayerNorm placement, residual scaling, and warmup schedules.
Key Takeaways
- Introduces a first-principles stability theory for transformer training dynamics.
- Shows that stability comes from architectural gradient flow (e.g., pre-LN's identity gradient paths), not from attention sharpening: the sensitivity factor θ(p) ≈ 1 persists throughout training.
- Validates the theory on 774M-parameter models, clarifying how pre- vs. post-LayerNorm placement shapes gradient propagation.
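The pre-LN vs. post-LN contrast in the takeaways can be sketched numerically. This is a minimal illustration, not the paper's construction: the sublayer is a stand-in linear map, and the block forms are the standard ones, x + f(LN(x)) for pre-LN and LN(x + f(x)) for post-LN. With the sublayer zeroed out, a finite-difference Jacobian shows that the pre-LN block is exactly the identity (the skip path survives), while the post-LN block still multiplies in a LayerNorm Jacobian:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-token LayerNorm (no learned affine, for simplicity)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def pre_ln_block(x, W):
    return x + W @ layer_norm(x)      # residual identity path survives

def post_ln_block(x, W):
    return layer_norm(x + W @ x)      # output passes through LN's Jacobian

def num_jacobian(g, x, h=1e-5):
    # central finite differences, column by column
    J = np.zeros((len(x), len(x)))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        J[:, i] = (g(x + e) - g(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(0)
d = 8
x0 = rng.standard_normal(d)
W0 = np.zeros((d, d))                 # zero sublayer isolates the skip path

J_pre = num_jacobian(lambda x: pre_ln_block(x, W0), x0)
J_post = num_jacobian(lambda x: post_ln_block(x, W0), x0)
print(np.linalg.norm(J_pre - np.eye(d)))   # ~0: exact identity gradient path
print(np.linalg.norm(J_post - np.eye(d)))  # nonzero: LN Jacobian intervenes
```

Stacking many such blocks is what the paper's depth analysis concerns: the pre-LN identity term persists through composition, whereas every post-LN block contributes another LayerNorm Jacobian factor.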
Computer Science > Machine Learning
arXiv:2602.18849 (cs)
[Submitted on 21 Feb 2026]
Title: Exact Attention Sensitivity and the Geometry of Transformer Stability
Authors: Seyed Morteza Emadi
Abstract: Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the \emph{exact} operator norm of the softmax Jacobian, $\|J_{softmax}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We validate our theory on 774M-parameter models and find that, contrary to the intuition that attention sharpens during training to reduce sensitivity, $\theta(p) \approx 1$ persists throughout. Transformer stability arises entirely from architectural grad...
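The exact-norm claim $\|J_{softmax}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$ can be checked by brute force for small dimension. The summary does not give a closed form for the balanced-mass factor, so `theta_candidate` below is a hypothetical form, max over token subsets S of 4 p(S)(1 - p(S)), chosen to lie in the stated range [0, 1]; under that assumption the brute-force operator norm and θ(p)/τ coincide numerically:

```python
import numpy as np
from itertools import product

def softmax(u, tau=1.0):
    # numerically stable softmax with temperature tau
    z = np.exp((u - u.max()) / tau)
    return z / z.sum()

def op_norm_inf_to_1(p, tau):
    # J = (diag(p) - p p^T)/tau is the Jacobian of softmax(u/tau);
    # the max of ||J x||_1 over ||x||_inf <= 1 is attained at x in {-1,+1}^n
    J = (np.diag(p) - np.outer(p, p)) / tau
    return max(np.abs(J @ np.array(x)).sum()
               for x in product([-1.0, 1.0], repeat=len(p)))

def theta_candidate(p):
    # assumed balanced-mass factor: max_S 4 p(S)(1 - p(S)); equals 1 exactly
    # when some subset of tokens carries half the attention mass
    return max(4 * m * (1 - m)
               for mask in product([0, 1], repeat=len(p))
               for m in [p[np.array(mask, dtype=bool)].sum()])

tau = 0.7
p = softmax(np.array([0.3, -0.1, 0.5, 0.2]), tau)
print(op_norm_inf_to_1(p, tau), theta_candidate(p) / tau)  # the two agree
```

For near-uniform attention weights, some subset carries roughly half the mass, so the candidate factor sits near 1; this matches the paper's empirical finding that θ(p) ≈ 1 persists during training.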