[2602.23057] Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Summary
The paper introduces Affine-Scaled Attention, a modification of Transformer attention that applies an input-dependent scale and bias to softmax-normalized attention weights, improving training stability and task performance in large-scale language models.
Why It Matters
This research addresses limitations in traditional Transformer attention mechanisms, which can hinder model performance. By proposing a method that allows for more controlled attention scaling, it opens pathways for more robust AI models, particularly in natural language processing tasks.
Key Takeaways
- Affine-Scaled Attention introduces input-dependent scaling to Transformer models.
- This method relaxes the strict unit-sum normalization constraint, giving the model finer control over attention magnitudes (a hedged formulation appears after this list).
- Empirical evaluations show improvements in training stability and task performance.
- The approach offers a practical solution for optimizing attention behavior in AI models.
- Even a modest reweighting of attention outputs can meaningfully affect training stability and downstream task performance.
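As a rough illustration of the mechanism the takeaways describe, the softmax attention weights can be rescaled and shifted by input-dependent terms before the values are aggregated. The scale s(x) and bias b(x) below are assumed forms for illustration; the paper's exact parameterization may differ:

```latex
% Hedged formulation: s(x) and b(x) denote assumed input-dependent
% scale and bias terms applied to the softmax-normalized weights.
\[
\mathrm{AffineAttn}(Q, K, V) =
\Bigl( s(x) \odot \mathrm{softmax}\!\Bigl(\tfrac{Q K^{\top}}{\sqrt{d_k}}\Bigr) + b(x) \Bigr) V
\]
```

Because the bias adds mass outside the softmax simplex, each query's effective attention weights no longer need to sum to one, which is the relaxation of the unit-sum constraint referred to above.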
Computer Science > Computation and Language
arXiv:2602.23057 (cs) [Submitted on 26 Feb 2026]
Title: Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Authors: Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee
Abstract: Transformer attention is typically implemented with softmax normalization, which constrains attention weights to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language models...
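Below is a minimal PyTorch-style sketch of how such an affine-scaled attention head might look, assuming the per-head scale and bias are produced by small linear projections of the input (the names `scale_proj` and `bias_proj`, the softplus on the scale, and the per-head granularity are illustrative choices, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineScaledAttention(nn.Module):
    """Sketch: softmax attention weights are rescaled and shifted by
    input-dependent scale/bias terms before aggregating the values.
    The exact parameterization in the paper may differ."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Assumed input-dependent affine parameters: one scale and one
        # bias per head and per query position, derived from the input.
        self.scale_proj = nn.Linear(d_model, n_heads)
        self.bias_proj = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, d_model) -> (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Standard softmax attention weights (each row sums to one).
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)

        # Input-dependent affine terms, shape (B, n_heads, T, 1),
        # broadcast over the key dimension. softplus keeps the scale
        # positive; the bias lets row sums deviate from one.
        scale = F.softplus(self.scale_proj(x)).transpose(1, 2).unsqueeze(-1)
        bias = self.bias_proj(x).transpose(1, 2).unsqueeze(-1)
        attn = scale * attn + bias

        # Aggregate the value vectors with the reweighted attention.
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
```

For a quick shape check, `AffineScaledAttention(d_model=256, n_heads=8)(torch.randn(2, 16, 256))` returns a tensor of shape `(2, 16, 256)`, matching standard multi-head attention while allowing the per-row attention mass to deviate from one.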