[2509.21936] Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
Summary
This article explores the statistical advantages of the softmax attention mechanism in large language models, focusing on the single-location regression task, where softmax attention provably outperforms linear attention.
Why It Matters
Understanding the effectiveness of softmax attention is crucial for advancing machine learning models, particularly in natural language processing. This research provides insights into why softmax is preferred, helping researchers and practitioners optimize model performance and generalization.
Key Takeaways
- Softmax attention achieves the Bayes risk in the high-dimensional (population) limit, whereas linear attention fundamentally falls short.
- The study identifies which properties of the activation function are necessary for optimal performance.
- Softmax remains effective in the finite-sample regime, consistently outperforming linear alternatives.
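The contrast in the takeaways above can be sketched with a toy comparison of the two activations. This is illustrative only; `attention_weights` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def attention_weights(scores, activation="softmax"):
    """Turn raw query-key scores into attention weights.

    'softmax' normalizes the scores into a probability distribution
    over tokens; 'linear' uses the raw scores directly, as in the
    linearized attention that much prior theory analyzes.
    """
    if activation == "softmax":
        e = np.exp(scores - scores.max())  # stabilized softmax
        return e / e.sum()
    return scores  # linear attention: no normalization

scores = np.array([0.2, 3.0, -0.5, 0.1])
w_soft = attention_weights(scores, "softmax")
w_lin = attention_weights(scores, "linear")
print(w_soft.round(3))  # concentrates mass on the largest score
print(w_lin)            # unnormalized, can be negative
```

The normalization is the key qualitative difference: softmax can sharply select a single token (useful when only one token carries the signal), while linear attention mixes all tokens with unbounded, possibly negative weights.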
Computer Science > Machine Learning
arXiv:2509.21936 (cs)
[Submitted on 26 Sep 2025 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová
Abstract: Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the te...
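As a rough illustration of the single-location regression task described in the abstract, the sketch below generates synthetic data where the label depends on a single token at a random location. The planted cue direction `k`, read-out direction `v`, and the scalings are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, n = 32, 10, 5  # token dimension, sequence length, number of samples

# Hypothetical planted directions (illustrative, not the paper's definitions):
k = rng.standard_normal(d) / np.sqrt(d)  # cue marking the relevant token
v = rng.standard_normal(d) / np.sqrt(d)  # read-out direction for the label

X = rng.standard_normal((n, L, d))       # noise tokens
loc = rng.integers(L, size=n)            # random relevant location per sample
for i in range(n):
    X[i, loc[i]] += k                    # plant the cue at that token
# Label is a linear function of the single relevant token only:
y = np.array([X[i, loc[i]] @ v for i in range(n)])
```

An attention-based predictor must both locate the cued token (via the query-key scores) and read out the linear functional of it; the paper's result is that softmax normalization lets attention do this optimally in the high-dimensional limit, while linear attention cannot.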