[2506.18656] On the Interpolation Error of Nonlinear Attention versus Linear Regression
Summary
This paper analyzes the interpolation error of nonlinear attention mechanisms compared to linear regression, showing how their relative performance depends on the structure of the input data.
Why It Matters
Understanding the interpolation error in nonlinear attention is crucial for improving machine learning models, particularly in high-dimensional settings. This research provides theoretical insights that can guide the development of more efficient algorithms, especially as data complexity increases.
Key Takeaways
- Nonlinear attention generally incurs a higher interpolation error than linear regression on random inputs.
- The interpolation error gap can disappear or reverse when input data contains structured signals.
- The theoretical predictions are corroborated by numerical experiments.
Abstract
arXiv:2506.18656 [stat.ML]. Submitted 23 Jun 2025 (v1); last revised 26 Feb 2026 (v2).
Authors: Zhenyu Liao, Jiaqing Liu, TianQi Hou, Difan Zou, Zenan Ling
Attention has become the core building block of modern machine learning (ML) by efficiently capturing the long-range dependencies among input tokens. Its inherently parallelizable structure allows for efficient performance scaling with the rapidly increasing size of both data and model parameters. Despite its central role, the theoretical understanding of Attention, especially in the nonlinear setting, is progressing at a more modest pace. This paper provides a precise characterization of the interpolation error for a nonlinear Attention, in the high-dimensional regime where the number of input tokens $n$ and the embedding dimension $p$ are both large and comparable. Under a signal-plus-noise data model and for fixed Attention weights, we derive explicit (limiting) expressions for the mean-squared interpolation error. Leveraging recent advances in random matrix theory, we show that nonlinear Attention generally incurs a larger interpolation error than linear regression on random inputs. However, this gap vanishes, and can even be reversed, w...
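The comparison in the abstract can be illustrated numerically. The sketch below is a minimal, hypothetical setup, not the paper's exact model: the dimensions `n` and `p`, the fixed random weight matrix `W`, the $1/\sqrt{p}$ scalings, and the softmax-attention feature map are all illustrative assumptions. It measures the mean-squared residual of a least-squares fit (a simple proxy for interpolation error) using raw linear features versus nonlinear softmax-attention features with fixed weights, on random Gaussian inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50  # illustrative token count and embedding dimension (n > p)

# Random inputs and random targets; scalings are illustrative choices.
X = rng.standard_normal((n, p)) / np.sqrt(p)
y = rng.standard_normal(n)

def softmax(Z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def interpolation_error(F, y):
    """Mean-squared residual of the best least-squares fit y ~ F @ a."""
    a, *_ = np.linalg.lstsq(F, y, rcond=None)
    return float(np.mean((F @ a - y) ** 2))

# Fixed (untrained) attention weights, as in the paper's fixed-weight setting.
W = rng.standard_normal((p, p)) / np.sqrt(p)
# Nonlinear attention features: softmax similarity matrix applied to X.
A = softmax(X @ W @ X.T / np.sqrt(p)) @ X

err_linear = interpolation_error(X, y)  # linear regression features
err_attn = interpolation_error(A, y)    # nonlinear attention features
print(f"linear regression  : {err_linear:.4f}")
print(f"nonlinear attention: {err_attn:.4f}")
```

Comparing the two printed errors over many random draws (and under a signal-plus-noise model for `X` instead of pure noise) is the kind of experiment the paper's limiting expressions characterize exactly.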