[2602.20555] Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,\lambda}$ Targets
Summary
This paper proves that standard Transformers attain the minimax optimal rate in nonparametric regression when the target function belongs to the Hölder class $C^{s,\lambda}$, building on a new result showing that such functions can be approximated by standard Transformers to arbitrary precision.
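For reference, the classical minimax rate for estimating a function of Hölder smoothness $\beta = s + \lambda$ in ambient dimension $D$ from $N$ noisy samples is given below; this is a standard fact from nonparametric statistics, and the paper's exact dimension convention for inputs in $[0,1]^{d\times n}$ may differ.
\[
\inf_{\hat f_N}\ \sup_{f \in C^{s,\lambda}} \mathbb{E}\,\bigl\|\hat f_N - f\bigr\|_{L^2}^{2} \;\asymp\; N^{-\frac{2\beta}{2\beta + D}}, \qquad \beta = s + \lambda.
\]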
Why It Matters
Transformers underpin large language models and modern computer vision systems, yet their statistical properties are not fully understood. This result shows that the standard architecture, without modification, is statistically optimal for a classical family of regression problems, which gives a rigorous basis for treating Transformers as general-purpose function approximators.
Key Takeaways
- Standard Transformers can approximate Hölder functions in $C^{s,\lambda}$ to arbitrary precision under the $L^t$ distance (the class is recalled after this list).
- They achieve the minimax optimal rate in nonparametric regression.
- The study introduces two metrics, the size tuple and the dimension vector, for a fine-grained characterization of Transformer structures.
- Upper bounds for the Lipschitz constant and memorization capacity of Transformers are derived.
- These findings provide a theoretical basis for the performance of Transformer models.
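For concreteness, one standard way to define the Hölder class appearing in the title is the following; the paper's precise norm and constant conventions may differ slightly.
\[
C^{s,\lambda}(\Omega) = \Bigl\{ f : \Omega \to \mathbb{R} \;\Big|\; \partial^{\alpha} f \text{ is continuous for all } |\alpha| \le s,\ \ \sup_{x \neq y} \frac{|\partial^{\alpha} f(x) - \partial^{\alpha} f(y)|}{\|x - y\|^{\lambda}} < \infty \text{ for } |\alpha| = s \Bigr\},
\]
with $\Omega = [0,1]^{d \times n}$, $s \in \mathbb{N}_{\geq 0}$, and $0 < \lambda \leq 1$.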
arXiv:2602.20555 [stat.ML] (Submitted on 24 Feb 2026)
Authors: Yanming Lai, Defeng Sun
Abstract
The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first work proving that standard Transformers can approximate Hölder functions $C^{s,\lambda}\left([0,1]^{d\times n}\right)$ ($s\in\mathbb{N}_{\geq 0}$, $0<\lambda\leq 1$) under the $L^t$ distance ($t \in [1, \infty]$) with arbitrary precision. Building upon this approximation result, we demonstrate that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. It is worth mentioning that, by introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive upper bounds for the Lipschitz constant of standard Transformers and their memorization capacity, which may be of independent interest.
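To make the object of study concrete, below is a minimal PyTorch sketch of the kind of encoder block that "standard Transformer" usually refers to: multi-head self-attention followed by a position-wise feed-forward network, with residual connections and layer normalization. The class name, hyperparameters, and normalization placement are illustrative assumptions; the paper's exact architecture and its size tuple / dimension vector bookkeeping are not reproduced here.

```python
# Minimal sketch of a "standard" Transformer encoder block.
# Assumed/illustrative: d_model, n_heads, d_ff values and post-norm placement.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n tokens, d_model), i.e. an embedded input in R^{d x n}
        a, _ = self.attn(x, x, x)        # multi-head self-attention
        x = self.norm1(x + a)            # residual connection + layer norm
        x = self.norm2(x + self.ff(x))   # position-wise feed-forward + layer norm
        return x


if __name__ == "__main__":
    block = TransformerBlock()
    tokens = torch.randn(2, 10, 64)      # batch of 2 sequences, 10 tokens each
    print(block(tokens).shape)           # torch.Size([2, 10, 64])
```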