[2512.10656] Token Sample Complexity of Attention
Computer Science > Machine Learning
arXiv:2512.10656 (cs)
[Submitted on 11 Dec 2025 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: Token Sample Complexity of Attention
Authors: Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard

Abstract: As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token-sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We establish finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^{\beta}$ with $\beta<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $\beta$ depends on the attention g...
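The $C(R)/\sqrt{n}$ rate for the attention map can be checked numerically. The sketch below (an illustration of the setup, not the paper's experiments; the matrices `Q`, `K`, `V`, the Gaussian token distribution, and the large-sample reference are all assumptions) evaluates finite-$n$ attention at a fixed query point against a large-$n$ proxy for the infinite-token limit, and shows the error shrinking as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# assumed query/key/value matrices (illustrative, not from the paper)
Q = rng.standard_normal((d, d)) / np.sqrt(d)
K = rng.standard_normal((d, d)) / np.sqrt(d)
V = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)  # fixed query token

def attn(x, Y):
    """Finite-n attention map at x: softmax-weighted average of value vectors."""
    logits = (Q @ x) @ (K @ Y.T)        # shape (n,)
    w = np.exp(logits - logits.max())   # stabilized softmax weights
    w /= w.sum()
    return (V @ Y.T) @ w                # shape (d,)

# proxy for the infinite-token limit: attention on a very large sample
Y_ref = rng.standard_normal((200_000, d))
f_inf = attn(x, Y_ref)

# mean error over repeated draws of n tokens, for increasing n
errors = []
for n in [100, 1_000, 10_000]:
    errs = [np.linalg.norm(attn(x, rng.standard_normal((n, d))) - f_inf)
            for _ in range(20)]
    errors.append(float(np.mean(errs)))

print(errors)  # roughly decaying like 1/sqrt(n)
```

Averaging over repeated draws makes the Monte Carlo decay visible; a single draw at each $n$ would be too noisy to exhibit the trend cleanly.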