[2604.03190] Gradient Boosting within a Single Attention Layer
Computer Science > Machine Learning
arXiv:2604.03190 (cs)
[Submitted on 3 Apr 2026]

Title: Gradient Boosting within a Single Attention Layer
Authors: Saleh Sargolzaei

Abstract: Transformer attention computes a single softmax-weighted average over values, a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of $67.9$ compared to $72.2$ for standard attention, $69.6$ for Twicing Attention...
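
The two-pass construction described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the abstract does not specify how the prediction error is formed or which inputs feed the second pass, so the sketch assumes a single unmasked head, takes the residual to be the layer input minus the first-pass output (the negative gradient of a squared reconstruction loss), and lets the second pass draw its queries, keys, and values from that residual. The names `attention`, `boosted_attention`, `p1`, `p2`, and `gate_logits` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head softmax attention without masking."""
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V


def boosted_attention(X, p1, p2, gate_logits):
    """Two attention passes; the second corrects the first.

    p1 and p2 are separate (Wq, Wk, Wv) triples, each of shape (d, d).
    gate_logits has shape (d,) and yields a per-dimension gate.
    """
    # First pass: the ordinary softmax-weighted average over values.
    y1 = attention(X, X, *p1)

    # ASSUMPTION: the prediction error is X - y1, i.e. the negative
    # gradient of 0.5 * ||X - y1||^2 with respect to y1.
    residual = X - y1

    # Second pass, with its own projections, attends to the residual.
    # ASSUMPTION: queries, keys, and values all come from the residual.
    correction = attention(residual, residual, *p2)

    # Per-dimension sigmoid gate, playing the role of shrinkage.
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return y1 + gate * correction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    X = rng.normal(size=(5, d))
    p1 = [rng.normal(size=(d, d)) for _ in range(3)]
    p2 = [rng.normal(size=(d, d)) for _ in range(3)]
    print(boosted_attention(X, p1, p2, np.zeros(d)).shape)
```

In this sketch, driving the gate toward zero recovers standard attention, which mirrors how a shrinkage parameter of zero disables a boosting stage. Passing `p2 = p1` would correspond loosely to a shared-projection correction in the spirit of twicing, though the abstract does not give enough detail to reproduce that baseline exactly.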