[2601.16514] Finite-Time Analysis of Gradient Descent for Shallow Transformers

arXiv - Machine Learning · 3 min read

About this article

Abstract page for arXiv paper 2601.16514: Finite-Time Analysis of Gradient Descent for Shallow Transformers

Computer Science > Machine Learning

arXiv:2601.16514 (cs) [Submitted on 23 Jan 2026 (v1), last revised 2 Apr 2026 (this version, v2)]

Title: Finite-Time Analysis of Gradient Descent for Shallow Transformers

Authors: Enes Arda, Semih Cayci, Atilla Eryilmaz

Abstract: Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and compare Transformers with recurrent architectures on an autoregressive task.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Cite as: arXiv:2601.16514 [cs.LG] (or arXiv:2601.16514v2 [cs.LG] for this version) https://do...
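The abstract gives only the high-level setup, so here is a minimal, illustrative sketch of that setup: a single-layer Transformer with m independently initialized attention heads, trained by projected gradient descent (a gradient step followed by projection back onto a ball around the initialization, one standard way to keep the weights in the near-initialization region where the kernel approximation holds). The scalar readout, squared loss, projection radius R, and 1/m head averaging are all assumptions for illustration; the paper's exact parameterization is not given on this page.

```python
# A minimal sketch of the setup in the abstract: a shallow (single-layer)
# Transformer with m independent attention heads, trained by projected
# gradient descent in a teacher-student setting. The scalar readout, squared
# loss, projection radius R, and 1/m head averaging are illustrative
# assumptions, not the paper's exact parameterization.
import torch

torch.manual_seed(0)

T, d, m, n = 16, 8, 64, 128    # sequence length, embedding dim, heads, samples
R, lr, steps = 1.0, 0.5, 200   # projection radius and GD hyperparameters

# Teacher-student data: a fixed random single-head "teacher" labels sequences.
X = torch.randn(n, T, d)

def head_out(X, w):
    # One attention head with a scalar readout: softmax(X w X^T / sqrt(d)) X,
    # contracted against X and averaged over positions and dimensions.
    A = torch.softmax(X @ w @ X.transpose(1, 2) / d ** 0.5, dim=-1)
    return torch.einsum('btd,btd->b', A @ X, X) / (T * d)

with torch.no_grad():
    y = head_out(X, torch.randn(d, d) / d ** 0.5)

# m independently initialized heads; their outputs are averaged.
W = [torch.randn(d, d, requires_grad=True) for _ in range(m)]
W0 = [w.detach().clone() for w in W]   # anchor point for the projection

def model(X):
    return sum(head_out(X, w) for w in W) / m

for step in range(steps):
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for w, w0 in zip(W, W0):
            w -= lr * w.grad
            # Projected GD: map each head back onto the Frobenius ball of
            # radius R around its initialization, so the weights stay close
            # to where the linearized (kernel) approximation is accurate.
            diff = w - w0
            if diff.norm() > R:
                w.copy_(w0 + diff * (R / diff.norm()))
            w.grad.zero_()
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.5f}")
```

Keeping each head inside a small ball around its initialization is what makes the kernel-regime analysis apply; the abstract's claim is that, for guarantees of this kind, the number of heads m need only grow logarithmically with the sample size n.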

Originally published on April 03, 2026. Curated by AI News.

Related Articles

Machine Learning

How do you anonymize code for a conference submission? [D]

Hi everyone, I have a question about anonymizing code for conference submissions. I’m submitting an AI/ML paper to a conference and would...

Reddit - Machine Learning · 1 min ·
Machine Learning

Now Meta will track what employees do on their computers to train its AI agents | The Verge

Meta is reportedly using tracking software to record its employees’ mouse and keyboard activity for training data for its AI agents.

The Verge - AI · 4 min ·
LLMs

Training-time intervention yields 63.4% blind-pair human preference at matched val-loss (1.2B params, 320 judgments, p = 1.98 × 10⁻⁵) [R]

TL;DR. I ran a blind A/B preference evaluation between two 1.2B-parameter LMs trained on identical data (same order, same seed, 30K steps...

Reddit - Machine Learning · 1 min ·
Machine Learning

I can't believe text normalization is so underdiscussed in streaming text-to-speech [D]

Kinda surprises me how little discussion there is about mistakes in streaming TTS models. People look for natural readers, high voic...

Reddit - Machine Learning · 1 min ·