[2601.16514] Finite-Time Analysis of Gradient Descent for Shallow Transformers
Computer Science > Machine Learning

arXiv:2601.16514 (cs) [Submitted on 23 Jan 2026 (v1), last revised 2 Apr 2026 (this version, v2)]

Title: Finite-Time Analysis of Gradient Descent for Shallow Transformers

Authors: Enes Arda, Semih Cayci, Atilla Eryilmaz

Abstract: Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and compare Transformers with recurrent architectures on an autoregressive task.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Cite as: arXiv:2601.16514 [cs.LG] (or arXiv:2601.16514v2 [cs.LG] for this version)
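The abstract describes a shallow attention model with $m$ independent heads trained by projected gradient descent in the kernel regime and evaluated in a teacher-student setting. The sketch below is a minimal, hypothetical rendering of that setup in PyTorch, not the authors' construction: the averaging over heads, the last-token readout, the projection radius, and all names (ShallowTransformer, projected_gd_step) are illustrative assumptions.

```python
# Minimal sketch of a shallow Transformer with m independent single-head
# attention blocks, trained by projected gradient descent. Illustrative only;
# the paper's exact parameterization, scaling, and projection set may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShallowTransformer(nn.Module):
    def __init__(self, d: int, m: int):
        super().__init__()
        # m independent single-head attention blocks ("heads").
        self.heads = nn.ModuleList(
            [nn.MultiheadAttention(d, num_heads=1, batch_first=True) for _ in range(m)]
        )
        # Linear readout; the 1/m averaging below is a common kernel-regime
        # convention and an assumption here.
        self.readout = nn.Linear(d, 1, bias=False)
        self.m = m

    def forward(self, x):  # x: (batch, T, d)
        out = sum(head(x, x, x, need_weights=False)[0] for head in self.heads) / self.m
        return self.readout(out[:, -1, :]).squeeze(-1)  # predict from last token


def projected_gd_step(model, loss, lr=0.1, radius=1.0, init_params=None):
    """One projected gradient descent step: take a gradient step, then project
    each parameter back into an l2 ball of the given radius around its
    initialization (a standard way to stay near the kernel/lazy regime;
    the radius value is an assumption)."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            p -= lr * p.grad
            if init_params is not None:
                delta = p - init_params[name]
                norm = delta.norm()
                if norm > radius:
                    p.copy_(init_params[name] + delta * (radius / norm))


# Teacher-student style usage on synthetic sequences of length T.
d, m, T, n = 8, 32, 16, 256
student = ShallowTransformer(d, m)
init = {k: v.detach().clone() for k, v in student.named_parameters()}
X = torch.randn(n, T, d)
y = torch.randn(n)  # stand-in for teacher-generated labels
for _ in range(100):
    loss = F.mse_loss(student(X), y)
    projected_gd_step(student, loss, lr=0.05, radius=1.0, init_params=init)
```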