[2601.23236] YuriiFormer: A Suite of Nesterov-Accelerated Transformers
Computer Science > Machine Learning

arXiv:2601.23236 (cs)

[Submitted on 30 Jan 2026 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: YuriiFormer: A Suite of Nesterov-Accelerated Transformers

Authors: Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet

Abstract: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie--Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)

Cite as: arXiv:2601.23236 [cs.LG]
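The abstract's optimization view can be illustrated with a minimal sketch. Everything below is our own toy construction, not the paper's architecture: the "attention" and "MLP" oracles are stand-in gradient steps on simple quadratic energies, and the momentum rule is the textbook Nesterov extrapolation applied around the same two oracles, as the abstract suggests. Step sizes, the momentum coefficient `beta`, and the layer count are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical stand-ins for the paper's two oracles. The real model keeps
# actual self-attention and MLP blocks; here each oracle is a gradient step
# on a toy quadratic energy so the iteration structure is visible.
def attention_step(x, step=0.1):
    # Gradient step on a toy interaction energy: pull each token toward
    # the mean of all tokens (tokens "attend" to each other).
    return x - step * (x - x.mean(axis=0, keepdims=True))

def mlp_step(x, step=0.1):
    # Gradient step on a toy tokenwise potential energy 0.5 * ||x||^2.
    return x - step * x

def vanilla_layers(x, n_layers=8):
    # GPT-style stack as vanilla gradient descent via Lie--Trotter
    # splitting: alternate the two gradient steps, layer after layer.
    for _ in range(n_layers):
        x = mlp_step(attention_step(x))
    return x

def nesterov_layers(x, n_layers=8, beta=0.9):
    # Nesterov-style acceleration with the SAME two oracles: extrapolate
    # to a look-ahead point, then apply the split gradient steps there.
    x_prev = x.copy()
    for _ in range(n_layers):
        y = x + beta * (x - x_prev)          # momentum look-ahead
        x_prev, x = x, mlp_step(attention_step(y))
    return x

tokens = np.random.default_rng(0).normal(size=(16, 32))  # 16 tokens, dim 32
out_vanilla = vanilla_layers(tokens)
out_nesterov = nesterov_layers(tokens)
```

Both stacks consume identical oracles and differ only in the update rule, which is the sense in which the paper's accelerated variant "preserves the same attention and MLP oracles": acceleration changes how layer outputs are combined, not the layers themselves.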