[2512.17131] Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Computer Science > Machine Learning
arXiv:2512.17131 (cs)
[Submitted on 18 Dec 2025 (v1), last revised 27 Feb 2026 (this version, v3)]

Title: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Authors: Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers, such as single-worker DiLoCo and Schedule-Free, within a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure to periodically aggregate pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants to enable smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with exponential moving averaging. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW with reduced memory overhead. GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline in terms of steps to reach a target validation loss for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT workload, GPA achieves speedups of 7% and 25.5% in the small and large batch settings, respectively. Furthermore, we prove ...
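To make the abstract's structural claim concrete, the following is a minimal sketch of a GPA-style update built on the published Schedule-Free structure (interpolated evaluation point, base SGD step, averaged iterate), with the uniform average swapped for an exponential moving average as the abstract describes. The constants `beta` (interpolation) and `c` (EMA rate), the function `gpa_sgd`, and the plain-SGD inner step are illustrative assumptions, not the paper's exact formulation or tuned values.

```python
def gpa_sgd(grad, w0, lr=0.1, beta=0.9, c=0.05, steps=200):
    """Sketch of a GPA-style loop: Schedule-Free shape, EMA averaging.

    Assumptions (not from the paper): plain SGD as the base step,
    fixed beta and c. In Schedule-Free, x would be the uniform
    (running) average of z; here it is an EMA, per the abstract.
    """
    z = x = w0
    for _ in range(steps):
        y = (1 - beta) * z + beta * x   # gradient is evaluated at the interpolated point
        z = z - lr * grad(y)            # base optimizer step on the fast iterate
        x = (1 - c) * x + c * z         # EMA of iterates replaces uniform averaging
    return x

# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = gpa_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

On this toy quadratic the averaged iterate `x` converges to the minimizer; the EMA rate `c` plays the role that the paper attributes to the decoupled interpolation constants, trading smoothing strength against lag.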