[2512.17131] Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Computer Science > Machine Learning
arXiv:2512.17131 (cs)
[Submitted on 18 Dec 2025 (v1), last revised 27 Feb 2026 (this version, v3)]

Title: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Authors: Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers, such as single-worker DiLoCo and Schedule-Free, within a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure to periodically aggregate pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants to enable smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with exponential moving averaging. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW with reduced memory overhead. GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline in terms of steps to reach a target validation loss for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT workload, GPA achieves speedups of 7% and 25.5% in the small and large batch settings, respectively. Furthermore, we prove ...
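To make the abstract's structural claim concrete, the following is a minimal sketch of a GPA-style update built on the published Schedule-Free structure (interpolated evaluation point, base SGD step, averaged iterate), with the uniform average swapped for an exponential moving average as the abstract describes. The constants `beta` (interpolation) and `c` (EMA rate), the function `gpa_sgd`, and the plain-SGD inner step are illustrative assumptions, not the paper's exact formulation or tuned values.

```python
def gpa_sgd(grad, w0, lr=0.1, beta=0.9, c=0.05, steps=200):
    """Sketch of a GPA-style loop: Schedule-Free shape, EMA averaging.

    Assumptions (not from the paper): plain SGD as the base step,
    fixed beta and c. In Schedule-Free, x would be the uniform
    (running) average of z; here it is an EMA, per the abstract.
    """
    z = x = w0
    for _ in range(steps):
        y = (1 - beta) * z + beta * x   # gradient is evaluated at the interpolated point
        z = z - lr * grad(y)            # base optimizer step on the fast iterate
        x = (1 - c) * x + c * z         # EMA of iterates replaces uniform averaging
    return x

# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = gpa_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

On this toy quadratic the averaged iterate `x` converges to the minimizer; the EMA rate `c` plays the role that the paper attributes to the decoupled interpolation constants, trading smoothing strength against lag.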