[2505.23725] MuLoCo: Muon is a practical inner optimizer for DiLoCo
Summary
The paper presents MuLoCo, a variant of the DiLoCo framework that uses Muon as the inner optimizer, and shows that it outperforms standard DiLoCo (with an AdamW inner optimizer) when pre-training large language models.
Why It Matters
DiLoCo reduces communication costs when training large language models across distributed accelerators, but its performance has been shown to degrade as the number of workers grows. By swapping the inner optimizer for Muon, MuLoCo mitigates this degradation, making communication-efficient training more practical at larger worker counts.
Key Takeaways
- MuLoCo (Muon inner optimizer) outperforms DiLoCo (AdamW inner optimizer) when pre-training language models from 150M to 3.1B parameters.
- The choice of inner optimizer significantly shapes the pseudogradient seen by the outer optimizer, and hence training efficiency.
- Muon's normalized steps yield more directionally correct pseudogradients as the worker count K increases, so MuLoCo maintains performance at higher worker counts.
- Hyperparameter tuning is essential for optimizing model performance.
- Compatibility with quantization and streaming enhances its utility.
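The quantization takeaway refers to compressing the communicated pseudogradients to low precision. As an illustration only (this is a generic uniform symmetric quantizer, not necessarily the paper's exact compression scheme, and the function names and bit widths are assumptions), a pseudogradient can be rounded to a few bits plus a scale before communication:

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Symmetric uniform quantization of a pseudogradient to `bits` bits.

    Returns integer codes plus the scale needed to dequantize. Codes are
    stored in int8 for simplicity; real systems would pack 4-bit codes.
    """
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit symmetric
    scale = np.max(np.abs(x)) / levels + 1e-12
    q = np.clip(np.round(x / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float pseudogradient from codes + scale."""
    return q.astype(np.float32) * scale
```

With round-to-nearest, the per-element reconstruction error is at most half of one quantization step (`scale / 2`), which is why well-conditioned pseudogradients survive aggressive compression.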
Computer Science > Machine Learning
arXiv:2505.23725 (cs)
[Submitted on 29 May 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Title: MuLoCo: Muon is a practical inner optimizer for DiLoCo
Authors: Benjamin Thérien, Xiaolong Huang, Aaron Defazio, Irina Rish, Eugene Belilovsky
Abstract: DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and increased accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers (K) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data parallel (DP) training, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as the number of workers (K) increases. In our experiments pre-training language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with K>=1 workers, MuLoCo (Muon inner ...
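The abstract's two moving parts can be sketched concretely: Muon's inner step approximately orthogonalizes the momentum (commonly via a quintic Newton-Schulz iteration), and DiLoCo's outer step averages the per-worker parameter deltas into a pseudogradient. The sketch below is a minimal numpy illustration under stated assumptions, not the paper's implementation; coefficients follow the commonly published Muon recipe, the outer step is plain SGD for brevity (DiLoCo typically uses Nesterov-momentum SGD), and all function names and hyperparameter values are illustrative:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

def muon_inner_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style inner step: momentum, then an orthogonalized update."""
    momentum = beta * momentum + grad
    param = param - lr * newton_schulz_orthogonalize(momentum)
    return param, momentum

def diloco_outer_step(global_param, worker_params, outer_lr=0.7):
    """DiLoCo-style outer step: the pseudogradient is the average of the
    (global - local) deltas across the K workers."""
    pseudograd = np.mean([global_param - w for w in worker_params], axis=0)
    return global_param - outer_lr * pseudograd   # move toward the workers
```

Because the orthogonalized update has roughly unit singular values regardless of the raw gradient's scale, each worker's delta (and hence the averaged pseudogradient) is better conditioned, which is the mechanism the abstract credits for more directionally correct pseudogradients at larger K.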