[2505.23725] MuLoCo: Muon is a practical inner optimizer for DiLoCo

arXiv - Machine Learning · 4 min read

Summary

The paper presents MuLoCo, a variant of the DiLoCo framework that uses Muon as the inner optimizer, and demonstrates that it outperforms the standard AdamW-based DiLoCo when pre-training large language models.
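
Muon's defining step replaces the usual elementwise update with an approximately orthogonalized matrix update. As a rough illustration of that normalization, here is a minimal NumPy sketch of the Newton–Schulz iteration commonly used to implement Muon; the quintic coefficients and five-step count follow common open-source implementations, not code from this paper:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to the nearest semi-orthogonal matrix.

    Illustrative sketch of the Newton-Schulz iteration behind Muon's
    normalized step; coefficients are the commonly used quintic ones.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius norm is an upper bound on the spectral norm, so this
    # normalization keeps all singular values in [0, 1].
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        # Quintic polynomial in the singular values, pushing them toward 1.
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

# Hypothetical example: orthogonalize a random "gradient" matrix.
rng = np.random.default_rng(0)
G = rng.normal(size=(4, 6))
O = newton_schulz_orthogonalize(G)
```

In Muon this iteration is applied to the momentum buffer of each weight matrix, so every layer's update has roughly unit singular values regardless of the raw gradient's scale.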

Why It Matters

As machine learning models grow in complexity, optimizing their training processes becomes crucial. MuLoCo addresses performance degradation in DiLoCo when scaling up the number of workers, thus enhancing efficiency in training large language models, which is vital for advancements in AI applications.

Key Takeaways

  • MuLoCo (DiLoCo with Muon as the inner optimizer) consistently outperforms AdamW-based DiLoCo in language model pre-training.
  • The choice of inner optimizer significantly impacts training efficiency.
  • MuLoCo maintains performance even with increased worker counts.
  • Hyperparameter tuning is essential for optimizing model performance.
  • Compatibility with quantization and streaming enhances its utility.
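
The last takeaway concerns compressing the communicated updates. As a purely illustrative sketch (not the paper's scheme), symmetric per-tensor int8 quantization of a pseudogradient would shrink each synchronization payload by 4x relative to float32:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (illustrative sketch)."""
    scale = np.max(np.abs(x)) / 127.0
    if scale == 0.0:
        return np.zeros_like(x, dtype=np.int8), 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical pseudogradient tensor.
rng = np.random.default_rng(0)
delta = rng.normal(size=1000).astype(np.float32)

q, s = quantize_int8(delta)
recovered = dequantize_int8(q, s)
print(q.nbytes, delta.nbytes)  # 1000 4000: a 4x smaller payload
```

The round-trip error is bounded by half the scale per element, which is why quantizing the (small) pseudogradient deltas rather than raw weights is attractive.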

Computer Science > Machine Learning
arXiv:2505.23725 (cs)
[Submitted on 29 May 2025 (v1), last revised 25 Feb 2026 (this version, v2)]

Title: MuLoCo: Muon is a practical inner optimizer for DiLoCo
Authors: Benjamin Thérien, Xiaolong Huang, Aaron Defazio, Irina Rish, Eugene Belilovsky

Abstract: DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and increased accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers (K) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data parallel (DP) training, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as the number of workers (K) increases. In our experiments pre-training language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with K>=1 workers, MuLoCo (Muon inner ...
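
The abstract's two-level structure can be sketched on a toy quadratic objective: each worker runs H local inner-optimizer steps, the pseudogradient is the averaged start-minus-end parameter delta, and an outer momentum step applies it. Everything below is illustrative; plain SGD stands in for the AdamW/Muon inner optimizers, and heavy-ball momentum stands in for DiLoCo's outer Nesterov step:

```python
import numpy as np

# Illustrative sketch of the DiLoCo inner/outer loop on a toy quadratic.
rng = np.random.default_rng(0)
d, K, H, T = 5, 4, 20, 50            # dim, workers, inner steps, outer rounds
target = rng.normal(size=d)          # shared optimum of all workers' losses

def noisy_grad(theta, worker_rng):
    # Gradient of 0.5 * ||theta - target||^2 plus worker-local noise.
    return (theta - target) + 0.1 * worker_rng.normal(size=d)

theta = np.zeros(d)                  # globally synchronized parameters
outer_m = np.zeros(d)                # outer momentum buffer
outer_lr, beta = 0.7, 0.9
worker_rngs = [np.random.default_rng(k + 1) for k in range(K)]

initial_err = np.linalg.norm(theta - target)
for _ in range(T):
    pseudograds = []
    for k in range(K):               # each worker trains locally...
        local = theta.copy()
        for _ in range(H):
            # Inner step: SGD as a stand-in for AdamW/Muon.
            local -= 0.05 * noisy_grad(local, worker_rngs[k])
        pseudograds.append(theta - local)   # pseudogradient: start minus end
    pg = np.mean(pseudograds, axis=0)       # ...then one all-reduce per round
    outer_m = beta * outer_m + pg           # outer momentum on the pseudogradient
    theta -= outer_lr * outer_m

final_err = np.linalg.norm(theta - target)
```

Communication happens once per H inner steps rather than every step, which is the networking advantage the abstract describes; the paper's question is how the inner optimizer's step geometry shapes the quality of `pg` as K grows.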

