[2602.21545] Muon+: Towards Better Muon via One Additional Normalization Step

arXiv - Machine Learning · 3 min read

Summary

The paper introduces Muon+, an enhancement to the Muon optimizer that inserts one extra normalization step after orthogonalization, improving pre-training performance for large language models.

Why It Matters

Pre-training large language models is expensive, so optimizer improvements translate directly into compute savings. Muon has become a popular choice for LLM pre-training, and Muon+ is a small modification to it whose reported perplexity gains hold consistently across model scales and token budgets, making it directly relevant to practitioners.

Key Takeaways

  • Muon+ adds a single normalization step after Muon's orthogonalization step.
  • Extensive pre-training experiments show consistent gains in both training and validation perplexity over Muon.
  • The evaluation covers GPT-style models (130M to 774M parameters) and LLaMA-style models (60M to 1B parameters).
  • Results hold in the compute-optimal regime and at an industrial token-to-parameter (T2P) ratio of roughly 200.
  • The findings may influence optimizer choices for future large language model pre-training.

Computer Science > Machine Learning

arXiv:2602.21545 (cs) [Submitted on 25 Feb 2026]

Title: Muon+: Towards Better Muon via One Additional Normalization Step
Authors: Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of ≈ 200. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: this https URL.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2602.21545 [cs.LG] (or arXiv:2602.21545v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2602.21545
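The abstract names the extra step but does not specify it, so below is a minimal sketch of what a Muon-style update with one additional normalization could look like. The Newton–Schulz orthogonalization and its coefficients follow the commonly used open-source Muon recipe; the specific final normalization (rescaling the orthogonalized update to unit RMS) and every name in this sketch are illustrative assumptions, not the paper's exact method.

    # Minimal sketch of a Muon-style update with one extra normalization step.
    # ASSUMPTIONS: Newton-Schulz coefficients follow the widely circulated
    # open-source Muon implementation; the final unit-RMS rescaling and all
    # names here are illustrative, not necessarily the paper's exact method.
    import numpy as np

    def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
        """Approximately map g to its nearest semi-orthogonal matrix."""
        a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
        x = g / (np.linalg.norm(g) + 1e-7)  # pre-scale so singular values <= 1
        transposed = x.shape[0] > x.shape[1]
        if transposed:                      # keep the Gram matrix small
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * (s @ s)) @ x
        return x.T if transposed else x

    def muon_plus_step(w, grad, momentum, lr=0.02, beta=0.95):
        """One simplified Muon+-style update on a 2-D weight matrix."""
        momentum = beta * momentum + grad               # momentum buffer
        update = newton_schulz_orthogonalize(momentum)  # Muon's core step
        # Hypothetical extra normalization: rescale entries to unit RMS.
        update = update / (np.sqrt(np.mean(update ** 2)) + 1e-7)
        return w - lr * update, momentum

    # Toy usage: one update on a random 256x128 weight matrix.
    rng = np.random.default_rng(0)
    w = 0.02 * rng.standard_normal((256, 128))
    m = np.zeros_like(w)
    g = rng.standard_normal(w.shape)
    w, m = muon_plus_step(w, g, m)

A per-matrix RMS rescaling is one plausible reading of "normalization" because it decouples update magnitude from matrix shape, but readers should consult the paper's code for the actual definition.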

Related Articles

Llms

OpenClaw security checklist: practical safeguards for AI agents

Here is one of the better quality guides on ensuring safety when deploying OpenClaw: https://chatgptguide.ai/openclaw-security-checkl...

Reddit - Artificial Intelligence · 1 min

Llms

I let Gemini in Google Maps plan my day and it went surprisingly well | The Verge

Gemini in Google Maps is a surprisingly useful way to explore new territory.

The Verge - AI · 11 min
Llms

The person who replaces you probably won't be AI. It'll be someone from the next department over who learned to use it - opinion/discussion

I'm a strategy person by background. Two years ago I'd write a recommendation and hand it to a product team. Now... I describe what I want...

Reddit - Artificial Intelligence · 1 min

Llms

Block Resets Management With AI As Cash App Adds Installment Transfers

Block (NYSE:XYZ) plans a permanent organizational overhaul that replaces many middle management roles with AI-driven models to create fla...

AI Tools & Products · 5 min
