[2602.21545] Muon+: Towards Better Muon via One Additional Normalization Step

arXiv - Machine Learning · 3 min read

Summary

The paper introduces Muon+, an enhancement to the Muon optimizer that inserts one extra normalization step after orthogonalization, improving pre-training performance for large language models.

Why It Matters

Pre-training large language models is expensive, so optimizer improvements translate directly into compute savings. Muon has become a popular choice for LLM pre-training, and Muon+ is a small modification to it whose reported perplexity gains hold consistently across model scales and token budgets, making it directly relevant to practitioners.

Key Takeaways

  • Muon+ adds a single normalization step after Muon's orthogonalization step.
  • Extensive pre-training experiments show consistent gains in both training and validation perplexity over Muon.
  • The evaluation covers GPT-style models (130M to 774M parameters) and LLaMA-style models (60M to 1B parameters).
  • Results hold in the compute-optimal regime and at an industrial token-to-parameter (T2P) ratio of roughly 200.
  • The findings may influence optimizer choices for future large language model pre-training.

Computer Science > Machine Learning

arXiv:2602.21545 (cs) [Submitted on 25 Feb 2026]

Title: Muon+: Towards Better Muon via One Additional Normalization Step
Authors: Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of ≈ 200. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: this https URL.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2602.21545 [cs.LG] (or arXiv:2602.21545v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2602.21545
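The abstract names the extra step but does not specify it, so below is a minimal sketch of what a Muon-style update with one additional normalization could look like. The Newton–Schulz orthogonalization and its coefficients follow the commonly used open-source Muon recipe; the specific final normalization (rescaling the orthogonalized update to unit RMS) and every name in this sketch are illustrative assumptions, not the paper's exact method.

    # Minimal sketch of a Muon-style update with one extra normalization step.
    # ASSUMPTIONS: Newton-Schulz coefficients follow the widely circulated
    # open-source Muon implementation; the final unit-RMS rescaling and all
    # names here are illustrative, not necessarily the paper's exact method.
    import numpy as np

    def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
        """Approximately map g to its nearest semi-orthogonal matrix."""
        a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
        x = g / (np.linalg.norm(g) + 1e-7)  # pre-scale so singular values <= 1
        transposed = x.shape[0] > x.shape[1]
        if transposed:                      # keep the Gram matrix small
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * (s @ s)) @ x
        return x.T if transposed else x

    def muon_plus_step(w, grad, momentum, lr=0.02, beta=0.95):
        """One simplified Muon+-style update on a 2-D weight matrix."""
        momentum = beta * momentum + grad               # momentum buffer
        update = newton_schulz_orthogonalize(momentum)  # Muon's core step
        # Hypothetical extra normalization: rescale entries to unit RMS.
        update = update / (np.sqrt(np.mean(update ** 2)) + 1e-7)
        return w - lr * update, momentum

    # Toy usage: one update on a random 256x128 weight matrix.
    rng = np.random.default_rng(0)
    w = 0.02 * rng.standard_normal((256, 128))
    m = np.zeros_like(w)
    g = rng.standard_normal(w.shape)
    w, m = muon_plus_step(w, g, m)

A per-matrix RMS rescaling is one plausible reading of "normalization" because it decouples update magnitude from matrix shape, but readers should consult the paper's code for the actual definition.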

Related Articles

Llms

OpenClaw security checklist: practical safeguards for AI agents

Here is one of the better quality guides on ensuring safety when deploying OpenClaw: https://chatgptguide.ai/openclaw-security-checkl...

Reddit - Artificial Intelligence · 1 min

Llms

I let Gemini in Google Maps plan my day and it went surprisingly well | The Verge

Gemini in Google Maps is a surprisingly useful way to explore new territory.

The Verge - AI · 11 min
Llms

The person who replaces you probably won't be AI. It'll be someone from the next department over who learned to use it - opinion/discussion

I'm a strategy person by background. Two years ago I'd write a recommendation and hand it to a product team. Now... I describe what I want...

Reddit - Artificial Intelligence · 1 min

Llms

Block Resets Management With AI As Cash App Adds Installment Transfers

Block (NYSE:XYZ) plans a permanent organizational overhaul that replaces many middle management roles with AI-driven models to create fla...

AI Tools & Products · 5 min
