[2509.23106] Effective Quantization of Muon Optimizer States
Summary
The paper presents the 8-bit Muon optimizer, which cuts the memory footprint of optimizer states in large-scale machine learning models while achieving performance parity with full-precision Muon.
Why It Matters
As machine learning models grow in size, optimizing memory usage without sacrificing performance is crucial. This research introduces a quantization method that addresses these challenges, making it relevant for developers and researchers in the field of machine learning.
Key Takeaways
- The 8-bit Muon optimizer matches full-precision Muon on validation loss and downstream benchmarks while reducing the optimizer state footprint by up to 62%.
- It builds on Muon, which has shown faster convergence and better computational efficiency than AdamW in LLM pre-training.
- Muon's update mechanism is compatible with a simple linear quantization scheme, avoiding the complex dynamic scaling that quantized AdamW requires.
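The memory claim can be checked with back-of-envelope arithmetic. The sketch below is a hypothetical estimate, not the paper's accounting: it assumes a block size of 256 and one fp32 scale per block, and compares the quantized state against an fp32 baseline. Under those assumptions the quantized state alone shrinks by roughly 75%; the reported figure of up to 62% is for the full optimizer state footprint, which may include components kept at higher precision.

```python
def state_bytes(n_params: float, bits: int = 8, block: int = 256,
                scale_bytes: int = 4) -> float:
    """Bytes for a blockwise-quantized state tensor.

    Assumed layout (hypothetical): `bits`-bit payload per parameter
    plus one `scale_bytes` scale per block of `block` elements.
    """
    return n_params * bits / 8 + (n_params / block) * scale_bytes

n_params = 2.7e9                      # 2.7B-parameter model, as in the paper
fp32_state = n_params * 4             # fp32 baseline: 4 bytes per parameter
q8_state = state_bytes(n_params)      # quantized payload + per-block scales
reduction = 1 - q8_state / fp32_state # fraction of state memory saved
print(f"reduction of this state tensor: {reduction:.1%}")
```

This gives roughly a 75% reduction for the quantized tensor itself, so the per-block scale overhead (4/256 of a byte per parameter) is negligible at this block size.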
Computer Science > Machine Learning
arXiv:2509.23106 (cs)
[Submitted on 27 Sep 2025 (v1), last revised 16 Feb 2026 (this version, v2)]
Title: Effective Quantization of Muon Optimizer States
Authors: Aman Gupta, Rafael Celente, Abhishek Shivanna, D.T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi
Abstract: The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency over AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments on pre-training models of up to 2.7B in size and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while achieving up to a 62% reduction in optimizer state footprint. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.
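The abstract's "blockwise quantization" with a "simple linear quantization scheme" can be illustrated with a short sketch. This is not the paper's implementation: the block size, signed-8-bit range, and absmax scaling below are assumptions chosen to show the general technique of quantizing a momentum tensor block by block with one linear scale per block.

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block_size: int = 256):
    """Linear (absmax) blockwise quantization of a state tensor to int8.

    Each block of `block_size` values is mapped to [-127, 127] using one
    fp32 scale per block. Block size and scale dtype are assumptions.
    """
    flat = x.astype(np.float32).ravel()
    pad = (-flat.size) % block_size          # pad so blocks divide evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                # avoid divide-by-zero on empty blocks
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, x.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    """Invert the linear map: int8 payload times per-block scale."""
    flat = (q.astype(np.float32) / 127.0) * scales
    flat = flat.ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)
```

An optimizer using this scheme would dequantize a state block, apply the update, and requantize, paying one fp32 scale per 256 values of overhead. The appeal noted in the abstract is that this fixed linear map suffices for Muon, with no dynamic rescaling of the quantization grid between steps.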