[2602.13498] TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
Summary
TrasMuon is a novel optimization technique that improves the stability and efficiency of orthogonalized momentum optimizers, outperforming traditional baselines in empirical tests.
Why It Matters
This research addresses critical challenges in machine learning optimization, particularly the sensitivity of existing methods to hyperparameters and high-energy bursts. By improving optimization stability and convergence rates, TrasMuon could significantly enhance model training processes across various applications in AI.
Key Takeaways
- TrasMuon stabilizes optimization by preserving near-isometric geometry while adapting magnitudes.
- The method incorporates global RMS calibration and energy-based trust-region clipping to enhance stability.
- Empirical results show TrasMuon converges faster than traditional baselines in vision and language models.
- The approach mitigates issues related to high-energy outliers that can destabilize training.
- TrasMuon demonstrates superior robustness without requiring warmup stages.
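The two stabilizing mechanisms above can be sketched in a few lines. The paper's exact formulas are not given in this summary, so the function below is a hypothetical illustration under two assumptions: "global RMS calibration" rescales the orthogonalized update to a target root-mean-square magnitude, and "energy-based trust-region clipping" bounds the ratio of the update's energy (squared Frobenius norm) to the raw gradient's energy. The name `tras_scale` and both hyperparameters are illustrative, not from the paper.

```python
import numpy as np

def tras_scale(update, grad, target_rms=1.0, energy_ratio_max=2.0):
    """Hypothetical sketch of TrasMuon-style magnitude stabilization.

    `update` is the orthogonalized (Muon-style) update matrix and `grad`
    the raw gradient; both are NumPy arrays of the same shape.
    """
    # (i) Global RMS calibration: rescale so the update's RMS entry
    # magnitude matches target_rms, restoring usable magnitude info.
    rms = np.sqrt(np.mean(update ** 2))
    scaled = update * (target_rms / (rms + 1e-12))

    # (ii) Energy-based trust-region clipping: if the update's energy
    # relative to the gradient's energy exceeds the trust-region bound,
    # shrink it back onto the boundary of the stable zone.
    ratio = np.sum(scaled ** 2) / (np.sum(grad ** 2) + 1e-12)
    if ratio > energy_ratio_max:
        scaled = scaled * np.sqrt(energy_ratio_max / ratio)
    return scaled
```

In this reading, the trust region acts like gradient clipping but in relative rather than absolute terms: a high-energy burst in the gradient cannot push the calibrated update beyond a fixed multiple of the gradient's own energy.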
Computer Science > Machine Learning
arXiv:2602.13498 (cs) [Submitted on 13 Feb 2026]
Title: TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
Authors: Peng Cheng, Jiucheng Zang, Qingnan Li, Liheng Ma, Yufei Cui, Yingxue Zhang, Boxing Chen, Ming Jian, Wen Tong
Abstract: Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (Trust Region Adaptive Scaling Muon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without wa...
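For context on the Newton-Schulz iterations the abstract refers to: Muon-style optimizers apply a fixed number of polynomial matrix iterations to push the momentum matrix toward its nearest (semi-)orthogonal factor, which is what makes the update near-isometric while discarding magnitude. A minimal sketch follows; the quintic coefficients are one common choice popularized by Muon implementations, not necessarily what this paper uses.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5):
    """Sketch of the Newton-Schulz orthogonalization used by Muon-style
    optimizers: iterate a fixed odd polynomial of M that drives all
    singular values toward 1, leaving the singular vectors intact.
    """
    # Coefficients of the quintic variant; assumed, not from the paper.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so the spectral norm is below 1 and the iteration converges.
    X = M / (np.linalg.norm(M) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After a handful of steps the singular values of the output cluster near 1 (orthogonality up to the iteration's tolerance), which is exactly the magnitude information TrasMuon's calibration and trust-region clipping then reintroduce in a controlled way.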