[2604.01472] The Newton-Muon Optimizer
Mathematics > Optimization and Control
arXiv:2604.01472 (math)
[Submitted on 1 Apr 2026]

Title: The Newton-Muon Optimizer
Authors: Zhehang Du, Weijie Su

Abstract: The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix $W$ using only three matrices: the gradient $G$, an output-space curvature matrix $H$, and the data matrix $Z$ that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay) $W \leftarrow W - \eta \cdot \mathrm{msgn}(G(ZZ^\top)^{-1})$, where $\eta$ is the learning rate and $\mathrm{msgn}(X)=UV^\top$ if $X=USV^\top$ is a compact singular value decomposition. This new optimization method, which we refer to as Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input secon...
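The closed-form update above can be sketched in NumPy. This is a minimal illustration of the formula only, not the authors' implementation: it assumes shapes $G \in \mathbb{R}^{m \times n}$, $Z \in \mathbb{R}^{n \times b}$ (so that $ZZ^\top$ is $n \times n$), adds a small damping term `eps` for invertibility (an assumption, not stated in the abstract), and omits the momentum and weight decay the abstract mentions.

```python
import numpy as np

def msgn(X):
    """Matrix sign: msgn(X) = U V^T, where X = U S V^T is a compact SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_step(W, G, Z, eta, eps=1e-6):
    """One Newton-Muon update: W <- W - eta * msgn(G (Z Z^T)^{-1}).

    eps is a hypothetical damping term added here so Z Z^T is invertible.
    """
    ZZt = Z @ Z.T + eps * np.eye(Z.shape[0])
    # Compute G (Z Z^T)^{-1} by solving a linear system instead of inverting;
    # Z Z^T is symmetric, so solve(ZZt, G.T).T equals G @ inv(ZZt).
    precond_grad = np.linalg.solve(ZZt, G.T).T
    return W - eta * msgn(precond_grad)
```

Note that `msgn` returns a semi-orthogonal matrix (its columns or rows are orthonormal, whichever is the shorter dimension), so every singular value of the applied update direction equals one, as in standard Muon; Newton-Muon differs only in the right preconditioner $(ZZ^\top)^{-1}$ applied to $G$ first.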