[2602.16340] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
Summary
This paper investigates the implicit bias of momentum-based optimizers such as Adam and Muon on smooth homogeneous neural networks, extending existing theory on steepest descent methods to momentum-based and normalized variants.
Why It Matters
Understanding the implicit bias of optimization algorithms is crucial for explaining and improving generalization in machine learning. This research strengthens the theoretical foundation of momentum-based optimizers, which are widely used to train neural networks, and can thereby inform future algorithm design and selection.
Key Takeaways
- Momentum-based optimizers are implicitly biased towards KKT points of a margin maximization problem in smooth homogeneous models.
- The choice of optimizer determines which norm's margin is maximized during training.
- The study extends previous work on steepest descent methods in linear models.
Computer Science > Machine Learning
arXiv:2602.16340 (cs)
[Submitted on 18 Feb 2026]
Title: The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
Authors: Eitan Gronich, Gal Vardi
Abstract: We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
Subjects: Machine Learning (cs.LG); Machine Learnin...
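To make the norm-dependence concrete, here is a minimal NumPy sketch (not the paper's implementation) of the normalized steepest descent directions associated with the optimizers named in the abstract: the $\ell_2$ direction behind MomentumGD, the $\ell_\infty$ sign direction behind Signum and Adam, and the spectral-norm direction behind Muon, which is the orthogonal polar factor $UV^\top$ of the gradient matrix. The function names and the simple $1/(1+t)$ decaying step size are illustrative assumptions, not from the paper.

```python
import numpy as np

def l2_direction(G):
    # Steepest descent direction w.r.t. the l2 norm: the normalized gradient.
    return G / np.linalg.norm(G)

def linf_direction(G):
    # Steepest descent direction w.r.t. the l_inf norm: the sign of the
    # gradient (the update used by Signum, and by Adam without the
    # stability constant).
    return np.sign(G)

def spectral_direction(G):
    # Steepest descent direction w.r.t. the spectral norm: U V^T from the
    # SVD of the gradient, i.e. its orthogonal polar factor (Muon's update).
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def normalized_steepest_descent_step(W, G, t, direction, eta0=0.1):
    # One update with an (assumed) decaying learning rate schedule.
    eta_t = eta0 / (1.0 + t)
    return W - eta_t * direction(G)
```

For a diagonal gradient the three directions coincide up to scaling, but for general matrices they differ, which is exactly why the maximized margin depends on the chosen optimizer.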