[2602.16340] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

arXiv - Machine Learning · 3 min read

Summary

This paper investigates the implicit bias of momentum-based optimizers like Adam and Muon in smooth homogeneous neural networks, extending existing theories on steepest descent methods.

Why It Matters

Understanding the implicit bias of different optimization algorithms is crucial for improving model performance in machine learning. This research contributes to the theoretical foundation of momentum-based optimizers, which are widely used in training neural networks, thereby influencing future algorithm development and application.

Key Takeaways

  • Momentum-based optimizers exhibit an implicit bias towards KKT points of a margin maximization problem in homogeneous models.
  • The choice of optimizer determines which norm's margin is maximized during training.
  • The study extends previous work on steepest descent in homogeneous models and on momentum-based optimizers in linear models.

Computer Science > Machine Learning

arXiv:2602.16340 (cs) · Submitted on 18 Feb 2026

Title: The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Authors: Eitan Gronich, Gal Vardi

Abstract: We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.

Subjects: Machine Learning (cs.LG)
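The abstract's pairing of optimizers with norms follows from how the steepest descent direction changes with the norm used to constrain the step. A minimal numpy sketch (not the paper's code; the toy gradients are invented for illustration) of the three directions it names: the normalized gradient for $\ell_2$ (MomentumGD), the sign of the gradient for $\ell_\infty$ (Signum, and Adam without the stability constant), and the orthogonal polar factor $UV^\top$ of a matrix gradient for the spectral norm (Muon):

```python
import numpy as np

def l2_direction(g):
    # Steepest descent under the l2 norm: the normalized gradient
    # (the direction taken by GD / MomentumGD).
    return g / np.linalg.norm(g)

def linf_direction(g):
    # Steepest descent under the l_inf norm: the elementwise sign of
    # the gradient (the direction taken by Signum).
    return np.sign(g)

def spectral_direction(G):
    # Steepest descent under the spectral norm for a matrix gradient:
    # the orthogonal polar factor U V^T of G (the direction Muon uses).
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Toy examples (hypothetical gradients, chosen for round numbers):
g = np.array([3.0, -4.0])
print(l2_direction(g))    # [ 0.6 -0.8]
print(linf_direction(g))  # [ 1. -1.]

G = np.array([[2.0, 0.0], [0.0, -1.0]])
print(spectral_direction(G))  # [[ 1.  0.] [ 0. -1.]]
```

The same gradient thus produces different update directions under different norms, which is why, per the abstract, "the identity of the margin maximized depends on the choice of optimizer."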
