[2510.19093] Weight Decay may matter more than muP for Learning Rate Transfer in Practice

arXiv - Machine Learning

Summary

This article investigates the role of weight decay versus the Maximal Update Parameterization (muP) in learning rate transfer for neural networks, finding that weight decay, rather than muP's learning rate scaling, is what stabilizes the update dynamics of internal representations through most of training.

Why It Matters

Understanding the dynamics of learning rate transfer is essential for optimizing neural network training, especially in large-scale applications. This research challenges existing assumptions about muP and emphasizes the importance of weight decay, which could lead to more efficient training strategies in machine learning.

Key Takeaways

  • Weight decay stabilizes update dynamics better than muP during training.
  • muP's scaling primarily serves as an implicit learning rate warmup.
  • The findings challenge existing beliefs about learning rate transfer.
  • Modified warmup schedules can effectively replace muP.
  • Empirical observations support the need for independent weight decay for effective transfer.

Computer Science > Machine Learning — arXiv:2510.19093 (cs)
[Submitted on 21 Oct 2025 (v1), last revised 13 Feb 2026 (this version, v2)]

Title: Weight Decay may matter more than muP for Learning Rate Transfer in Practice

Authors: Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen

Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests muP's scaling primarily acts as a form of implicit learning rate warmup...
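To make the two ingredients concrete, here is a minimal pure-Python sketch of (a) the muP-style rule that shrinks the learning rate of hidden (matrix-like) weights in proportion to width, and (b) a decoupled (AdamW-style) weight decay step in which the decay term is applied independently of the gradient. Function names, the SGD-style update, and the constants are illustrative, not the paper's implementation:

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """muP-style rule for hidden weights: learning rate scales as 1/width
    relative to the base (tuning) model's width."""
    return base_lr * base_width / width

def decoupled_weight_decay_step(w, grad, lr, wd):
    """One decoupled-decay update (plain SGD for simplicity): the decay
    term lr * wd * w_i is subtracted separately from the gradient step,
    rather than being folded into the gradient as L2 regularization."""
    return [w_i - lr * g_i - lr * wd * w_i for w_i, g_i in zip(w, grad)]

# Tune at width 256, then transfer the hidden-layer lr to width 1024:
lr_large = mup_hidden_lr(base_lr=0.01, base_width=256, width=1024)  # 0.0025
```

The paper's claim, in these terms, is that the 1/width factor mainly matters early in training (acting like an implicit warmup), while the width-independent decay term is what keeps representation updates stable across widths for the rest of training.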
