[2511.01734] A Proof of Learning Rate Transfer under μP
Summary
This paper gives the first proof of learning rate transfer in linear multi-layer perceptrons (MLPs) under μP, a parameterization designed to maximize feature learning in the infinite-width limit, showing that the optimal learning rate converges to a non-zero constant as network width increases.
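For context, μP is usually stated as a set of width-dependent scaling rules. Below is a compact statement in the abc-form common in the μP literature; this is a hedged summary of standard definitions, not necessarily the notation used in this paper. For an MLP of width n:

```latex
% abc-parameterization (requires amsmath): each layer carries a width
% multiplier, a width-scaled Gaussian init, and SGD a width-scaled LR.
\[
  W^{\ell} = n^{-a_{\ell}}\, w^{\ell}, \qquad
  w^{\ell}_{ij} \sim \mathcal{N}\!\left(0,\ n^{-2 b_{\ell}}\right), \qquad
  \eta_{\mathrm{SGD}} = \eta\, n^{-c}.
\]
% muP's choice: output-layer entries are effectively Theta(1/n), hidden
% layers Theta(1/sqrt(n)), and c = 0, so the base learning rate eta --
% the quantity that transfers -- carries no width dependence.
\[
  (a_{\ell},\, b_{\ell}) =
  \begin{cases}
    \left(-\tfrac{1}{2},\ \tfrac{1}{2}\right) & \text{input layer},\\[2pt]
    \left(0,\ \tfrac{1}{2}\right)             & \text{hidden layers},\\[2pt]
    \left(\tfrac{1}{2},\ \tfrac{1}{2}\right)  & \text{output layer},
  \end{cases}
  \qquad c = 0.
\]
```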
Why It Matters
Learning rate transfer matters because it lets practitioners tune the learning rate on a small model and reuse it at much larger widths, avoiding costly hyperparameter sweeps at scale. This research puts that practice on a rigorous footing, clarifying why the choice of parameterization determines whether tuned hyperparameters carry over.
Key Takeaways
- The paper gives the first proof of learning rate transfer with width in linear MLPs under μP.
- Under μP, the optimal learning rate converges to a non-zero constant as network width increases (see the sketch after this list).
- This behavior fails to hold under alternative parameterizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP).
- The findings are supported by both theoretical proofs and extensive empirical results.
- This research could inform better training practices in deep learning applications.
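To make the takeaways concrete, here is a minimal sketch, assuming PyTorch; the function name, toy task, and hyperparameters are all illustrative, and it uses one common multiplier-free rendering of μP for SGD rather than the paper's exact setup. All width dependence lives in the initialization and the per-layer learning rates, leaving the base `lr` as the single quantity to tune.

```python
# Hedged sketch of a muP-style linear MLP (multiplier-free rendering):
#   input layer : init std 1/sqrt(d_in)  (Theta(1) in n),  LR = lr * n
#   hidden      : init std 1/sqrt(n),                      LR = lr
#   output      : init std 1/n,                            LR = lr / n
# The base `lr` is the quantity claimed to transfer across widths.
import torch
import torch.nn as nn

def mup_mlp_and_sgd(d_in, n, d_out, n_hidden, lr):
    """Linear MLP (no activations) of width n, with muP init + per-layer LRs."""
    sizes = [d_in] + [n] * n_hidden + [d_out]
    layers, groups = [], []
    last = len(sizes) - 2
    for i, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        lin = nn.Linear(fan_in, fan_out, bias=False)
        if i == 0:          # input layer: width-independent entries
            std, lr_scale = fan_in ** -0.5, n
        elif i == last:     # output layer: smaller init, smaller LR
            std, lr_scale = 1.0 / fan_in, 1.0 / n
        else:               # hidden layers: standard fan-in init
            std, lr_scale = fan_in ** -0.5, 1.0
        nn.init.normal_(lin.weight, std=std)
        layers.append(lin)
        groups.append({"params": lin.parameters(), "lr": lr * lr_scale})
    return nn.Sequential(*layers), torch.optim.SGD(groups, lr=lr)

# Toy check: the same base lr should remain usable as the width n grows.
x, y = torch.randn(64, 16), torch.randn(64, 1)
for n in (128, 512, 2048):
    model, opt = mup_mlp_and_sgd(d_in=16, n=n, d_out=1, n_hidden=3, lr=0.1)
    for _ in range(20):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    print(f"width={n:4d}  final loss={loss.item():.4f}")
```

The design point is that the tuned base learning rate found at a small width can be reused at a large one; under SP, by contrast, the global learning rate itself would have to shrink with width.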
arXiv:2511.01734 [stat.ML]
Submitted on 3 Nov 2025 (v1); last revised 24 Feb 2026 (this version, v3)
Title: A Proof of Learning Rate Transfer under μP
Author: Soufiane Hayou
Abstract: We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with μP, a neural network parameterization designed to "maximize" feature learning in the infinite-width limit. We show that under μP, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2511.01734 [stat.ML] (or arXiv:2511.01734v3 [stat.ML] for this version); DOI: https://doi.org/10.48550/arXiv.2511.01734
Submission history: [v1] Mon, 3 Nov 2025 16:45:47 UTC (220 KB); [v2] Mon, 2 Feb 2026 16:33:34 UTC (227 KB); [v3] Tue...
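For comparison, the Standard and Neural Tangent parametrizations named in the abstract can be written in the same abc-form used above. Again this is a hedged summary of standard definitions from the literature, shown schematically for the layers whose fan-in is the width n:

```latex
% SP: no multipliers, fan-in (1/sqrt(n)) initialization, one global LR.
% The maximal stable learning rate then shrinks with width, so the
% optimal LR cannot converge to a non-zero constant.
\[
  \text{SP:}\qquad a_{\ell} = 0, \quad b_{\ell} = \tfrac{1}{2}.
\]
% NTP: 1/sqrt(n) multipliers with Theta(1) Gaussian entries and c = 0,
% which yields the kernel ("lazy") regime in the infinite-width limit.
\[
  \text{NTP:}\qquad a_{\ell} = \tfrac{1}{2}, \quad b_{\ell} = 0, \qquad c = 0.
\]
```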