[2603.00541] Spectral Condition for $μ$P under Width-Depth Scaling
Computer Science > Machine Learning
arXiv:2603.00541 (cs)
[Submitted on 28 Feb 2026]

Title: Spectral Condition for $\mu$P under Width-Depth Scaling
Authors: Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, Chongxuan Li

Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $\mu$P formulations ...
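To make the abstract's notion of a spectral condition concrete, the sketch below illustrates the general idea of prescribing target spectral norms for a residual block's weights and per-step updates as a function of width and depth. The width part, $\|W\|_2 = \Theta(\sqrt{n_\text{out}/n_\text{in}})$, follows the known spectral condition for width-only $\mu$P; the depth factor `depth ** (-alpha)` and the helper names are placeholders for illustration only, not the paper's actual scaling or API.

```python
import numpy as np

def spectral_mup_targets(n_in: int, n_out: int, depth: int, alpha: float = 0.5):
    """Hypothetical target spectral norms for a residual block's weight W
    and its per-step update dW under a width-depth-aware spectral condition.

    width_scale follows the width-only spectral muP condition,
    ||W||_2 = Theta(sqrt(n_out / n_in)); the depth factor depth**(-alpha)
    is an assumed placeholder (e.g. alpha = 0.5 mimics 1/sqrt(L) residual
    scaling), NOT the exponent derived in the paper.
    """
    width_scale = np.sqrt(n_out / n_in)
    depth_scale = depth ** (-alpha)
    target = width_scale * depth_scale
    # Same target is used for the weight and its update in this sketch.
    return target, target

def project_to_spectral_norm(W: np.ndarray, target: float) -> np.ndarray:
    """Rescale W so its spectral norm (largest singular value) equals target."""
    sigma_max = np.linalg.norm(W, ord=2)  # ord=2 on a matrix = top singular value
    return W * (target / sigma_max)

# Example: one residual block in a depth-16 network, widening 256 -> 512.
rng = np.random.default_rng(0)
n_in, n_out, depth = 256, 512, 16
w_target, dw_target = spectral_mup_targets(n_in, n_out, depth)
W = project_to_spectral_norm(rng.standard_normal((n_out, n_in)), w_target)
print(np.linalg.norm(W, ord=2))  # matches w_target up to float error
```

Under this kind of recipe, a concrete HP parameterization would back out per-layer initialization scales and learning rates so that each optimizer's updates respect the prescribed spectral targets; the exact mapping is optimizer-dependent and is what the paper's general recipe supplies.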