[2601.08393] Controlled LLM Training on Spectral Sphere
Computer Science > Machine Learning

arXiv:2601.08393 (cs)

[Submitted on 13 Jan 2026 (v1), last revised 5 Mar 2026 (this version, v3)]

Title: Controlled LLM Training on Spectral Sphere

Authors: Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo

Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbol{\mu}$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only ``half-aligned'' with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the \textbf{Spectral Sphere Optimizer (SSO)}, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbol{\mu}$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability...
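The abstract does not spell out the update rule, but as a rough sketch of what "steepest descent on the spectral sphere" could look like in code, the following PyTorch fragment pairs a Muon-style orthogonalized update (computed here via SVD; Newton-Schulz iteration is the usual fast approximation) with a retraction that rescales the weights back to a fixed spectral norm. The function names, the sqrt(fan_out/fan_in) scaling factor, and the choice of retraction are illustrative assumptions, not the paper's algorithm.

    import torch

    def orthogonalize(g: torch.Tensor) -> torch.Tensor:
        # Replace the gradient's singular values with 1, i.e. the SVD form
        # of a Muon-style orthogonalized update direction.
        u, _, vh = torch.linalg.svd(g, full_matrices=False)
        return u @ vh

    @torch.no_grad()
    def sso_step_sketch(w: torch.Tensor, g: torch.Tensor,
                        lr: float, radius: float) -> None:
        # Hypothetical SSO-like step on one 2-D weight matrix:
        # (1) spectrally normalized update with an assumed muP-style
        #     sqrt(fan_out/fan_in) scaling for Theta(1) activation control;
        # (2) retraction onto the "spectral sphere" by rescaling the weights
        #     so their spectral norm equals `radius` (assumed constraint).
        fan_out, fan_in = w.shape
        update = orthogonalize(g) * (fan_out / fan_in) ** 0.5
        w -= lr * update
        sigma_max = torch.linalg.matrix_norm(w, ord=2)  # largest singular value
        w *= radius / sigma_max.clamp_min(1e-12)

    # Example usage on a single randomly initialized matrix.
    w = torch.randn(256, 128) / 128 ** 0.5
    g = torch.randn_like(w)
    sso_step_sketch(w, g, lr=0.02, radius=1.0)

In an actual training run the constraint would presumably be applied per module with module-specific radii, in line with the abstract's "strict module-wise spectral constraints" on both weights and updates.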