[2602.06797] Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay
Summary
This paper derives optimal learning-rate schedules (LRSs) within the functional scaling law framework, revealing a sharp phase transition between easy- and hard-task regimes and offering practical guidance for machine learning training.
Why It Matters
Understanding optimal learning-rate schedules is crucial for improving training efficiency in machine learning models, particularly in large language models (LLMs). This research offers theoretical foundations and practical insights that can enhance model performance and reduce training time.
Key Takeaways
- Optimal learning-rate schedules vary significantly between easy and hard tasks.
- Power decay and warmup-stable-decay are key strategies for effective training.
- The study provides a principled evaluation of commonly used learning-rate schedules.
- Numerical experiments validate the theoretical predictions of optimal LRSs.
- Insights from this research can guide practitioners in tuning learning rates for better model performance.
arXiv:2602.06797 [stat.ML]. Submitted on 6 Feb 2026 (v1), last revised 15 Feb 2026 (this version, v2).
Authors: Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu
Abstract: We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) structure (Hu et al. (2024)): it maintains...
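The two schedule families named in the abstract can be sketched as follows. The power-decay formula is taken directly from the stated result $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$; the WSD variant is a generic warmup-stable-decay shape (linear warmup, constant plateau, linear decay), since the abstract is truncated before the paper's exact WSD form. The warmup and decay fractions are illustrative assumptions, not values from the paper.

```python
def power_decay_lr(z: float, N: float, eta_peak: float, beta: float) -> float:
    """Easy-task regime (s >= 1 - 1/beta): power decay to zero,
    eta*(z) = eta_peak * (1 - z/N)^(2*beta - 1), for step z in [0, N]."""
    return eta_peak * (1.0 - z / N) ** (2 * beta - 1)


def wsd_lr(z: float, N: float, eta_peak: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.2,
           eta_min: float = 0.0) -> float:
    """Generic warmup-stable-decay (WSD) shape: linear warmup to eta_peak,
    constant plateau, then linear decay to eta_min. The fractions are
    illustrative placeholders, not the paper's optimal choices."""
    warmup_end = warmup_frac * N
    decay_start = (1.0 - decay_frac) * N
    if z < warmup_end:  # warmup phase
        return eta_peak * z / warmup_end
    if z < decay_start:  # stable plateau
        return eta_peak
    # linear decay phase
    return eta_peak + (eta_min - eta_peak) * (z - decay_start) / (N - decay_start)
```

For example, with $\beta = 2$ the power-decay schedule falls off as $(1 - z/N)^3$, starting at `eta_peak` and reaching exactly zero at the end of the horizon.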