[2602.06797] Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay
Summary
This paper derives optimal learning-rate schedules (LRSs) within the functional scaling law framework, revealing a sharp phase transition between easy- and hard-task regimes and offering practical guidance for machine learning training.
Why It Matters
Understanding optimal learning-rate schedules is crucial for improving training efficiency in machine learning models, particularly in large language models (LLMs). This research offers theoretical foundations and practical insights that can enhance model performance and reduce training time.
Key Takeaways
- Optimal learning-rate schedules vary significantly between easy and hard tasks.
- Power decay and warmup-stable-decay are key strategies for effective training.
- The study provides a principled evaluation of commonly used learning-rate schedules.
- Numerical experiments validate the theoretical predictions of optimal LRSs.
- Insights from this research can guide practitioners in tuning learning rates for better model performance.
arXiv:2602.06797 [stat.ML]. Submitted on 6 Feb 2026 (v1), last revised 15 Feb 2026 (this version, v2).
Authors: Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu
Abstract: We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) structure (Hu et al. (2024)): it maintains...
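The two schedule families named in the abstract can be sketched as follows. The power-decay formula is taken directly from the stated result $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$; the WSD variant is a generic warmup-stable-decay shape (linear warmup, constant plateau, linear decay), since the abstract is truncated before the paper's exact WSD form. The warmup and decay fractions are illustrative assumptions, not values from the paper.

```python
def power_decay_lr(z: float, N: float, eta_peak: float, beta: float) -> float:
    """Easy-task regime (s >= 1 - 1/beta): power decay to zero,
    eta*(z) = eta_peak * (1 - z/N)^(2*beta - 1), for step z in [0, N]."""
    return eta_peak * (1.0 - z / N) ** (2 * beta - 1)


def wsd_lr(z: float, N: float, eta_peak: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.2,
           eta_min: float = 0.0) -> float:
    """Generic warmup-stable-decay (WSD) shape: linear warmup to eta_peak,
    constant plateau, then linear decay to eta_min. The fractions are
    illustrative placeholders, not the paper's optimal choices."""
    warmup_end = warmup_frac * N
    decay_start = (1.0 - decay_frac) * N
    if z < warmup_end:  # warmup phase
        return eta_peak * z / warmup_end
    if z < decay_start:  # stable plateau
        return eta_peak
    # linear decay phase
    return eta_peak + (eta_min - eta_peak) * (z - decay_start) / (N - decay_start)
```

For example, with $\beta = 2$ the power-decay schedule falls off as $(1 - z/N)^3$, starting at `eta_peak` and reaching exactly zero at the end of the horizon.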