[2602.05298] Logarithmic-time Schedules for Scaling Language Models with Momentum
Summary
This article presents ADANA, a new AdamW-like optimizer that uses logarithmic-time schedules for its hyperparameters in large-scale language model training, delivering substantial performance gains over fixed-hyperparameter baselines.
Why It Matters
As language models grow in size and complexity, training efficiency becomes crucial. This research introduces a method that improves compute efficiency by up to 40% relative to a tuned AdamW, making it relevant for developers and researchers in machine learning and AI, particularly those focused on optimizing large models.
Key Takeaways
- The ADANA optimizer improves training efficiency by scheduling its hyperparameters in logarithmic time.
- When properly tuned, ADANA achieves up to 40% compute efficiency relative to a tuned AdamW baseline.
- Longer gradient memory horizons can enhance performance in large-scale language model training.
- Damping mechanisms are essential for maintaining stability in the new scheduling approach.
- The benefits of logarithmic-time scheduling extend to other optimizers such as AdEMAMix.
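The core idea, a gradient memory horizon that grows with training time but is kept stable by damping, can be sketched in a few lines. The functional form and the constants below are illustrative assumptions for this summary, not the paper's actual ADANA schedule:

```python
def log_time_betas(t, c=8.0, beta_max=0.999):
    """Illustrative growing-memory schedule (NOT the paper's exact rule):
    beta_t = 1 - c / (t + c), so the effective EMA averaging horizon
    1 / (1 - beta_t) grows roughly linearly with step t.  Capping at
    beta_max acts as a crude damping mechanism that stops the horizon
    from growing without bound, which is what destabilizes naive variants.
    """
    beta = 1.0 - c / (t + c)
    return min(beta, beta_max)

# The averaging horizon grows with t until the cap kicks in:
for t in (0, 10, 100, 1000, 100000):
    b = log_time_betas(t)
    print(t, b, 1.0 / (1.0 - b))
```

With these (hypothetical) constants the horizon starts near 1 step, grows with training time, and saturates at 1/(1 - beta_max) = 1000 steps once the cap is reached.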
Statistics > Machine Learning
arXiv:2602.05298 (stat)
[Submitted on 5 Feb 2026 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: Logarithmic-time Schedules for Scaling Language Models with Momentum
Authors: Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette
Abstract: In practice, the hyperparameters $(\beta_1, \beta_2)$ and weight-decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(\beta_1, \beta_2, \lambda)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves up to 40% compute efficiency relative to a tuned AdamW, w...
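Mechanically, scheduling $(\beta_1, \beta_2, \lambda)$ only requires an optimizer step that accepts them per iteration. The sketch below is a standard AdamW-style update written that way; the step rule itself is ordinary AdamW, and only the idea of feeding it time-varying hyperparameters reflects the paper. Note that the bias-correction terms use the fixed-beta formula, which is only approximate once the betas vary over time.

```python
import math

def adamw_step(p, g, m, v, t, lr, beta1, beta2, lam, eps=1e-8):
    """One AdamW-style update on a scalar parameter p.  (beta1, beta2,
    lam) are supplied per step, so they can follow any schedule, e.g.
    a log-time one.  Bias corrections assume fixed betas (approximate
    under a schedule)."""
    m = beta1 * m + (1.0 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1.0 - beta1 ** t)           # bias correction (approx.)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay: lam * p is added outside the adaptive term.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + lam * p)
    return p, m, v

# One step at t=1 with a positive gradient should decrease p.
p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1,
                     lr=0.01, beta1=0.9, beta2=0.999, lam=0.01)
```

A scheduled run would simply recompute (beta1, beta2, lam) from the step counter before each call, rather than holding them fixed.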