[2602.07712] Towards Robust Scaling Laws for Optimizers

arXiv - Machine Learning · 3 min read

Summary

This paper explores the scaling laws for various optimizers in machine learning, proposing a robust framework for comparing their performance as model size and training data increase.

Why It Matters

Understanding how different optimizers behave under scaling conditions is crucial for improving the efficiency and effectiveness of large language models. This research addresses gaps in existing studies that typically fix the optimizer, offering insights that could lead to better optimization strategies and model performance.

Key Takeaways

  • Fitting a separate Chinchilla-style scaling law per optimizer is ill-conditioned: the fitted parameters are highly correlated.
  • A more robust scaling law with shared power-law exponents and optimizer-specific rescaling factors is proposed, enabling direct comparison between optimizers.
  • Theoretical analysis shows that Chinchilla-style scaling laws can emerge from loss decomposition.
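The shared-exponent idea in the takeaways above can be sketched in a few lines. The first function is the standard Chinchilla parameterization L(N, D) = E + A/N^α + B/D^β; the second is an assumed shared-exponent variant in which every optimizer uses the same exponents (α, β) but gets its own rescaling factors (a_k, b_k). The exact functional form used in the paper is not reproduced here; this is an illustrative guess at the parameterization:

```python
def chinchilla_loss(N, D, E, A, alpha, B, beta):
    """Standard Chinchilla-style loss: L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / N**alpha + B / D**beta

def shared_exponent_loss(N, D, E, A, alpha, B, beta, a_k=1.0, b_k=1.0):
    """Hypothetical shared-exponent variant: the exponents (alpha, beta) and the
    irreducible loss E are shared across optimizers, while each optimizer k has
    its own rescaling factors (a_k, b_k) for the effective model size and data
    budget. NOTE: an assumed parameterization for illustration only; it is not
    claimed to match the paper's exact law."""
    return E + A / (a_k * N)**alpha + B / (b_k * D)**beta

# Toy comparison: an optimizer with a_k > 1 behaves as if training a larger
# model, so its predicted loss at the same (N, D) is strictly lower.
N, D = 1e8, 1e10  # parameters, training tokens (arbitrary toy values)
base = shared_exponent_loss(N, D, E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28)
better = shared_exponent_loss(N, D, E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28,
                              a_k=2.0, b_k=1.5)
assert better < base
```

Under such a form, comparing optimizers reduces to comparing their scalar rescaling factors rather than comparing entire independently fitted (and, per the paper, ill-conditioned) scaling-law fits.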

Computer Science > Machine Learning
arXiv:2602.07712 (cs)
[Submitted on 7 Feb 2026 (v1), last revised 24 Feb 2026 (this version, v2)]

Title: Towards Robust Scaling Laws for Optimizers
Authors: Alexandra Volkova, Mher Safaryan, Christoph H. Lampert, Dan Alistarh

Abstract: The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow; however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) separate Chinchilla-style scaling laws for each optimizer are ill-conditioned and have highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enable direct comparison between optimizers. Finally, 3) we provide a theoretical analysis of gradient-based methods for the proxy task of a convex quadratic objective, demonstrating that Chinchilla-style scaling laws emerge natu...
