[2602.19142] Celo2: Towards Learned Optimization Free Lunch
Summary
The paper 'Celo2: Towards Learned Optimization Free Lunch' presents a novel learned optimizer that significantly reduces the computational cost of meta-training while achieving strong performance across diverse tasks.
Why It Matters
This research addresses two limitations of existing learned optimizers: they demand extensive compute to meta-train and often fail to generalize beyond their training distribution. By proposing a far more efficient architecture and meta-training recipe, this work makes learned optimizers more practical, potentially improving the efficiency of training large models.
Key Takeaways
- Introduces a new learned optimizer that requires only 4.5 GPU hours for meta-training.
- Demonstrates strong performance on tasks significantly larger than its training distribution.
- Highlights the potential for learned optimizers to generalize better across diverse tasks.
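To make the idea concrete, here is a minimal illustrative sketch of a per-parameter learned update rule in the spirit of this line of work. The feature set (gradient and momentum), the normalization, and the tiny MLP are assumptions chosen for exposition; they are not Celo2's actual architecture, though the abstract credits a "simple normalized optimizer architecture" for stable scaling.

```python
# Illustrative sketch only: a per-parameter learned update rule.
# The features, normalization, and network are hypothetical choices,
# not the architecture described in the Celo2 paper.
import numpy as np

rng = np.random.default_rng(0)

class TinyLearnedOptimizer:
    def __init__(self, n_features=2, hidden=8):
        # Meta-parameters: a small MLP applied element-wise to each weight.
        # In a real learned optimizer these would be meta-trained.
        self.W1 = rng.normal(0, 0.1, (n_features, hidden))
        self.W2 = rng.normal(0, 0.1, (hidden, 1))
        self.m = None  # momentum buffer

    def step(self, params, grads, beta=0.9, lr=1e-2, eps=1e-8):
        if self.m is None:
            self.m = np.zeros_like(params)
        self.m = beta * self.m + (1 - beta) * grads
        # Normalize the feature vector so the rule is scale-invariant
        # across layers, one ingredient often used for stable scaling.
        feats = np.stack([grads, self.m], axis=-1)
        feats = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
        update = np.tanh(feats @ self.W1) @ self.W2
        return params - lr * update[..., 0]

# Usage: one step on a toy quadratic loss f(p) = 0.5 * ||p||^2,
# whose gradient at p is simply p.
opt = TinyLearnedOptimizer()
p = np.ones(4)
p = opt.step(p, grads=p)
```

Meta-training would then optimize `W1` and `W2` over a distribution of small tasks, which is where the 4.5 GPU-hour budget applies.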
Computer Science > Machine Learning
arXiv:2602.19142 (cs) [Submitted on 22 Feb 2026]
Title: Celo2: Towards Learned Optimization Free Lunch
Authors: Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky
Abstract: Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption: they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, yet it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO's compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL, 1.3B parameters), six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this wor...
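One component of the "modern optimization harness" the abstract mentions, decoupled weight decay, is worth unpacking: the decay is applied directly to the weights rather than folded into the gradient (the AdamW-style formulation). The SGD-with-momentum variant below is a generic sketch for context, not code from the paper.

```python
# Decoupled weight decay: the decay term acts on the weights directly,
# outside whatever transform the optimizer applies to the gradient.
# This is a generic SGD-with-momentum illustration, not Celo2 code.
import numpy as np

def sgd_decoupled_wd(params, grads, lr=1e-2, wd=1e-4,
                     momentum_buf=None, beta=0.9):
    if momentum_buf is None:
        momentum_buf = np.zeros_like(params)
    momentum_buf = beta * momentum_buf + grads
    # Two separate terms: the gradient step and the decay step.
    # Note wd never passes through the momentum buffer.
    new_params = params - lr * momentum_buf - lr * wd * params
    return new_params, momentum_buf

# Usage: with a zero gradient, only the decay acts, so each step
# shrinks the weights by the factor (1 - lr * wd).
p = np.full(3, 2.0)
p, buf = sgd_decoupled_wd(p, grads=np.zeros(3))
```

Keeping decay out of the gradient path is what lets a learned update rule (which may rescale or reshape the gradient arbitrarily) remain compatible with a fixed, well-understood regularizer.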