[2602.19142] Celo2: Towards Learned Optimization Free Lunch

arXiv - AI · 3 min read

Summary

The paper 'Celo2: Towards Learned Optimization Free Lunch' presents a novel learned optimizer that significantly reduces the computational cost of meta-training while achieving strong performance across diverse tasks.

Why It Matters

This research addresses the limitations of existing learned optimizers, which often require extensive computational resources and fail to generalize effectively. By proposing a more efficient architecture, this work opens avenues for practical applications in machine learning, potentially enhancing the efficiency of training large models.

Key Takeaways

  • Introduces a new learned optimizer that requires only 4.5 GPU hours for meta-training.
  • Demonstrates strong performance on tasks significantly larger than its training distribution.
  • Highlights the potential for learned optimizers to generalize better across diverse tasks.

Computer Science > Machine Learning · arXiv:2602.19142 (cs) · [Submitted on 22 Feb 2026]

Title: Celo2: Towards Learned Optimization Free Lunch
Authors: Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky

Abstract: Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption because they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work (VeLO) scaled meta-training to 4,000 TPU-months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, yet it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO's compute: 4.5 GPU hours, to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL, 1.3B parameters), six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this wor...
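To make the ideas in the abstract concrete, here is a minimal, hypothetical sketch of what a learned update rule of this general shape can look like: a tiny MLP maps normalized per-parameter features (gradient and momentum) to an update direction, and weight decay is applied in decoupled, AdamW-style fashion. This is an illustration only; the class, feature choices, and network here are assumptions, not the paper's actual Celo2 architecture, and the meta-learned weights are stand-ins initialized at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, eps=1e-8):
    # Normalize features per tensor so the learned rule sees scale-free inputs,
    # in the spirit of the "normalized optimizer architecture" described above.
    return x / (np.sqrt(np.mean(x * x)) + eps)

class TinyLearnedOptimizer:
    """Hypothetical learned optimizer sketch (not the paper's method)."""

    def __init__(self, hidden=8):
        # Stand-ins for meta-learned parameters; randomly initialized here.
        self.W1 = rng.normal(0.0, 0.1, (2, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.momentum = None

    def step(self, params, grads, lr=0.1, weight_decay=0.01):
        if self.momentum is None:
            self.momentum = np.zeros_like(params)
        self.momentum = 0.9 * self.momentum + 0.1 * grads

        # Per-parameter input features, normalized: shape (n_params, 2).
        feats = np.stack([normalize(grads), normalize(self.momentum)], axis=-1)

        # Small MLP produces the update direction for every parameter.
        hidden = np.tanh(feats @ self.W1)
        direction = (hidden @ self.W2).squeeze(-1)

        # Decoupled weight decay: applied to params directly, not via the gradient.
        return params - lr * direction - lr * weight_decay * params

# Usage: one step on a toy quadratic loss L(p) = 0.5 * ||p||^2, so grad = p.
opt = TinyLearnedOptimizer()
params = np.array([1.0, -2.0, 3.0])
new_params = opt.step(params, grads=params.copy())
```

The design point this sketch tries to capture is the normalization: because the MLP only ever sees scale-free features, the same small learned rule can in principle be applied to tasks far larger than anything in its meta-training distribution.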

Related Articles

Llms

[R] Reference model free behavioral discovery of AudiBench model organisms via Probe-Mediated Adaptive Auditing

Anthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviors 10-13% of the tim...

Reddit - Machine Learning · 1 min ·

[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

The problem If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an after...

Reddit - Machine Learning · 1 min ·

I have been coding for 11 years and I caught myself completely unable to debug a problem without AI assistance last month. That scared me more than anything I have seen in this industry.

I want to be honest about something that happened to me because I think it is more common than people admit. Last month I hit a bug in a ...

Reddit - Artificial Intelligence · 1 min ·

OpenClaw security checklist: practical safeguards for AI agents

Here is one of the better-quality guides on ensuring safety when deploying OpenClaw: https://chatgptguide.ai/openclaw-security-checkl...

Reddit - Artificial Intelligence · 1 min ·