[2602.17155] Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization

arXiv - Machine Learning

Summary

The paper introduces ZO-Muon, a zeroth-order optimization method that improves convergence speed and accuracy when fine-tuning large-scale models by combining low-rank subspace gradient estimation with Muon-style gradient orthogonalization.

Why It Matters

This research addresses a core limitation of zeroth-order optimization, the tension between accuracy and query efficiency. ZO methods matter because they allow large models to be fine-tuned without backpropagation, so improving their query efficiency and accuracy has significant implications for machine learning in resource-constrained environments.

Key Takeaways

  • ZO optimization offers a gradient-free alternative to first-order methods, estimating gradients from function evaluations alone for greater memory efficiency (a minimal estimator sketch follows this list).
  • The ZO-Muon method significantly reduces the number of queries needed for effective model fine-tuning.
  • Improvements in accuracy and efficiency were demonstrated on large language models and vision transformers.
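
For readers new to ZO methods, here is a minimal NumPy sketch of the generic two-point (SPSA-style) gradient estimator underlying this family of optimizers. It illustrates the general technique only, not the paper's estimator; the function names and hyperparameters are assumptions.

```python
import numpy as np

def zo_grad_estimate(f, theta, mu=1e-3, num_queries=4, rng=None):
    """Two-point zeroth-order gradient estimate (SPSA-style).

    Averages random-direction finite differences:
        g ~= mean_i [(f(theta + mu*u_i) - f(theta - mu*u_i)) / (2*mu) * u_i]
    Each direction costs two function evaluations and no backpropagation.
    """
    if rng is None:
        rng = np.random.default_rng()
    g = np.zeros_like(theta)
    for _ in range(num_queries):
        u = rng.standard_normal(theta.shape)  # random probe direction
        fd = (f(theta + mu * u) - f(theta - mu * u)) / (2 * mu)
        g += fd * u
    return g / num_queries

# Toy usage: plain ZO-SGD on a quadratic; the loss should fall toward 0.
f = lambda x: 0.5 * np.sum(x ** 2)
theta = np.ones(10)
for _ in range(200):
    theta -= 0.1 * zo_grad_estimate(f, theta)
print(f(theta))
```

Because each estimate uses only forward evaluations, memory cost stays at inference level, which is the efficiency appeal the takeaways describe; the price is estimation variance, which the paper's subspace view targets.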

Computer Science > Machine Learning
arXiv:2602.17155 (cs) [Submitted on 19 Feb 2026]

Title: Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization
Authors: Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, Sijia Liu

Abstract: Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon optimizer in the ZO setting. Extensive experiments on large language models (LLMs) and...
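
To make the abstract's two ingredients concrete, the sketch below estimates a matrix-valued ZO gradient inside a random low-rank subspace (ingredient i) and then orthogonalizes it with the Newton-Schulz iteration used by Muon-style optimizers (ingredient ii). This is a hypothetical sketch under stated assumptions, not the authors' ZO-Muon implementation; the function names, the rank, and all hyperparameters are illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the orthogonal polar factor U V^T of G via the
    odd-polynomial Newton-Schulz iteration popularized by Muon."""
    X = G / (np.linalg.norm(G) + 1e-8)   # Frobenius normalization
    a, b, c = 3.4445, -4.7750, 2.0315    # Muon's quintic coefficients
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def zo_subspace_grad(f, W, rank=4, mu=1e-3, rng=None):
    """ZO gradient estimate of f at matrix W, restricted to a random
    rank-`rank` column subspace spanned by the orthonormal matrix P."""
    if rng is None:
        rng = np.random.default_rng()
    m, n = W.shape
    P, _ = np.linalg.qr(rng.standard_normal((m, rank)))
    G_sub = np.zeros((rank, n))
    for i in range(rank):
        S = np.zeros((rank, n))
        S[i] = rng.standard_normal(n)  # probe one subspace direction
        fd = (f(W + mu * (P @ S)) - f(W - mu * (P @ S))) / (2 * mu)
        G_sub += fd * S
    return P @ G_sub  # lift back to full space: ~ P P^T grad f(W)

# Toy usage: a "low-rank Muon in the ZO setting" style update loop.
f = lambda W: 0.5 * np.sum((W - np.eye(8)) ** 2)  # minimized at W = I
W = np.zeros((8, 8))
for _ in range(300):
    G = zo_subspace_grad(f, W)
    W -= 0.05 * newton_schulz_orthogonalize(G)
print(f(W))  # should fall well below the initial loss of 4.0
```

Restricting probes to a rank-r subspace cuts the estimator's variance relative to full-dimensional probing, and orthogonalizing the resulting estimate equalizes its singular values so the update follows spectral structure rather than raw noisy magnitudes, matching the roles the abstract assigns to its two principles.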
