[2603.20632] Optimal low-rank stochastic gradient estimation for LLM training
Computer Science > Machine Learning
arXiv:2603.20632 (cs)
[Submitted on 21 Mar 2026]

Title: Optimal low-rank stochastic gradient estimation for LLM training
Authors: Zehao Li, Tao Ren, Zishi Zhang, Xi Chen, Yijie Peng

Abstract: Large language model (LLM) training is often bottlenecked by memory constraints and stochastic gradient noise in extremely high-dimensional parameter spaces. Motivated by empirical evidence that many LLM gradient matrices are effectively low-rank during training, we present an unbiased, memory-efficient, minimum-variance low-rank matrix estimator that is applicable across common stochastic gradient estimation paradigms. The core idea is to project a high-dimensional stochastic gradient estimator onto a random low-dimensional subspace and lift it back, reducing memory while keeping the estimator unbiased and controlling mean-squared error via an optimally designed projection distribution, including Haar--Stiefel projections. The projection distribution is derived by solving a constrained functional optimization problem, yielding an optimal random projector that guides algorithm design. Empirically, the resulting low-rank gradient estimators deliver both practical memory savings and improved training behavior. In RoBERTa-large fine-tuning, our method attains the lowest peak GPU memory among ...
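The project-and-lift idea described in the abstract can be sketched in a few lines of NumPy. The sketch below uses a Haar-distributed orthonormal projector (one of the projection distributions the abstract mentions) and the standard debiasing factor m/r, which makes E[(m/r) Q Qᵀ G] = G; the function names and the specific scaling are illustrative assumptions, not the paper's implementation:

```python
import numpy as np


def haar_projector(m, r, rng):
    """Sample an m x r matrix with orthonormal columns (Haar-distributed
    on the Stiefel manifold) via QR of a Gaussian matrix."""
    Z = rng.standard_normal((m, r))
    Q, _ = np.linalg.qr(Z)
    return Q


def lowrank_gradient_estimate(G, r, rng):
    """Project the gradient G (m x n) onto a random r-dimensional
    subspace and lift it back.

    Only Q (m x r) and the sketch Q^T G (r x n) need to be stored,
    so memory is O((m + n) r) rather than O(m n).  The m/r scaling
    makes the estimator unbiased: E[(m/r) Q Q^T G] = G, because
    E[Q Q^T] = (r/m) I for a Haar-distributed projector.
    """
    m = G.shape[0]
    Q = haar_projector(m, r, rng)
    sketch = Q.T @ G          # low-dimensional projection, r x n
    return (m / r) * (Q @ sketch)  # lift back to m x n
```

Averaging many independent draws of the estimator recovers G, which is a quick numerical check of unbiasedness; the paper's contribution is choosing the projection distribution to minimize the variance of a single draw.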