[2602.24283] Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Computer Science > Machine Learning
arXiv:2602.24283 (cs) [Submitted on 27 Feb 2026]

Title: Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Authors: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

Abstract: Modern optimizers such as Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, constraining scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre by pre-training Llama-family models scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of...
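The abstract's central observation, that an EMA is equivalent to online gradient descent by a linear learner, can be checked directly. The sketch below (an illustration of the equivalence, not the paper's implementation; the loss `L(m) = 0.5 * ||m - g_t||^2` and step size `1 - beta` are standard identities, not taken from the paper) shows that the EMA recursion `m_t = beta * m_{t-1} + (1 - beta) * g_t` is exactly one gradient step on the instantaneous quadratic loss:

```python
import numpy as np

# EMA as online gradient descent:
#   L(m) = 0.5 * ||m - g_t||^2  has gradient  (m - g_t),
#   so a GD step with learning rate eta = 1 - beta gives
#   m <- m - eta * (m - g_t) = beta * m + (1 - beta) * g_t,
# which is precisely the EMA momentum update.

rng = np.random.default_rng(0)
beta = 0.9
m_ema = np.zeros(4)  # momentum tracked via the EMA recursion
m_gd = np.zeros(4)   # momentum tracked via explicit gradient steps

for _ in range(100):
    g = rng.standard_normal(4)                 # incoming stochastic gradient
    m_ema = beta * m_ema + (1 - beta) * g      # standard EMA update
    m_gd = m_gd - (1 - beta) * (m_gd - g)      # GD step on L(m)

# The two trajectories are identical at every step.
assert np.allclose(m_ema, m_gd)
print("EMA and online gradient descent coincide:", m_ema)
```

Under this view, restricting the linear learner to a rank-r subspace (as LoRA-Pre does with the full momentum matrix) shrinks the optimizer state from O(n*m) to O((n+m)*r); the paper's specific decomposition and update rule are not detailed in the abstract.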