[2501.07237] Gradient Compression Beyond Low-Rank: Wavelet Subspaces Compact Optimizer States
Computer Science > Machine Learning
arXiv:2501.07237 (cs)
[Submitted on 13 Jan 2025 (v1), last revised 30 Mar 2026 (this version, v4)]

Title: Gradient Compression Beyond Low-Rank: Wavelet Subspaces Compact Optimizer States
Authors: Ziqing Wen, Ping Luo, Jiahuan Wang, Kun Yuan, Dongsheng Li, Tao Sun

Abstract: Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate memory-efficient methods beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients to significantly reduce the memory required to maintain optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training while maintaining performance. Through extensive experiments on both pr...
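To make the idea concrete, here is a minimal sketch of the kind of scheme the abstract describes: transform the gradient into a wavelet basis, keep the Adam moment buffers only in the compressed (approximation) subspace, and inverse-transform the resulting update. This uses a one-level Haar transform in NumPy for illustration; the choice of wavelet, the number of levels, the `CompressedAdam` class, and the decision to drop detail coefficients are all assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def haar_fwd(g):
    # One-level orthonormal Haar DWT along the last axis (assumes even length):
    # approximation = scaled pairwise sums, detail = scaled pairwise differences.
    a = (g[..., ::2] + g[..., 1::2]) / np.sqrt(2.0)
    d = (g[..., ::2] - g[..., 1::2]) / np.sqrt(2.0)
    return a, d

def haar_inv(a, d):
    # Inverse of haar_fwd: interleave reconstructed even/odd samples.
    out = np.empty(a.shape[:-1] + (2 * a.shape[-1],))
    out[..., ::2] = (a + d) / np.sqrt(2.0)
    out[..., 1::2] = (a - d) / np.sqrt(2.0)
    return out

class CompressedAdam:
    """Adam whose moment buffers live in the Haar approximation subspace,
    halving optimizer-state memory. A hypothetical sketch of a GWT-style
    optimizer, not the authors' implementation."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.lr, self.betas, self.eps = lr, betas, eps
        self.m = self.v = None
        self.t = 0

    def step(self, param, grad):
        a, _ = haar_fwd(grad)  # compress: keep only approximation coefficients
        if self.m is None:
            self.m, self.v = np.zeros_like(a), np.zeros_like(a)
        self.t += 1
        b1, b2 = self.betas
        # Standard Adam moment updates, but on the compressed coefficients.
        self.m = b1 * self.m + (1 - b1) * a
        self.v = b2 * self.v + (1 - b2) * a ** 2
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)
        upd_a = m_hat / (np.sqrt(v_hat) + self.eps)
        # Decompress: inverse transform with zeroed detail coefficients.
        update = haar_inv(upd_a, np.zeros_like(upd_a))
        return param - self.lr * update
```

The moment buffers `m` and `v` are half the size of the parameter tensor here, which is where the memory saving comes from; a multi-level transform would compress further, at the cost of a coarser update subspace.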