[2506.03725] Sign-SGD via Parameter-Free Optimization
Summary
This paper introduces a parameter-free variant of Sign-SGD that eliminates manual stepsize selection, improving training efficiency for large language models.
Why It Matters
As large language models become increasingly resource-intensive, optimizing their training processes is crucial. This research addresses a significant limitation in existing methods by proposing a parameter-free approach, which can lead to faster training times and reduced overhead, making advanced machine learning more accessible.
Key Takeaways
- Introduces a parameter-free Sign-SGD optimizer for efficient training.
- Eliminates the need for manual stepsize selection, reducing overhead.
- Achieves performance comparable to tuned methods while speeding up training.
- Incorporates momentum techniques and a memory-efficient variant that stores only gradient signs instead of full gradients.
- Demonstrates effectiveness on LLaMA and Swin Transformer models.
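The core update the takeaways describe can be sketched as follows. This is a generic Sign-SGD step with optional momentum, not the paper's parameter-free stepsize rule: here `lr` is a hypothetical caller-supplied value, whereas the paper's contribution is precisely to avoid choosing it by hand.

```python
import numpy as np

def sign_sgd_step(params, grads, lr, momentum=None, beta=0.9):
    """One Sign-SGD step (generic sketch, not the paper's method).

    If `momentum` is given, the sign is taken of an exponential moving
    average of gradients (the "signum" style update); otherwise the sign
    of the raw gradient is used.
    """
    if momentum is not None:
        momentum = beta * momentum + (1.0 - beta) * grads
        direction = np.sign(momentum)
    else:
        direction = np.sign(grads)
    # Each coordinate moves by exactly lr in the sign direction,
    # so the update magnitude is independent of the gradient scale.
    return params - lr * direction, momentum

params = np.array([1.0, -2.0, 0.5])
grads = np.array([0.3, -0.7, 0.0])
new_params, _ = sign_sgd_step(params, grads, lr=0.1)
```

Because the step only uses signs, storing the optimizer state as `int8` signs rather than full-precision gradients is what makes the memory-efficient variant possible.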
Computer Science > Machine Learning
arXiv:2506.03725 (cs)
[Submitted on 4 Jun 2025 (v1), last revised 20 Feb 2026 (this version, v4)]
Title: Sign-SGD via Parameter-Free Optimization
Authors: Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
Abstract: Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule),...
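As a rough illustration of the gradient-compression role the abstract mentions for distributed learning, each worker can transmit only the signs of its local gradient (one byte per coordinate) and the server can aggregate them by coordinate-wise majority vote. This is a hedged sketch of the standard sign-compression pattern; the paper's actual multi-node protocol may differ.

```python
import numpy as np

def compress(grad):
    # Worker side: send only the sign of each coordinate as int8,
    # reducing communication from 4-8 bytes/coord to 1 byte/coord.
    return np.sign(grad).astype(np.int8)

def majority_vote(sign_msgs):
    # Server side: coordinate-wise majority across workers,
    # i.e. the sign of the summed sign vectors.
    return np.sign(np.sum(np.stack(sign_msgs), axis=0))

worker_grads = [
    np.array([1.0, -2.0]),
    np.array([0.5, 0.3]),
    np.array([-0.1, -0.4]),
]
agreed = majority_vote([compress(g) for g in worker_grads])
```

The server then applies the agreed sign vector with whatever stepsize rule the optimizer prescribes, so compression and the update rule stay decoupled.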