[2506.03725] Sign-SGD via Parameter-Free Optimization
Summary
This paper introduces a parameter-free variant of Sign-SGD that eliminates manual stepsize selection, improving training efficiency for large language models.
Why It Matters
As large language models become increasingly resource-intensive, optimizing their training processes is crucial. This research addresses a significant limitation in existing methods by proposing a parameter-free approach, which can lead to faster training times and reduced overhead, making advanced machine learning more accessible.
Key Takeaways
- Introduces a parameter-free Sign-SGD optimizer for efficient training.
- Eliminates the need for manual stepsize selection, reducing overhead.
- Achieves performance comparable to tuned methods while speeding up training.
- Incorporates momentum techniques and a memory-efficient variant that stores only gradient signs instead of full gradients.
- Demonstrates effectiveness on LLaMA and Swin Transformer models.
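The core update the takeaways describe can be sketched as follows. This is a generic Sign-SGD step with optional momentum, not the paper's parameter-free stepsize rule: here `lr` is a hypothetical caller-supplied value, whereas the paper's contribution is precisely to avoid choosing it by hand.

```python
import numpy as np

def sign_sgd_step(params, grads, lr, momentum=None, beta=0.9):
    """One Sign-SGD step (generic sketch, not the paper's method).

    If `momentum` is given, the sign is taken of an exponential moving
    average of gradients (the "signum" style update); otherwise the sign
    of the raw gradient is used.
    """
    if momentum is not None:
        momentum = beta * momentum + (1.0 - beta) * grads
        direction = np.sign(momentum)
    else:
        direction = np.sign(grads)
    # Each coordinate moves by exactly lr in the sign direction,
    # so the update magnitude is independent of the gradient scale.
    return params - lr * direction, momentum

params = np.array([1.0, -2.0, 0.5])
grads = np.array([0.3, -0.7, 0.0])
new_params, _ = sign_sgd_step(params, grads, lr=0.1)
```

Because the step only uses signs, storing the optimizer state as `int8` signs rather than full-precision gradients is what makes the memory-efficient variant possible.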
Computer Science > Machine Learning
arXiv:2506.03725 (cs)
[Submitted on 4 Jun 2025 (v1), last revised 20 Feb 2026 (this version, v4)]
Title: Sign-SGD via Parameter-Free Optimization
Authors: Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
Abstract: Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule),...
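As a rough illustration of the gradient-compression role the abstract mentions for distributed learning, each worker can transmit only the signs of its local gradient (one byte per coordinate) and the server can aggregate them by coordinate-wise majority vote. This is a hedged sketch of the standard sign-compression pattern; the paper's actual multi-node protocol may differ.

```python
import numpy as np

def compress(grad):
    # Worker side: send only the sign of each coordinate as int8,
    # reducing communication from 4-8 bytes/coord to 1 byte/coord.
    return np.sign(grad).astype(np.int8)

def majority_vote(sign_msgs):
    # Server side: coordinate-wise majority across workers,
    # i.e. the sign of the summed sign vectors.
    return np.sign(np.sum(np.stack(sign_msgs), axis=0))

worker_grads = [
    np.array([1.0, -2.0]),
    np.array([0.5, 0.3]),
    np.array([-0.1, -0.4]),
]
agreed = majority_vote([compress(g) for g in worker_grads])
```

The server then applies the agreed sign vector with whatever stepsize rule the optimizer prescribes, so compression and the update rule stay decoupled.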