[2602.18002] Asynchronous Heavy-Tailed Optimization
Summary
This article explores asynchronous heavy-tailed optimization, addressing the instability that heavy-tailed gradient noise causes when training machine-learning models with asynchronous updates.
Why It Matters
Asynchronous optimization techniques are crucial in machine learning, particularly for large-scale models. This research provides insights into improving stability and performance in the presence of heavy-tailed noise, which can enhance the efficiency of training algorithms across various tasks.
Key Takeaways
- Investigates the impact of heavy-tailed stochastic gradient noise on optimization processes.
- Proposes algorithmic modifications for delay-aware learning rate scheduling and delay compensation.
- Demonstrates that the new methods match synchronous optimization rates while improving delay tolerance.
- Empirical results show superior performance in accuracy/runtime trade-offs over existing methods.
- Enhances robustness to hyperparameters in both image and language tasks.
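The two algorithmic ideas above can be sketched in a few lines. The schedule, clipping threshold, and step below are illustrative assumptions for intuition only, not the paper's actual algorithm: staleness-dependent step sizes shrink the learning rate as a gradient's delay grows, and gradient clipping is one standard remedy for heavy-tailed noise.

```python
import numpy as np

def clip_gradient(grad, threshold):
    # Norm clipping: a common way to tame heavy-tailed gradient noise.
    # (Illustrative; the paper's exact mechanism may differ.)
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

def delay_aware_lr(base_lr, delay, c=1.0):
    # Hypothetical delay-aware schedule: shrink the step size
    # as the staleness (delay) of the applied gradient grows.
    return base_lr / (1.0 + c * delay)

def async_step(params, grad, delay, base_lr=0.1, clip=1.0):
    # One asynchronous update: clip the stale gradient, then apply
    # it with a delay-adjusted learning rate.
    g = clip_gradient(grad, clip)
    return params - delay_aware_lr(base_lr, delay) * g
```

For example, a gradient that arrives with delay 3 is applied with a quarter of the base step size under this schedule, so very stale updates perturb the parameters less.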
Computer Science > Machine Learning
arXiv:2602.18002 (cs) [Submitted on 20 Feb 2026]
Title: Asynchronous Heavy-Tailed Optimization
Authors: Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li
Abstract: Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2602.18002 [cs.LG]