[2602.13112] AdaGrad-Diff: A New Version of the Adaptive Gradient Algorithm
Summary
The paper introduces AdaGrad-Diff, an adaptive gradient algorithm that improves upon the traditional AdaGrad by adjusting the stepsize based on the cumulative squared norms of gradient differences, enhancing robustness in optimization tasks.
Why It Matters
This research addresses a common challenge of gradient-based optimization in machine learning: sensitivity to the choice of stepsize, which typically requires manual tuning. By proposing a new adaptive method, it offers a way to reduce tuning effort and improve robustness when training models.
Key Takeaways
- AdaGrad-Diff adapts the stepsize using cumulative squared norms of successive gradient differences rather than of the gradients themselves.
- This avoids unnecessary stepsize damping when gradients change little across iterations.
- Numerical experiments show AdaGrad-Diff is more robust than standard AdaGrad in several practically relevant settings.
- The approach can improve training performance across machine learning tasks.
- It reduces the need for manual stepsize tuning in gradient methods.
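To make the takeaways concrete, here is a minimal sketch contrasting classic (scalar-stepsize) AdaGrad with an AdaGrad-Diff-style update, inferred only from the abstract's description: the accumulator sums squared norms of successive gradient differences instead of squared gradient norms. The initialization of the previous gradient, the hyperparameter values, and the scalar (norm-based) form of the accumulator are assumptions for illustration; the paper's exact algorithm may differ.

```python
import numpy as np

def adagrad(grad, x0, eta=1.0, eps=1e-8, iters=500):
    # Classic AdaGrad (norm version): accumulate ||g_k||^2 over iterations.
    x = x0.astype(float).copy()
    acc = eps
    for _ in range(iters):
        g = grad(x)
        acc += g @ g
        x -= eta / np.sqrt(acc) * g
    return x

def adagrad_diff(grad, x0, eta=1.0, eps=1e-8, iters=500):
    # AdaGrad-Diff sketch: accumulate ||g_k - g_{k-1}||^2 instead, so the
    # stepsize is damped only when gradients fluctuate across iterations.
    # Initializing g_prev = 0 makes the first increment ||g_0||^2, matching
    # AdaGrad's first step (a convention assumed here, not from the paper).
    x = x0.astype(float).copy()
    acc = eps
    g_prev = np.zeros_like(x, dtype=float)
    for _ in range(iters):
        g = grad(x)
        d = g - g_prev
        acc += d @ d
        x -= eta / np.sqrt(acc) * g
        g_prev = g
    return x

# Toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is simply x.
grad = lambda x: x
x0 = np.array([5.0, -3.0])
x_ada = adagrad(grad, x0)
x_diff = adagrad_diff(grad, x0)
```

On this quadratic the gradients change less and less as the iterates converge, so the Diff accumulator stops growing and the stepsize is not unnecessarily reduced, which is exactly the behavior the paper motivates.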
Statistics > Machine Learning
arXiv:2602.13112 (stat) [Submitted on 13 Feb 2026]
Authors: Matia Bojovic, Saverio Salzo, Massimiliano Pontil
Abstract: Vanilla gradient methods are often highly sensitive to the choice of stepsize, which typically requires manual tuning. Adaptive methods alleviate this issue and have therefore become widely used. Among them, AdaGrad has been particularly influential. In this paper, we propose an AdaGrad-style adaptive method in which the adaptation is driven by the cumulative squared norms of successive gradient differences rather than gradient norms themselves. The key idea is that when gradients vary little across iterations, the stepsize is not unnecessarily reduced, while significant gradient fluctuations, reflecting curvature or instability, lead to automatic stepsize damping. Numerical experiments demonstrate that the proposed method is more robust than AdaGrad in several practically relevant settings.
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Cite as: arXiv:2602.13112 [stat.ML] (arXiv:2602.13112v1 [stat.ML] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.13112