[2503.00229] Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster
Summary
The paper shows how the Armijo line-search method can provably improve the convergence rate of (stochastic) gradient descent for both convex and non-convex functions, going beyond the constant-factor gains previously known over gradient descent with the standard 1/L step-size.
Why It Matters
This research is significant as it presents a method to optimize gradient descent, a fundamental algorithm in machine learning. By improving convergence rates, it can lead to faster training times and more efficient algorithms, which is crucial for applications in AI and data science.
Key Takeaways
- Armijo line-search can significantly speed up convergence in stochastic gradient descent.
- The method adapts to local smoothness, eliminating the need to know the global smoothness constant L.
- For convex objectives such as logistic regression and multi-class classification, it achieves linear convergence, improving on the sublinear rate of gradient descent with a 1/L step-size.
- The approach is applicable to both convex and non-convex functions, enhancing its utility in various machine learning scenarios.
- Theoretical proofs support the effectiveness of this method in improving gradient descent performance.
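The Armijo backtracking rule the paper analyzes can be sketched as follows: starting from a maximal step-size, shrink it geometrically until the sufficient-decrease condition holds. This is a minimal illustrative implementation; the parameter names and defaults (c, beta, eta_max) are conventional choices, not the paper's.

```python
import numpy as np

def armijo_gd(f, grad_f, x0, c=0.5, beta=0.8, eta_max=1.0, n_iters=100):
    """Gradient descent with Armijo backtracking line-search.

    At each iteration, start from eta_max and multiply the step-size by
    beta until the Armijo sufficient-decrease condition holds:
        f(x - eta * g) <= f(x) - c * eta * ||g||^2
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        eta = eta_max
        # Backtrack until sufficient decrease is achieved.
        while f(x - eta * g) > f(x) - c * eta * np.dot(g, g):
            eta *= beta
        x = x - eta * g
    return x

# Usage: minimize the simple quadratic f(x) = 0.5 * ||x||^2.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x_star = armijo_gd(f, grad_f, x0=np.array([3.0, -4.0]), n_iters=50)
```

Because the condition is re-checked at every iteration, the accepted step-size tracks the local curvature rather than a single worst-case constant, which is the mechanism behind the paper's adaptivity claim.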
Computer Science > Machine Learning
arXiv:2503.00229 (cs)
[Submitted on 28 Feb 2025 (v1), last revised 24 Feb 2026 (this version, v4)]
Title: Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster
Authors: Sharan Vaswani, Reza Babanezhad
Abstract: Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD). For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant L and adapts to the "local" smoothness, enabling GD to converge faster. Existing theoretical analyses show that GD with Armijo-LS (GD-LS) can result in constant-factor improvements over GD with a 1/L step-size (denoted as GD(1/L)). We strengthen these results and show that if the objective function satisfies a certain non-uniform smoothness condition, GD-LS can result in a faster convergence rate than GD(1/L). In particular, we prove that for convex objectives corresponding to logistic regression and multi-class classification, GD-LS can converge to the optimum at a linear rate, and hence improves over the sublinear convergence of GD(1/L). Furthermore, for non-convex objectives satisfying gradient domination (e.g., those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), GD-LS can match the fast...
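The logistic-regression setting highlighted in the abstract can be explored with a small toy comparison between GD with a fixed 1/L step-size and GD with Armijo line-search. This is a hedged sketch on synthetic data, not a reproduction of the paper's experiments; the data, iteration count, and line-search parameters are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic binary logistic regression: labels in {-1, +1}.
A = rng.normal(size=(50, 5))
y = np.sign(A @ rng.normal(size=5))

def loss(w):
    # Average logistic loss: mean(log(1 + exp(-y_i * a_i^T w))).
    return np.mean(np.log1p(np.exp(-y * (A @ w))))

def grad(w):
    s = -y / (1.0 + np.exp(y * (A @ w)))
    return A.T @ s / len(y)

# A global smoothness constant for the logistic loss: L = ||A||_2^2 / (4n).
L = np.linalg.norm(A, 2) ** 2 / (4 * len(y))

def run(step_rule, n_iters=200):
    w = np.zeros(A.shape[1])
    for _ in range(n_iters):
        g = grad(w)
        w = w - step_rule(w, g) * g
    return loss(w)

# GD(1/L): fixed worst-case step-size.
const_step = lambda w, g: 1.0 / L

# GD-LS: Armijo backtracking from a larger initial step-size.
def armijo_step(w, g, c=0.5, beta=0.8, eta=10.0):
    while loss(w - eta * g) > loss(w) - c * eta * np.dot(g, g):
        eta *= beta
    return eta

loss_const = run(const_step)
loss_ls = run(armijo_step)
```

Both runs decrease the loss from its initial value of log 2; on problems where the loss is much flatter near the iterates than the worst-case constant L suggests, the line-search variant can take far larger steps, which is the behavior the paper formalizes.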