[2602.19510] Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon
Summary
This paper analyzes the convergence of data mixing, formulated as a bilevel optimization problem, when only a finite number of inner model-update steps are run between domain-weight updates. Under a fixed parameter update budget, it shows that fewer weight updates with longer inner horizons converge better than the greedy single-step approach.
Why It Matters
Understanding the dynamics of data mixing and weight updates is crucial for training robust models efficiently. State-of-the-art methods update domain weights after only a small number of inner steps, and this work characterizes the theoretical consequences of that approximation, guidance that can improve optimization schedules and reduce computational cost.
Key Takeaways
- Under a fixed parameter update budget, fewer domain-weight updates, each preceded by more inner training steps, can yield better convergence.
- Assuming strongly convex per-domain losses, the optimal number of inner update steps scales logarithmically with the update budget.
- The greedy practical choice of a single inner update step ($T=1$) can fail even in a simple quadratic example, so the number of inner steps must be chosen carefully.
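To see what the logarithmic scaling means for the budget split, a toy calculation (the numbers below are chosen purely for illustration, not taken from the paper) compares how many weight updates a fixed budget of N parameter updates allows for different inner-horizon lengths T:

```python
import math

# With T inner parameter steps per domain-weight update, a total budget
# of N parameter updates allows N // T weight updates.
def weight_updates(N, T):
    return N // T

N = 10_000
for T in (1, math.ceil(math.log(N)), 100):  # greedy, ~log N, long horizon
    print(f"T = {T:>3}: {weight_updates(N, T)} weight updates")
```

A logarithmic T thus preserves most of the weight updates while giving the inner problem enough steps to approach convergence, which is the trade-off the analysis formalizes.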
Computer Science > Machine Learning
arXiv:2602.19510 (cs) [Submitted on 23 Feb 2026]
Title: Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon
Authors: Rudrajit Das, Neel Patel, Meisam Razaviyayn, Vahab Mirrokni
Abstract: Data mixing--the strategic reweighting of training domains--is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $...
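The bilevel loop described in the abstract can be sketched on a toy problem. Everything below is a hypothetical illustration, not the paper's algorithm: two quadratic per-domain losses stand in for training domains, the inner loop takes $T$ gradient steps on the weighted training loss, and the outer loop adjusts the domain weights with a multiplicative-weights step toward the validation descent direction (an assumed stand-in for a hypergradient update).

```python
import numpy as np

def domain_grads(x):
    # Gradients of two quadratic per-domain losses f_i(x) = 0.5*(x - c_i)^2,
    # a hypothetical stand-in for per-domain training losses.
    centers = np.array([1.0, -1.0])
    return x - centers  # shape (2,)

def val_grad(x):
    # Validation loss 0.5*(x - 0.2)^2, so the ideal mix is asymmetric.
    return x - 0.2

def data_mixing(T, N, lr_x=0.1, lr_w=0.5):
    x = 0.0                              # model parameter (scalar toy model)
    w = np.array([0.5, 0.5])             # domain weights on the simplex
    for _ in range(N // T):              # outer rounds under budget N
        for _ in range(T):               # T inner parameter updates
            x -= lr_x * (w @ domain_grads(x))
        # Increase the weight of a domain when descending its loss also
        # descends the validation loss (positive gradient alignment).
        scores = val_grad(x) * domain_grads(x)
        w = w * np.exp(lr_w * scores)    # multiplicative-weights step
        w /= w.sum()                     # project back onto the simplex
    return x, w
```

With the asymmetric validation target above, the loop settles near the weighted minimizer $w_1 c_1 + w_2 c_2 = 0.2$, i.e. weights of roughly $(0.6, 0.4)$; varying `T` under a fixed `N` is the trade-off the paper studies.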