[2602.14280] Fast Compute for ML Optimization
Summary
The paper introduces the Scale Mixture EM (SM-EM) algorithm for optimizing machine learning losses that admit a variance-mean scale-mixture representation, demonstrating significant performance improvements over traditional methods like Adam.
Why It Matters
As machine learning models become increasingly complex, efficient optimization methods are crucial. The SM-EM algorithm offers a novel approach that reduces the need for manual tuning, potentially streamlining workflows in ML development and enhancing model performance.
Key Takeaways
- SM-EM attains up to 13x lower final loss than Adam tuned by learning-rate grid search.
- The algorithm removes the need for user-specified learning rates and momentum schedules.
- Nesterov acceleration speeds up empirical convergence but forfeits EM's monotone-descent guarantee.
- Sharing sufficient statistics across penalty values cuts runtime by 10x on a 40-point regularization path.
- The approach is particularly effective for ill-conditioned logistic regression problems.
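The summary does not reproduce the paper's exact update equations, but the core idea of a scale-mixture EM for logistic regression can be illustrated with the well-known Polya-Gamma representation, where each E-step yields latent observation weights and each M-step is a ridge-weighted least squares solve. This is a hedged sketch of that standard construction, not the paper's own SM-EM code; `sm_em_logistic` and all its parameters are illustrative names.

```python
import numpy as np

def sm_em_logistic(X, y, lam=1e-3, n_iter=200, tol=1e-10):
    """EM for L2-penalized logistic regression via the Polya-Gamma
    scale-mixture representation (an illustrative stand-in for SM-EM).
    Labels y must be in {0, 1}. Each iteration is a weighted least
    squares update, so the penalized objective is nonincreasing."""
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5  # fixed "working response" statistic X' kappa
    for _ in range(n_iter):
        psi = X @ beta
        # E-step: latent scale weights E[omega_i | psi_i]
        # = tanh(psi/2) / (2 psi), with the psi -> 0 limit 1/4.
        psi_safe = np.where(np.abs(psi) < 1e-8, 1e-8, psi)
        omega = np.where(np.abs(psi) < 1e-8, 0.25,
                         np.tanh(psi_safe / 2.0) / (2.0 * psi_safe))
        # M-step: weighted ridge least squares solve.
        H = X.T @ (omega[:, None] * X) + lam * np.eye(p)
        beta_new = np.linalg.solve(H, X.T @ kappa)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

The latent weights `omega` play the role the summary attributes to Adam's second-moment scaling, and the ridge term `lam * np.eye(p)` plays the role of AdamW-style weight decay, except both are derived from the model rather than tuned by hand.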
arXiv:2602.14280 (stat.CO)
Title: Fast Compute for ML Optimization
Authors: Nick Polson, Vadim Sokolov
Submitted on 15 Feb 2026
Abstract: We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.
Subjects: Computation (stat.CO); Machine Learning (cs.LG)
Cite as: arXiv:2602.14280 [stat.CO] (or arXiv:2602.14280v1 [stat.CO] for this version)
https://doi.org/10.48550...
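The abstract's regularization-path speedup comes from reusing sufficient statistics across penalty values. The paper's exact bookkeeping is not given here, but the idea can be sketched under a simplifying assumption: if the latent weights are held fixed, the weighted Gram matrix and working response are computed once and every penalty value on the path costs only one p-by-p solve. `ridge_path_shared_stats` is an illustrative name, not the paper's API.

```python
import numpy as np

def ridge_path_shared_stats(X, kappa, omega, lams):
    """Sketch of sharing sufficient statistics across a penalty path,
    assuming fixed latent weights omega (a simplification of the
    paper's setting). S = X' Omega X and b = X' kappa are built once;
    each penalty lam then requires only a single p x p linear solve."""
    p = X.shape[1]
    S = X.T @ (omega[:, None] * X)   # shared across all penalty values
    b = X.T @ kappa                  # shared across all penalty values
    return [np.linalg.solve(S + lam * np.eye(p), b) for lam in lams]
```

Building `S` and `b` is the O(n p^2) cost; amortizing it over a 40-point grid is the kind of saving the abstract's 10x runtime figure refers to.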