[2602.14280] Fast Compute for ML Optimization

arXiv - Machine Learning · 3 min read

Summary

The paper presents the Scale Mixture EM (SM-EM) algorithm for optimizing machine-learning losses that admit a variance-mean scale-mixture representation, demonstrating significant performance improvements over traditional tuned methods such as Adam.

Why It Matters

As machine learning models become increasingly complex, efficient optimization methods are crucial. The SM-EM algorithm offers a novel approach that reduces the need for manual tuning, potentially streamlining workflows in ML development and enhancing model performance.

Key Takeaways

  • The SM-EM algorithm achieves up to 13x lower final loss than Adam tuned by learning-rate grid search.
  • The algorithm eliminates user-specified learning-rate and momentum schedules; step sizes are derived from the model.
  • Nesterov acceleration speeds up empirical convergence, at the cost of EM's monotone-descent guarantee.
  • Sharing sufficient statistics across penalty values cuts runtime 10x on a 40-point regularization path.
  • The approach is particularly effective on ill-conditioned logistic regression problems.
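The "weighted least squares instead of a learning rate" idea behind these takeaways can be illustrated with a classical special case: logistic regression majorized by the Jaakkola–Jordan quadratic bound, where latent scale variables supply the observation weights and each iteration is a closed-form solve. This is a minimal sketch of the general mechanism, not the paper's SM-EM implementation; the function name `em_logistic`, the iteration count, and the ridge jitter are illustrative assumptions.

```python
import numpy as np

def em_logistic(X, y, n_iter=50):
    """Learning-rate-free logistic regression via iterated weighted
    least squares (Jaakkola-Jordan quadratic bound), y in {0, 1}.

    Each iteration: latent scale variables xi_i = |x_i^T beta| yield
    weights lam(xi) = tanh(xi/2) / (4 xi), and beta is updated by
    solving one weighted least-squares system -- no step size to tune.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Variational / latent scale parameters (clipped to avoid 0/0;
        # the xi -> 0 limit of lam is 1/8).
        xi = np.maximum(np.abs(X @ beta), 1e-8)
        lam = np.tanh(xi / 2.0) / (4.0 * xi)   # per-observation weights
        W = 2.0 * lam
        A = X.T @ (W[:, None] * X) + 1e-8 * np.eye(p)  # small ridge jitter
        b = X.T @ (y - 0.5)
        beta = np.linalg.solve(A, b)           # closed-form M-step
    return beta
```

Because each update maximizes a minorizing surrogate, the objective is nonincreasing, mirroring the base algorithm's monotonicity guarantee; extrapolating between successive iterates Nesterov-style would trade that guarantee for faster empirical convergence, as the third takeaway notes.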

Statistics > Computation

arXiv:2602.14280 (stat) [Submitted on 15 Feb 2026]

Title: Fast Compute for ML Optimization
Authors: Nick Polson, Vadim Sokolov

Abstract: We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.

Subjects: Computation (stat.CO); Machine Learning (cs.LG)
Cite as: arXiv:2602.14280 [stat.CO] (arXiv:2602.14280v1 [stat.CO] for this version)
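The abstract reports a $10\times$ runtime reduction from sharing sufficient statistics across the 40 penalty values of a regularization path. One way to see why this helps, assuming the inner problem is a (weighted) least-squares solve as in an EM M-step: the statistics $A = X^\top W X$ and $b = X^\top W z$ do not depend on the penalty, so a single eigendecomposition of $A$ can be reused for every penalty value. The sketch below illustrates this for a ridge path; it is an assumption-laden illustration, not the paper's implementation.

```python
import numpy as np

def ridge_path(A, b, lambdas):
    """Solve (A + lam * I) beta = b for each penalty lam, reusing one
    eigendecomposition of the shared sufficient statistic A.

    A must be symmetric positive semidefinite (e.g. A = X^T W X).
    Cost: one eigh, then O(p^2) per penalty value instead of a fresh
    O(p^3) factorization for each point on the path.
    """
    w, V = np.linalg.eigh(A)   # A = V diag(w) V^T, computed once
    c = V.T @ b                # rotate b into the eigenbasis, once
    return [V @ (c / (w + lam)) for lam in lambdas]
```

For a long path this amortizes the expensive factorization across all penalty values, which is the flavor of saving the abstract attributes to shared sufficient statistics.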
