[2602.21798] Excitation: Momentum For Experts
Summary
The paper introduces Excitation, a novel optimization framework aimed at enhancing learning in sparse architectures like Mixture-of-Experts (MoEs) by dynamically modulating updates based on expert utilization.
Why It Matters
Excitation addresses limitations of traditional optimizers in deep learning models, particularly in MoEs, improving convergence speed and final model performance. The framework is significant for researchers and practitioners seeking to optimize sparse architectures effectively, especially in resource-constrained environments.
Key Takeaways
- Excitation enhances learning in sparse architectures by modulating updates based on expert utilization.
- It resolves the 'structural confusion' issue in MoEs, allowing for stable training.
- The framework is optimizer-, domain-, and model-agnostic, requiring minimal integration effort.
- Excitation improves convergence speed and final performance across various tasks.
- It introduces no additional per-parameter states, making it suitable for memory-constrained settings.
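The core mechanism, modulating each expert's update by its batch-level utilization, can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual algorithm: the function name, the amplify/suppress factors, and the uniform-routing threshold are all assumptions for the example.

```python
import numpy as np

def excitation_scale(expert_grads, utilization, amplify=1.5, suppress=0.5):
    """Hypothetical sketch: scale per-expert gradients by batch utilization.

    Experts routed more tokens than the uniform-routing baseline get
    amplified updates; rarely-used experts are damped. The actual
    modulation rule in the paper may differ.
    """
    utilization = np.asarray(utilization, dtype=float)
    frac = utilization / utilization.sum()   # fraction of batch tokens per expert
    threshold = 1.0 / len(utilization)       # uniform-routing baseline share
    scales = np.where(frac >= threshold, amplify, suppress)
    return [g * s for g, s in zip(expert_grads, scales)], scales
```

Because the scaling is a stateless function of batch statistics, it adds no per-parameter optimizer state and can wrap any base optimizer's gradients, consistent with the memory-efficiency claim above.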
Computer Science > Machine Learning
arXiv:2602.21798 (cs)
[Submitted on 25 Feb 2026]
Title: Excitation: Momentum For Experts
Authors: Sagi Shaier
Abstract: We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoEs). Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly-utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, "rescuing" these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings. Across language and vision tasks, Excitation consistently improves convergence speed and final performance in MoE models, indicating that active update modulation is a key mechanism for effective conditional computation.
Subjects: Machine Learning