[2604.04230] Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
Computer Science > Machine Learning

arXiv:2604.04230 (cs)

[Submitted on 5 Apr 2026]

Title: Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training

Authors: Charafeddine Mouzouni

Abstract: We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, which quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase in which the router learns to balance load (gamma_eff rises from 14 to 36-39, peaking around steps 30K-40K), a stabilization phase in which experts specialize under steady balance (B_0 drifts from 2.4 to 2.3 over steps 100K-400K), and a relaxation phase in which the router trades balance for quality as experts differentiate (gamma_eff falls from 27 to 9 over steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, shows that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). Th...
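The reduction of the single-type equilibrium to a temperature-scaled softmax can be illustrated with a toy sketch. The code below is a hypothetical illustration, not the paper's derivation: it assumes the congestion coefficient acts as an inverse temperature on the router logits, and it measures expert load as the mean routing probability per expert.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaled softmax over expert logits (rows = tokens)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def expert_load(probs: np.ndarray) -> np.ndarray:
    """Soft load estimate: average routing probability assigned to each expert."""
    return probs.mean(axis=0)

# Toy setup: 1000 tokens routed over 8 experts with random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))

# Lower temperature (sharper routing) concentrates load; higher temperature
# (softer routing) spreads it more evenly across experts.
for T in (0.5, 1.0, 2.0):
    load = expert_load(temperature_softmax(logits, T))
    print(f"T={T}: load={np.round(load, 3)}")
```

Under this toy mapping, sweeping the temperature traces the same balance-quality tradeoff the abstract attributes to gamma_eff: softer routing balances load at the cost of sharp expert preferences, and vice versa.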