[2603.21862] Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
Computer Science > Machine Learning
arXiv:2603.21862 (cs)
[Submitted on 23 Mar 2026]

Title: Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
Authors: Weilin Wan, Jingtao Han, Weizhong Zhang, Cheng Jin

Abstract: Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models ...
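As a rough illustration of why FLOPs per token alone can be an inadequate fairness metric, the sketch below compares two MoE configurations that match in FLOPs per token yet differ sharply in total parameters. The configuration values and the parameter/FLOP estimates are hypothetical back-of-the-envelope conventions chosen for illustration, not the paper's actual accounting.

# Back-of-the-envelope MoE accounting (illustrative assumptions, not the
# paper's definitions): per layer, attention uses 4*d^2 parameters, a gated
# expert FFN uses 3*d*d_ff parameters, and dense matmul FLOPs per token are
# approximated as 2 * (active parameters).
from dataclasses import dataclass

@dataclass
class MoEConfig:
    layers: int       # number of transformer blocks
    d_model: int      # hidden dimension
    d_ff: int         # expert FFN inner dimension
    n_experts: int    # experts per MoE layer
    top_k: int        # experts routed per token

    def attn_params(self) -> int:
        return 4 * self.d_model ** 2                 # Q, K, V, O projections

    def per_expert_params(self) -> int:
        return 3 * self.d_model * self.d_ff          # gated FFN: up, gate, down

    def total_params(self) -> int:
        return self.layers * (self.attn_params()
                              + self.n_experts * self.per_expert_params())

    def active_params(self) -> int:
        return self.layers * (self.attn_params()
                              + self.top_k * self.per_expert_params())

    def flops_per_token(self) -> int:
        return 2 * self.active_params()              # ignores the attention-over-sequence term

# Two hypothetical configurations with identical FLOPs per token but very
# different total parameter counts: matching compute alone does not make
# the comparison fair, hence the joint constraint triad.
a = MoEConfig(layers=24, d_model=2048, d_ff=5632, n_experts=8,  top_k=2)
b = MoEConfig(layers=24, d_model=2048, d_ff=5632, n_experts=64, top_k=2)
for name, cfg in (("A", a), ("B", b)):
    print(name, cfg.flops_per_token(), cfg.active_params(), cfg.total_params())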