[2603.21862] Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
Computer Science > Machine Learning
arXiv:2603.21862 (cs)
[Submitted on 23 Mar 2026]

Title: Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
Authors: Weilin Wan, Jingtao Han, Weizhong Zhang, Cheng Jin

Abstract: Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models ...
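As a rough illustration of why FLOPs per token alone can be an inadequate fairness metric, the sketch below compares two MoE configurations that match in FLOPs per token yet differ sharply in total parameters. The configuration values and the parameter/FLOP estimates are hypothetical back-of-the-envelope conventions chosen for illustration, not the paper's actual accounting.

# Back-of-the-envelope MoE accounting (illustrative assumptions, not the
# paper's definitions): per layer, attention uses 4*d^2 parameters, a gated
# expert FFN uses 3*d*d_ff parameters, and dense matmul FLOPs per token are
# approximated as 2 * (active parameters).
from dataclasses import dataclass

@dataclass
class MoEConfig:
    layers: int       # number of transformer blocks
    d_model: int      # hidden dimension
    d_ff: int         # expert FFN inner dimension
    n_experts: int    # experts per MoE layer
    top_k: int        # experts routed per token

    def attn_params(self) -> int:
        return 4 * self.d_model ** 2                 # Q, K, V, O projections

    def per_expert_params(self) -> int:
        return 3 * self.d_model * self.d_ff          # gated FFN: up, gate, down

    def total_params(self) -> int:
        return self.layers * (self.attn_params()
                              + self.n_experts * self.per_expert_params())

    def active_params(self) -> int:
        return self.layers * (self.attn_params()
                              + self.top_k * self.per_expert_params())

    def flops_per_token(self) -> int:
        return 2 * self.active_params()              # ignores the attention-over-sequence term

# Two hypothetical configurations with identical FLOPs per token but very
# different total parameter counts: matching compute alone does not make
# the comparison fair, hence the joint constraint triad.
a = MoEConfig(layers=24, d_model=2048, d_ff=5632, n_experts=8,  top_k=2)
b = MoEConfig(layers=24, d_model=2048, d_ff=5632, n_experts=64, top_k=2)
for name, cfg in (("A", a), ("B", b)):
    print(name, cfg.flops_per_token(), cfg.active_params(), cfg.total_params())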