[2604.01622] Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
Computer Science > Machine Learning
arXiv:2604.01622 (cs)
[Submitted on 2 Apr 2026]

Title: Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
Authors: Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu

Abstract: Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pre...
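As a rough illustration of the mechanism the abstract describes (not the authors' implementation), expert-choice routing can be sketched as follows: each expert selects its top-capacity tokens by router score, so every expert's load equals its capacity by construction, and that capacity can be varied per denoising step. The function and parameter names (`expert_choice_route`, `timestep_capacity`, `alpha`) are our own placeholders, and the capacity schedule is a hypothetical one that simply grows as the mask ratio shrinks.

```python
import random

def expert_choice_route(scores, capacity):
    """Expert-choice routing: each expert picks its top-`capacity` tokens
    by router score, so every expert processes exactly `capacity` tokens --
    balanced by design, unlike token-choice routing where popular experts
    can overflow.

    scores: list of per-token score lists, shape [num_tokens][num_experts].
    Returns a 0/1 dispatch mask of the same shape.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    mask = [[0] * num_experts for _ in range(num_tokens)]
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: -scores[t][e])
        for t in ranked[:capacity]:
            mask[t][e] = 1
    return mask

def timestep_capacity(mask_ratio, base_capacity, alpha=1.0):
    """Hypothetical timestep-dependent schedule: allocate more capacity at
    low-mask-ratio (late) denoising steps, where the abstract reports the
    highest learning efficiency per token."""
    return max(1, round(base_capacity * (1.0 + alpha * (1.0 - mask_ratio))))

random.seed(0)
scores = [[random.random() for _ in range(4)] for _ in range(8)]  # 8 tokens, 4 experts
cap = timestep_capacity(mask_ratio=0.2, base_capacity=2)  # low mask ratio -> larger cap
mask = expert_choice_route(scores, cap)
loads = [sum(mask[t][e] for t in range(8)) for e in range(4)]
print(loads)  # every expert carries exactly `cap` tokens
```

Because each expert independently takes its top `capacity` tokens, the per-expert load is deterministic regardless of the score distribution; only the total compute changes as `timestep_capacity` varies across denoising steps.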