[2602.15014] Scaling Beyond Masked Diffusion Language Models
Summary
This paper presents a scaling-law study of discrete diffusion language models, showing that masked diffusion can be trained more FLOPs-efficiently and that other diffusion families remain competitive with it despite worse perplexity.
Why It Matters
The findings challenge the prevailing view that masked diffusion is categorically the best discrete diffusion approach to language modeling. By demonstrating that training efficiency and practical sampling speed can outweigh raw perplexity scores, this research opens new avenues for model development and application in natural language processing.
Key Takeaways
- Masked diffusion models become roughly 12% more FLOPs-efficient when trained with a simple cross-entropy objective.
- Perplexity may mislead comparisons across different diffusion families.
- Uniform-state diffusion models outperform autoregressive models on specific benchmarks despite worse perplexity.
- The study provides code and resources for further exploration of the findings.
- Scaled to 1.7B parameters, uniform-state diffusion remains competitive with the other methods.
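The cross-entropy objective mentioned in the takeaways can be illustrated in a toy form: corrupt a random fraction t of the tokens with a mask symbol, then score the model's predictions with cross-entropy on the masked positions only, and exponentiate the loss to obtain a perplexity. This is a minimal sketch with random logits standing in for a real network; the function name, shapes, and the absence of the paper's exact loss weighting are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_diffusion_ce_loss(tokens, logits, mask_id, t):
    """Toy masked-diffusion training step (illustrative, not the paper's code):
    mask a fraction t of positions and average cross-entropy over them."""
    masked = rng.random(len(tokens)) < t
    if not masked.any():
        return 0.0
    # The corrupted sequence a real model would receive as input;
    # here it is shown only for illustration, since logits are random.
    noisy = np.where(masked, mask_id, tokens)
    # Log-softmax over the vocabulary, then NLL of the true token.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(tokens)), tokens]
    return float(nll[masked].mean())

vocab, seq_len = 50, 16
tokens = rng.integers(0, vocab, size=seq_len)
logits = rng.normal(size=(seq_len, vocab))   # stand-in for model outputs
loss = masked_diffusion_ce_loss(tokens, logits, mask_id=vocab, t=0.5)
ppl = float(np.exp(loss))                    # perplexity on masked positions
print(loss, ppl)
```

Because perplexity is just the exponentiated average loss under a particular corruption process, two diffusion families with different corruption processes produce numbers that are not directly comparable, which is the sense in which perplexity can mislead across families.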
arXiv:2602.15014 (cs.LG), submitted 16 Feb 2026
Title: Scaling Beyond Masked Diffusion Language Models
Authors: Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic
Abstract: Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likel...