[2512.10858] Scaling Behavior of Discrete Diffusion Language Models
Summary
This article explores the scaling behavior of discrete diffusion language models (DLMs) compared to autoregressive language models (ALMs), showing that DLM scaling depends strongly on the noise type and on training hyperparameters such as batch size and learning rate.
Why It Matters
Understanding the scaling behavior of DLMs is crucial because they present a potential alternative to ALMs for language model pre-training. This research highlights the data efficiency of uniform diffusion models, which may influence future developments in machine learning and AI applications.
Key Takeaways
- DLMs require different amounts of data and compute compared to ALMs.
- Uniform diffusion models show greater efficiency in data-bound settings.
- Scaling behavior varies significantly with noise types in DLMs.
- The study confirms that uniform diffusion can be scaled up effectively.
- Understanding these models can lead to advancements in language model training.
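Scaling behavior of the kind discussed here is usually summarized by fitting a power law to loss-versus-compute measurements. The sketch below uses synthetic, illustrative coefficients (not values from the paper) to show how such a fit recovers the exponent and prefactor from log-log linear regression:

```python
import math

# Hypothetical power-law scaling law: loss(C) = a * C**(-b).
# The coefficients are illustrative placeholders, not values from the paper.
a_true, b_true = 20.0, 0.05

compute = [10.0 ** e for e in range(18, 25)]        # FLOPs budgets
loss = [a_true * c ** (-b_true) for c in compute]   # synthetic losses

# In log-log space the power law is linear: log L = log a - b * log C,
# so an ordinary least-squares line recovers (a, b) from the data.
xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
b_fit = -slope                                      # scaling exponent
a_fit = math.exp(y_mean - slope * x_mean)           # prefactor
```

Comparing such fitted exponents across model families (DLMs with different noise types versus ALMs) is what makes scaling behavior a "key distinguishing factor" between architectures.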
Computer Science > Machine Learning
arXiv:2512.10858 (cs)
[Submitted on 11 Dec 2025 (v1), last revised 15 Feb 2026 (this version, v3)]
Title: Scaling Behavior of Discrete Diffusion Language Models
Authors: Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto
Abstract: Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, ...
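The abstract describes smoothly interpolating between masked and uniform diffusion noise. One simple way to realize such an interpolation in the forward (noising) process is to corrupt each token with some probability and, when corrupting, choose between the mask token and a uniformly random token according to a mixing weight. This parameterization is an illustrative assumption for exposition, not the paper's exact formulation:

```python
import random

MASK_ID = 0          # hypothetical mask-token id (assumption)
VOCAB_SIZE = 100     # hypothetical vocabulary size (assumption)

def corrupt(tokens, p, lam, rng=random):
    """Forward-noising sketch: each token is corrupted with probability p.

    A corrupted token becomes MASK_ID with probability lam, and otherwise a
    uniformly random vocabulary token. lam = 1 recovers masked diffusion,
    lam = 0 recovers uniform diffusion; intermediate values interpolate
    between the two noise types.
    """
    out = []
    for t in tokens:
        if rng.random() < p:              # corrupt this position?
            if rng.random() < lam:        # masked-style corruption
                out.append(MASK_ID)
            else:                         # uniform-style corruption
                out.append(rng.randrange(VOCAB_SIZE))
        else:
            out.append(t)                 # keep the clean token
    return out
```

For example, `corrupt(seq, p=1.0, lam=1.0)` masks every position (pure masked diffusion), while `lam=0.0` replaces corrupted positions with random tokens (pure uniform diffusion), making the noise type a single continuous knob to sweep in scaling experiments.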