[2510.17206] Soft-Masked Diffusion Language Models
Computer Science > Machine Learning
arXiv:2510.17206 (cs)
[Submitted on 20 Oct 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: Soft-Masked Diffusion Language Models
Authors: Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, Abbas Rahimi

Abstract: Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that efficiently adapts masked diffusion language models t...
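
As a rough illustration of the blending step described in the abstract, the following PyTorch sketch replaces the input embedding of each retained mask with a mix of the mask embedding and a probability-weighted average of the top-k token embeddings from the previous decoding step. The function name, the fixed blending weight `alpha`, and the renormalization over the top-k probabilities are assumptions for illustration; the abstract states only that the blend is dynamic, without giving the exact formula.

```python
import torch
import torch.nn.functional as F

def soft_mask_embeddings(prev_logits, embedding, mask_emb, k=8, alpha=0.5):
    """Blend the [MASK] embedding with top-k predicted token embeddings.

    prev_logits: (num_masked, vocab_size) logits for the retained masked
                 positions, taken from the previous decoding step
    embedding:   (vocab_size, d_model) token embedding matrix
    mask_emb:    (d_model,) embedding of the mask token
    k:           number of top predictions to blend in
    alpha:       blending weight (hypothetical; the paper blends dynamically)
    """
    probs = F.softmax(prev_logits, dim=-1)
    topk_p, topk_idx = probs.topk(k, dim=-1)                 # (num_masked, k)
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)       # renormalize over top-k
    topk_emb = embedding[topk_idx]                           # (num_masked, k, d_model)
    pred_emb = (topk_p.unsqueeze(-1) * topk_emb).sum(dim=1)  # weighted average
    return alpha * mask_emb + (1.0 - alpha) * pred_emb       # soft-masked inputs

# Toy usage: 5 retained masks over a vocabulary of 100 tokens, d_model = 16.
prev_logits = torch.randn(5, 100)
embedding = torch.randn(100, 16)
mask_emb = torch.randn(16)
soft_inputs = soft_mask_embeddings(prev_logits, embedding, mask_emb)
print(soft_inputs.shape)  # torch.Size([5, 16])
```

These blended embeddings would then be fed back in place of the plain mask embedding at the next decoding step, which is how partial information about masked tokens can propagate across steps.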