[2601.19657] One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Summary
The paper introduces a 'sink token' for Diffusion Language Models (DLMs): a dedicated token that stabilizes the attention mechanism, improving model performance and robustness.
Why It Matters
As DLMs gain traction for their parallel text generation capabilities, addressing their inherent instability is crucial for advancing AI language models. This research provides a practical solution that could lead to more reliable applications in natural language processing.
Key Takeaways
- Diffusion Language Models face instability due to the moving sink phenomenon.
- Introducing a dedicated sink token can stabilize attention mechanisms.
- The effectiveness of the sink token is independent of its position in the sequence.
- This approach enhances the robustness of DLMs in text generation tasks.
- The proposed method is simple yet significantly improves model performance.
Computer Science > Computation and Language
arXiv:2601.19657 (cs) [Submitted on 27 Jan 2026 (v1), last revised 20 Feb 2026 (this version, v3)]
Title: One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Authors: Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Shaosheng Cao
Abstract: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effe…
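The modified attention mask described in the abstract can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes the extra sink token is prepended at position 0 (the paper suggests its position does not matter), that the base model uses full bidirectional attention as is typical for DLMs, and that `True` in the mask means "query may attend to key". The function name is illustrative.

```python
import numpy as np

def sink_attention_mask(seq_len: int) -> np.ndarray:
    """Boolean attention mask for a sequence with one extra sink token.

    Index 0 is the sink token (placement here is an assumption for
    illustration). Entry [i, j] is True if query i may attend to key j.
    """
    n = seq_len + 1                      # sequence plus the extra sink token
    mask = np.ones((n, n), dtype=bool)   # bidirectional attention by default
    mask[0, :] = False                   # the sink attends to nothing else...
    mask[0, 0] = True                    # ...only to itself
    # Column 0 stays True: the sink remains globally visible to all tokens.
    return mask

mask = sink_attention_mask(4)
# Row 0 is one-hot (sink sees only itself); column 0 is all True
# (every token can sink attention mass onto it).
```

In practice such a boolean mask would be converted to additive form (0 where allowed, a large negative value where masked) before being added to the attention logits, so the softmax assigns zero weight to disallowed pairs.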