[2602.23225] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Summary
This paper investigates why Diffusion Language Models (DLMs) often default to autoregressive decoding instead of utilizing their potential for parallel token generation. It proposes a new approach, NAP, that aligns training data with non-autoregressive decoding for improved performance.
Why It Matters
Understanding the limitations of DLMs in parallel decoding is crucial for advancing natural language processing technologies. This research highlights the importance of data alignment in model training, which could lead to more efficient language generation methods and better utilization of computational resources.
Key Takeaways
- DLMs often exhibit autoregressive behavior due to training data structure.
- Non-autoregressive generation can significantly reduce latency and improve performance.
- The proposed NAP approach enhances parallel decoding by aligning supervision with model capabilities.
- Performance gains increase with the level of parallelism in decoding.
- Revisiting training data and supervision methods is essential for optimizing DLMs.
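The contrast between AR-like and truly parallel decoding can be sketched as a toy unmasking loop. This is an illustrative sketch only: the helper names, the random "confidence" scores, and the dummy token predictions are assumptions for demonstration, not the paper's actual NAP implementation.

```python
# Toy sketch: confidence-based unmasking in a masked diffusion decoder.
# k=1 mimics the AR-like one-token-per-step dynamics the paper describes;
# k>1 forces multi-token parallel updates, reducing the number of
# sequential decoding steps.
import random

random.seed(0)
SEQ_LEN = 16
MASK = None  # placeholder for a still-masked position


def model_confidences(seq):
    """Stand-in for a DLM forward pass: returns a pseudo-confidence
    score for each masked position (random here, purely illustrative)."""
    return {i: random.random() for i, tok in enumerate(seq) if tok is MASK}


def decode(k):
    """Unmask the k most confident positions per step; returns the
    number of sequential steps needed to fill the whole sequence."""
    seq = [MASK] * SEQ_LEN
    steps = 0
    while any(tok is MASK for tok in seq):
        conf = model_confidences(seq)
        top = sorted(conf, key=conf.get, reverse=True)[:k]
        for i in top:
            seq[i] = f"tok{i}"  # commit a dummy token prediction
        steps += 1
    return steps


print(decode(k=1))  # one token per step: 16 sequential steps
print(decode(k=4))  # four tokens per step: 4 sequential steps
```

Under this toy schedule, latency scales as sequence length divided by the parallelism level k, which is the latency-scaling benefit of non-AR decoding that the paper highlights.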
arXiv:2602.23225 [cs.CL] (Computation and Language)
Submitted on 26 Feb 2026
Authors: Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
Abstract
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math re...