[2511.19269] CDLM: Consistency Diffusion Language Models For Faster Sampling
Summary
The paper introduces Consistency Diffusion Language Models (CDLM), a method that accelerates inference in diffusion language models by reducing sampling steps and enabling KV caching, achieving significant latency improvements while maintaining accuracy.
Why It Matters
As language models become increasingly integral to various applications, optimizing their performance is crucial. CDLM addresses key bottlenecks in inference speed, making it a significant advancement for developers and researchers working with generative AI and natural language processing.
Key Takeaways
- CDLM reduces the number of sampling steps required in diffusion language models.
- The method allows for compatibility with KV caching, enhancing efficiency.
- Experiments show 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks.
- The approach integrates consistency modeling for better performance.
- Full training and evaluation code is made available for further research.
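To illustrate why multi-token finalization reduces step count, here is a hedged sketch of block-by-block diffusion decoding. The function names and the block schedule (`denoise_step`, `steps_per_block`) are hypothetical illustrations, not the paper's actual implementation: a standard DLM may need on the order of one refinement step per token in a block, while a consistency-trained model can finalize several tokens per step, shrinking `steps_per_block`.

```python
def generate(denoise_step, prompt, num_blocks, block_size, steps_per_block):
    """Block-by-block diffusion decoding (illustrative sketch).

    denoise_step(context, block) returns a refined copy of the block,
    possibly finalizing several masked tokens at once. With consistency
    training, steps_per_block can be far smaller than block_size,
    which is the source of the sampling speedup.
    """
    tokens = list(prompt)
    for _ in range(num_blocks):
        block = ["<mask>"] * block_size          # start from a fully masked block
        for _ in range(steps_per_block):
            block = denoise_step(tokens, block)  # parallel refinement over the block
        tokens += block  # block finalized; its keys/values never change afterward
    return tokens
```

With one consistency step per block (`steps_per_block=1`), a block of N tokens is produced in a single model call instead of roughly N calls.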
Computer Science > Machine Learning
arXiv:2511.19269 (cs)
[Submitted on 24 Nov 2025 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: CDLM: Consistency Diffusion Language Models For Faster Sampling
Authors: Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami
Abstract: Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at this https URL.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2511.19269 [cs.LG] (or arXiv:2511.19269v2 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2511.19269
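The block-wise causal attention mask mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general masking pattern, assuming tokens attend bidirectionally within their own block and causally to earlier blocks; the function name and interface are hypothetical, not taken from the paper's code.

```python
def block_causal_mask(seq_len, block_size):
    """Build a block-wise causal attention mask.

    mask[i][j] is True iff token i may attend to token j: j's block
    must not come after i's block. Tokens attend bidirectionally
    inside their own block and causally to all earlier blocks.
    Because a finished block can no longer see future tokens, its
    keys/values are fixed and can be stored in a standard KV cache.
    """
    block = [i // block_size for i in range(seq_len)]  # block index per position
    return [[block[j] <= block[i] for j in range(seq_len)]
            for i in range(seq_len)]
```

For example, with `seq_len=6, block_size=2`, token 0 attends to token 1 (same block) but not to tokens 2-5 (future blocks), while token 5 attends to everything. A strictly causal mask is the special case `block_size=1`.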