[2510.08233] Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization
Summary
This paper presents Distribution Matching Policy Optimization (DMPO), a novel reinforcement learning method aimed at enhancing reasoning in diffusion large language models (dLLMs), reporting accuracy gains of up to 54.3% over state-of-the-art baselines.
Why It Matters
As diffusion LLMs emerge as viable alternatives to traditional autoregressive models, optimizing their reasoning capabilities is crucial for advancing AI applications. This research addresses a gap in reinforcement learning techniques tailored for dLLMs, potentially leading to more efficient and effective AI systems.
Key Takeaways
- DMPO enhances reasoning capabilities in diffusion LLMs through a unique reinforcement learning approach.
- The method achieves up to 54.3% accuracy improvement over state-of-the-art baselines.
- DMPO addresses challenges associated with small training batch sizes effectively.
- The research highlights distribution matching as a principled objective for RL fine-tuning of dLLMs.
- Code for DMPO is publicly available, promoting further research and application.
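The small-batch challenge noted above is typically addressed by subtracting a baseline from the per-sample weights to reduce gradient variance. The sketch below illustrates the general idea with a batch-mean baseline in a weighted cross-entropy loss; the function name, the weight semantics, and the choice of baseline are illustrative assumptions, not the paper's exact construction.

```python
def weighted_ce_loss(log_probs, weights, subtract_baseline=True):
    """Weighted cross-entropy with optional baseline subtraction.

    Illustrative sketch only: `weights` stands in for per-sequence
    importance weights induced by a reward-tilted target, and the
    batch-mean baseline is the standard variance-reduction device
    (as in REINFORCE); DMPO's actual baseline may differ.
    """
    if subtract_baseline:
        base = sum(weights) / len(weights)  # batch-mean baseline
        weights = [w - base for w in weights]
    # Average of -weight * log-prob over the batch: higher-weight
    # samples pull the policy toward their sequences more strongly.
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

With uniform weights, the baseline cancels the gradient signal entirely, which is exactly the degenerate small-batch case the subtraction is meant to stabilize.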
Computer Science > Machine Learning
arXiv:2510.08233 (cs)
[Submitted on 9 Oct 2025 (v1), last revised 22 Feb 2026 (this version, v2)]
Title: Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization
Authors: Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen
Abstract: Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms well-suited to dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supe...
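The abstract's "optimal, reward-tilted" target and cross-entropy matching can be read in the standard RL-as-inference form. The following sketch is an assumption based on common reward-tilted formulations (with reference policy $\pi_{\mathrm{ref}}$, reward $r$, and temperature $\beta$), not the paper's exact objective:

```latex
% Reward-tilted target: the reference policy reweighted by reward
\pi^*(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big)

% Distribution matching via cross-entropy: fit the dLLM policy
% \pi_\theta to the tilted target \pi^*
\min_\theta \;\; \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi^*(\cdot \mid x)}
    \big[-\log \pi_\theta(y \mid x)\big]
```

Minimizing this cross-entropy is equivalent to minimizing the forward KL divergence $\mathrm{KL}(\pi^* \,\|\, \pi_\theta)$, which is one principled sense in which the policy "matches" the reward-tilted distribution.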