[2507.08838] wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Summary
The paper presents wd1, a novel approach for optimizing reasoning in diffusion language models using reinforcement learning, demonstrating significant accuracy improvements with lower computational costs.
Why It Matters
As large language models (LLMs) become increasingly integral to AI applications, enhancing their reasoning capabilities is crucial. This research addresses the computational challenges of reinforcement learning for diffusion-based LLMs, offering a more efficient method that could broaden their applications and improve performance on reasoning tasks.
Key Takeaways
- Introduces wd1, a ratio-free policy optimization method for diffusion large language models (dLLMs).
- Achieves up to 59% improvement in accuracy over previous methods.
- Reduces computational overhead associated with traditional RL approaches.
- Extends to wd1++, achieving state-of-the-art performance on math tasks.
- Establishes theoretical soundness via an interpretation as energy-guided discrete diffusion training combined with negative-sample unlearning.
Computer Science > Machine Learning, arXiv:2507.08838 (cs)
[Submitted on 7 Jul 2025 (v1), last revised 14 Feb 2026 (this version, v2)]
Title: wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Authors: Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic
Abstract: Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the likelihoods of the current, old, and reference policies at each policy-optimization step. This reliance introduces additional computational overhead and can lead to large variance and estimation error in the RL objective, particularly when computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy-optimization approach that reformulates the RL objective as a weighted log-likelihood, requiring only a single approximation, of the current parametrized policy's likelihood. We formally show that our proposed method can be interpreted as energy-guided discrete diffusion training combined with negative-sample unlearning, thereby confirming its theoretical soundness. In experiments on the LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) while req...
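To make the abstract's core idea concrete, the sketch below contrasts the ratio-free objective with the need it removes: instead of computing an importance-sampling ratio between current and old policy likelihoods, the loss weights the current policy's log-likelihood by a group-relative advantage, so negative-advantage samples are pushed down (the "negative sample unlearning" effect). This is an illustrative simplification under assumed inputs, not the paper's exact objective; the function name and the choice of raw standardized advantages as weights are hypothetical.

```python
def wd1_style_loss(log_probs, rewards):
    """Illustrative ratio-free weighted log-likelihood loss (hypothetical form).

    log_probs: approximate sequence log-likelihoods under the CURRENT policy
               only -- no old or reference policy likelihood is needed.
    rewards:   scalar rewards for each sampled completion in a group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # Group-relative advantages, as in GRPO-style baselines (assumed here).
    adv = [(r - mean) / (std + 1e-8) for r in rewards]
    # Positive advantages up-weight likelihood; negative advantages push
    # likelihood down, acting as negative-sample unlearning.
    return -sum(w * lp for w, lp in zip(adv, log_probs)) / n
```

Because only the current policy's likelihood appears, a single (noisy) likelihood approximation per step suffices, which is where the reduced overhead and variance come from relative to ratio-based objectives.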