[2507.08838] wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

arXiv - AI · 4 min read · Article

Summary

The paper presents wd1, a ratio-free policy optimization approach for training reasoning in diffusion language models with reinforcement learning, reporting significant accuracy gains at lower computational cost than ratio-based baselines.

Why It Matters

As diffusion-based large language models (dLLMs) become an increasingly viable alternative to autoregressive LLMs, enhancing their reasoning capabilities is crucial. Because a dLLM's likelihood is intractable, standard reinforcement learning pipelines must approximate several policy likelihoods per update; this research offers a more efficient method that needs only one such approximation, which could broaden applicability and improve performance on reasoning tasks.

Key Takeaways

  • Introduces wd1, a ratio-free policy optimization method for diffusion language models (dLLMs).
  • Achieves up to 59% improvement in accuracy over previous methods.
  • Reduces computational overhead associated with traditional RL approaches.
  • Extends to wd1++, achieving state-of-the-art performance on math tasks.
  • Demonstrates theoretical soundness through energy-guided training.

Computer Science > Machine Learning
arXiv:2507.08838 (cs)
[Submitted on 7 Jul 2025 (v1), last revised 14 Feb 2026 (this version, v2)]

Title: wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Authors: Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic

Abstract: Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and can lead to large variance and estimation error in the RL objective -- particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the RL objective as a weighted log-likelihood, requiring only a single approximation for the current parametrized policy likelihood. We formally show that our proposed method can be interpreted as energy-guided discrete diffusion training combined with negative sample unlearning, thereby confirming its theoretical soundness. In experiments on the LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) while req...
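The core contrast in the abstract can be illustrated with a toy numeric sketch. A ratio-based objective (GRPO/d1-style) needs both the current and old policy log-likelihoods, each of which a dLLM can only approximate, while a weighted log-likelihood objective needs just the current one. This is a minimal sketch, not the paper's implementation: the exponential/softmax weighting, the function names, and the example numbers are all illustrative assumptions.

```python
import math

def softmax_weights(rewards, beta=1.0):
    """Map a group of completion rewards to normalized positive weights.
    (Assumption: wd1's actual weighting may differ; this is a generic
    exponential weighting for illustration.)"""
    m = max(rewards)  # subtract max for numerical stability
    exps = [math.exp((r - m) / beta) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

def grpo_term(logp_cur, logp_old, advantage):
    """Ratio-based term (GRPO/d1-style): requires BOTH the current and
    old policy log-likelihoods, so a dLLM must approximate two
    quantities, adding cost and estimation variance."""
    ratio = math.exp(logp_cur - logp_old)  # importance-sampling ratio
    return ratio * advantage

def wd1_term(logp_cur, weight):
    """Ratio-free weighted log-likelihood term: only the current
    policy's (approximate) log-likelihood is needed; negative weights
    would correspond to unlearning low-reward completions."""
    return weight * logp_cur

# Toy usage: two sampled completions with rewards 1.0 and 0.0 and
# (made-up) approximate log-likelihoods under the current policy.
w = softmax_weights([1.0, 0.0])
loss = -(w[0] * (-1.2) + w[1] * (-3.5))  # maximize weighted log-likelihood
```

When the current and old policies coincide, the ratio term reduces to the raw advantage; the weighted form never has to compute that ratio at all, which is the source of the claimed variance and overhead reduction.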
