[2602.19895] DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

arXiv - Machine Learning

Summary

The paper presents DSDR, a reinforcement learning framework that improves exploration in large language model (LLM) reasoning by regularizing diversity at two scales: across reasoning trajectories and within them.

Why It Matters

As LLMs become increasingly integral in various applications, improving their reasoning capabilities is crucial. DSDR addresses the limitations of existing methods by fostering deeper exploration and more robust learning signals, which can lead to more accurate and reliable AI systems.

Key Takeaways

  • DSDR introduces a dual-scale approach to diversity in LLM reasoning.
  • It enhances exploration by promoting distinct solution modes and preventing entropy collapse.
  • The framework is supported by theoretical analysis of its correctness guarantees.
  • Experiments show significant improvements in accuracy across multiple reasoning benchmarks.
  • Code availability encourages further research and application of the DSDR framework.
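
To make the "entropy collapse" takeaway concrete, here is a small illustrative sketch (not from the paper): as a policy's next-token distribution concentrates on a single token, its entropy shrinks toward zero, leaving little local stochasticity to drive exploration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))

# A uniform distribution over 4 tokens has maximal entropy (ln 4 ≈ 1.386);
# a near-deterministic policy has entropy close to 0.
uniform = [0.25, 0.25, 0.25, 0.25]
sharpened = [0.97, 0.01, 0.01, 0.01]
```

This is why entropy bonuses are a common exploration tool; the paper's claim is that such purely local stochasticity does not, by itself, produce path-level diversity.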

Computer Science > Machine Learning
arXiv:2602.19895 (cs) [Submitted on 23 Feb 2026]

Title: DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang

Abstract: Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-loc...
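
The two regularizers described in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: the function names, the use of trajectory embeddings for global diversity, and the coefficients `lam_global` / `lam_local` are all assumptions, and the global-to-local coupling mechanism (truncated in the abstract) is not modeled.

```python
import numpy as np

def token_entropy(probs):
    """Length-invariant token-level entropy: the MEAN per-token entropy,
    so longer trajectories are not rewarded merely for emitting more tokens.
    probs: (T, V) array of per-step next-token distributions."""
    per_token = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return float(per_token.mean())

def pairwise_diversity(feats):
    """Global diversity: mean pairwise (1 - cosine similarity) among
    trajectory-level feature vectors (a hypothetical stand-in for however
    the paper represents trajectories). feats: (N, D)."""
    n = feats.shape[0]
    if n < 2:
        return 0.0
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())

def dsdr_bonus(trajectories, lam_global=0.1, lam_local=0.01):
    """Combined dual-scale bonus for one sampled group. Both terms are
    restricted to *correct* trajectories, matching the abstract's
    description of preserving correctness while diversifying."""
    correct = [t for t in trajectories if t["correct"]]
    if not correct:
        return 0.0
    g = pairwise_diversity(np.stack([t["feat"] for t in correct]))
    l = np.mean([token_entropy(t["probs"]) for t in correct])
    return lam_global * g + lam_local * float(l)
```

In a group-based policy-optimization setup, a bonus of this shape would be added to the verifier reward for each sampled group, pushing correct trajectories apart (global term) while keeping per-token distributions from collapsing (local term).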
