[2602.19895] DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Summary
The paper presents DSDR, a novel reinforcement learning framework aimed at enhancing exploration in large language model (LLM) reasoning by promoting dual-scale diversity in reasoning trajectories.
Why It Matters
As LLMs become increasingly integral to a wide range of applications, improving their reasoning capabilities is crucial. DSDR addresses the limitations of existing methods by fostering deeper exploration and more robust learning signals, which can lead to more accurate and reliable AI systems.
Key Takeaways
- DSDR introduces a dual-scale approach to diversity in LLM reasoning.
- It enhances exploration by promoting distinct solution modes and preventing entropy collapse.
- The framework is supported by theoretical analysis showing that diversity can be encouraged while preserving correctness.
- Experiments show significant improvements in accuracy across multiple reasoning benchmarks.
- Code availability encourages further research and application of the DSDR framework.
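The length-invariant, token-level entropy regularization mentioned above can be sketched in a few lines. This is an illustrative assumption of how such a term might look, not the paper's exact formulation: `length_invariant_entropy_bonus` and the choice to average (rather than sum) entropy over tokens are hypothetical, chosen so the bonus is comparable across trajectories of different lengths and is applied only to correct trajectories.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def length_invariant_entropy_bonus(trajectory_token_probs, is_correct):
    """Mean per-token entropy of a trajectory, zeroed for incorrect ones.

    Averaging over tokens (instead of summing) keeps the bonus
    length-invariant, so long trajectories are not favored simply for
    being long; restricting the bonus to correct trajectories avoids
    rewarding diverse mistakes.
    """
    if not is_correct or not trajectory_token_probs:
        return 0.0
    total = sum(token_entropy(p) for p in trajectory_token_probs)
    return total / len(trajectory_token_probs)
```

A uniform two-way distribution at every step yields a bonus of ln(2) regardless of trajectory length, while an incorrect trajectory contributes nothing.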
Computer Science > Machine Learning
arXiv:2602.19895 (cs)
Submitted on 23 Feb 2026
Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Abstract: Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-loc...
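The global scale, per the abstract, promotes diversity among correct reasoning trajectories so the policy explores distinct solution modes. Since the abstract does not specify the exact diversity measure, the sketch below uses mean pairwise Jaccard distance over token n-grams as an illustrative proxy; `global_diversity_reward` and the n-gram choice are assumptions, not the paper's method.

```python
def ngrams(tokens, n=3):
    """Set of token n-grams in a trajectory."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_distance(a, b, n=3):
    """1 - Jaccard similarity between the n-gram sets of two trajectories."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return 1.0 - len(ga & gb) / len(ga | gb)

def global_diversity_reward(correct_trajectories, n=3):
    """Mean pairwise n-gram distance among correct trajectories.

    A high value means the correct rollouts in a group take distinct
    reasoning paths (distinct solution modes); identical rollouts
    score 0, so the policy gains nothing from repeating one mode.
    """
    k = len(correct_trajectories)
    if k < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += jaccard_distance(correct_trajectories[i],
                                      correct_trajectories[j], n)
            pairs += 1
    return total / pairs
```

Two identical correct trajectories score 0.0, while two trajectories sharing no trigrams score 1.0, rewarding groups whose correct solutions differ at the path level rather than only at the token level.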