[2602.19895] DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Summary
The paper presents DSDR, a novel reinforcement learning framework aimed at enhancing exploration in large language model (LLM) reasoning by promoting dual-scale diversity in reasoning trajectories.
Why It Matters
As LLMs become increasingly integral to a wide range of applications, improving their reasoning capabilities is crucial. DSDR addresses the limitations of existing methods by fostering deeper exploration and more robust learning signals, which can lead to more accurate and reliable AI systems.
Key Takeaways
- DSDR introduces a dual-scale approach to diversity in LLM reasoning.
- It enhances exploration by promoting distinct solution modes and preventing entropy collapse.
- The framework is supported by theoretical analysis showing that diversity can be encouraged while preserving correctness.
- Experiments show significant improvements in accuracy across multiple reasoning benchmarks.
- Code availability encourages further research and application of the DSDR framework.
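The length-invariant, token-level entropy regularization mentioned above can be sketched in a few lines. This is an illustrative assumption of how such a term might look, not the paper's exact formulation: `length_invariant_entropy_bonus` and the choice to average (rather than sum) entropy over tokens are hypothetical, chosen so the bonus is comparable across trajectories of different lengths and is applied only to correct trajectories.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def length_invariant_entropy_bonus(trajectory_token_probs, is_correct):
    """Mean per-token entropy of a trajectory, zeroed for incorrect ones.

    Averaging over tokens (instead of summing) keeps the bonus
    length-invariant, so long trajectories are not favored simply for
    being long; restricting the bonus to correct trajectories avoids
    rewarding diverse mistakes.
    """
    if not is_correct or not trajectory_token_probs:
        return 0.0
    total = sum(token_entropy(p) for p in trajectory_token_probs)
    return total / len(trajectory_token_probs)
```

A uniform two-way distribution at every step yields a bonus of ln(2) regardless of trajectory length, while an incorrect trajectory contributes nothing.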
Computer Science > Machine Learning
arXiv:2602.19895 (cs)
Submitted on 23 Feb 2026
Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Abstract: Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-loc...
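The global scale, per the abstract, promotes diversity among correct reasoning trajectories so the policy explores distinct solution modes. Since the abstract does not specify the exact diversity measure, the sketch below uses mean pairwise Jaccard distance over token n-grams as an illustrative proxy; `global_diversity_reward` and the n-gram choice are assumptions, not the paper's method.

```python
def ngrams(tokens, n=3):
    """Set of token n-grams in a trajectory."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_distance(a, b, n=3):
    """1 - Jaccard similarity between the n-gram sets of two trajectories."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return 1.0 - len(ga & gb) / len(ga | gb)

def global_diversity_reward(correct_trajectories, n=3):
    """Mean pairwise n-gram distance among correct trajectories.

    A high value means the correct rollouts in a group take distinct
    reasoning paths (distinct solution modes); identical rollouts
    score 0, so the policy gains nothing from repeating one mode.
    """
    k = len(correct_trajectories)
    if k < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += jaccard_distance(correct_trajectories[i],
                                      correct_trajectories[j], n)
            pairs += 1
    return total / pairs
```

Two identical correct trajectories score 0.0, while two trajectories sharing no trigrams score 1.0, rewarding groups whose correct solutions differ at the path level rather than only at the token level.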