[2509.26209] Diversity-Incentivized Exploration for Versatile Reasoning
Summary
The paper presents DIVER, a framework that enhances reasoning in Large Language Models through diversity-incentivized exploration, addressing the deficient exploration and poor sample efficiency of Reinforcement Learning with Verifiable Rewards (RLVR).
Why It Matters
This research tackles a key limitation of existing reinforcement learning methods on reasoning tasks: inefficient exploration of vast, sparsely rewarded state-action spaces. By leveraging global sequence-level diversity as an exploration signal, it offers a promising route to stronger reasoning capabilities in AI models, which matters for applications across many domains.
Key Takeaways
- DIVER framework enhances reasoning in LLMs by incentivizing exploration.
- Strong correlation found between global diversity and reasoning capacity.
- Introduces intrinsic rewards to promote exploration in structured spaces.
- Outperforms existing RLVR methods in diverse evaluation tasks.
- Provides code for implementation, promoting accessibility and further research.
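The takeaways mention an intrinsic reward that promotes exploration via global sequence-level diversity. The paper's exact diversity measure is not given in this summary, so the following is only a minimal sketch under an assumed definition: diversity as the mean pairwise cosine distance among embeddings of sampled responses, added to the verifiable (task) reward with a weight `beta`. The function names and `beta` are illustrative, not from the paper.

```python
import numpy as np

def global_diversity_bonus(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance among sampled response embeddings.

    A stand-in for a global sequence-level diversity measure: identical
    responses give a bonus of 0, mutually orthogonal ones give 1.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Average similarity over distinct pairs, then convert to a distance.
    mean_pairwise_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return 1.0 - mean_pairwise_sim

def shaped_reward(verifiable_reward: float,
                  embeddings: np.ndarray,
                  beta: float = 0.1) -> float:
    """Verifiable reward plus a weighted global-diversity intrinsic bonus."""
    return verifiable_reward + beta * global_diversity_bonus(embeddings)
```

With this definition, a batch of identical responses earns no bonus, while a maximally diverse batch earns the full `beta`; the weight trades off task reward against exploration pressure.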
Computer Science > Artificial Intelligence
arXiv:2509.26209 (cs)
[Submitted on 30 Sep 2025 (v1), last revised 21 Feb 2026 (this version, v2)]

Title: Diversity-Incentivized Exploration for Versatile Reasoning
Authors: Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, Zhi Wang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In the paper, we propose DIVER (Diversity-Incentivized Exploration for VersatilE Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve opt...
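The abstract mentions a potential-based reward shaping mechanism. The paper's specific construction is not shown in this excerpt; the sketch below uses the classic potential-based form (shaped reward r + γΦ(s') − Φ(s), due to Ng, Harada, and Russell), which is the standard way to add shaping without changing which policies are optimal. Treating a diversity score as the potential Φ is an assumption for illustration only.

```python
def potential_shaped_reward(r: float, phi_s: float, phi_next: float,
                            gamma: float = 1.0) -> float:
    """Classic potential-based shaping: r + gamma * Phi(s') - Phi(s).

    Because the shaping terms telescope along a trajectory, optimal
    policies under the shaped reward match those under the original.
    """
    return r + gamma * phi_next - phi_s

# Illustrative trajectory: sparse task reward, hypothetical diversity
# potentials per step (terminal potential set to 0).
rewards = [0.0, 0.0, 1.0]
phis = [0.2, 0.5, 0.9, 0.0]
shaped = [potential_shaped_reward(r, phis[t], phis[t + 1])
          for t, r in enumerate(rewards)]
# With gamma = 1, the undiscounted return shifts only by the constant
# Phi(s_T) - Phi(s_0), leaving the ranking of trajectories intact.
```

Here the dense shaping terms guide exploration step by step, while the telescoping guarantee explains why such a mechanism can "preserve opt[imality]" as the truncated abstract states.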