[2505.09655] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths

[2505.09655] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

arXiv - Machine Learning March 03, 2026 4 min read

About this article

Abstract page for arXiv paper 2505.09655: DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

Computer Science > Computation and Language arXiv:2505.09655 (cs) [Submitted on 14 May 2025 (v1), last revised 1 Mar 2026 (this version, v4)] Title:DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi View a PDF of the paper titled DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning, by Xiwen Chen and 9 other authors View PDF HTML (experimental) Abstract:Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy ...

Originally published on March 03, 2026. Curated by AI News.

Llms

Claude Mythos and misguided open-weight fearmongering

AI Tools & Products · 9 min · about 3 hours ago

Llms

Anthropic Agrees to Rent CoreWeave AI Capacity to Power Claude

AI Tools & Products · 1 min · about 3 hours ago

Llms

CoreWeave strikes a deal to power Anthropic's Claude AI models — and the stock surges 12%

AI Tools & Products · 3 min · about 3 hours ago

Llms

Walmart’s AI Push Links Gemini App Experience With U.S. Manufacturing Shift

AI Tools & Products · 6 min · about 3 hours ago

[2505.09655] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

About this article

Related Articles

Claude Mythos and misguided open-weight fearmongering

Anthropic Agrees to Rent CoreWeave AI Capacity to Power Claude

CoreWeave strikes a deal to power Anthropic's Claude AI models — and the stock surges 12%

Walmart’s AI Push Links Gemini App Experience With U.S. Manufacturing Shift

No comments

Stay updated with AI News