[2505.09655] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

[2505.09655] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

arXiv - Machine Learning 4 min read

About this article

Abstract page for arXiv paper 2505.09655: DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

Computer Science > Computation and Language arXiv:2505.09655 (cs) [Submitted on 14 May 2025 (v1), last revised 1 Mar 2026 (this version, v4)] Title:DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi View a PDF of the paper titled DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning, by Xiwen Chen and 9 other authors View PDF HTML (experimental) Abstract:Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy ...

Originally published on March 03, 2026. Curated by AI News.

Related Articles

Claude Mythos and misguided open-weight fearmongering
Llms

Claude Mythos and misguided open-weight fearmongering

AI Tools & Products · 9 min ·
Llms

Anthropic Agrees to Rent CoreWeave AI Capacity to Power Claude

AI Tools & Products · 1 min ·
CoreWeave strikes a deal to power Anthropic's Claude AI models — and the stock surges 12%
Llms

CoreWeave strikes a deal to power Anthropic's Claude AI models — and the stock surges 12%

AI Tools & Products · 3 min ·
Walmart’s AI Push Links Gemini App Experience With U.S. Manufacturing Shift
Llms

Walmart’s AI Push Links Gemini App Experience With U.S. Manufacturing Shift

AI Tools & Products · 6 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime