[2602.05630] Rewards as Labels: Revisiting RLVR from a Classification Perspective

arXiv - Machine Learning

About this article


Computer Science > Machine Learning
arXiv:2602.05630 (cs)
[Submitted on 5 Feb 2026 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: Rewards as Labels: Revisiting RLVR from a Classification Perspective
Authors: Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL…
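
To make the abstract's contrast concrete, here is a minimal PyTorch sketch comparing GRPO's group-normalized scalar advantages with one possible reading of "rewards as labels". Only `grpo_advantages` follows the standard GRPO formulation; `real_style_loss`, its sequence-logprob logit, and the `anchor` parameter are illustrative assumptions, since the paper's exact objective is not reproduced on this page.

```python
import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Standard GRPO group-normalized advantages: A_i = (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def real_style_loss(seq_logprobs: torch.Tensor,
                    rewards: torch.Tensor,
                    anchor: float = 0.0) -> torch.Tensor:
    """Hypothetical 'rewards as labels' objective (an assumption, not the
    paper's exact loss): binary cross-entropy between the verifiable reward
    (0 or 1), treated as a classification label, and a logit built from the
    rollout's sequence log-probability shifted by an assumed anchor.
    The BCE gradient w.r.t. each logit is sigmoid(logit) - label, so the
    per-rollout weight is monotonic in the logit and bounded in (-1, 1)."""
    logits = seq_logprobs - anchor
    return F.binary_cross_entropy_with_logits(logits, rewards.float())

# Toy group of 4 rollouts for one prompt: 1 correct, 3 incorrect.
rewards = torch.tensor([1.0, 0.0, 0.0, 0.0])
seq_logprobs = torch.tensor([-5.0, -2.0, -8.0, -3.0], requires_grad=True)

print(grpo_advantages(rewards))   # ≈ tensor([ 1.5, -0.5, -0.5, -0.5])
real_style_loss(seq_logprobs, rewards).backward()
print(seq_logprobs.grad)          # each entry bounded, never blowing up
```

On the toy group above, GRPO assigns an advantage of +1.5 to the single correct rollout and -0.5 to each of the three incorrect ones, so the negatives jointly match the positive in total magnitude regardless of their individual quality. Under the cross-entropy reading, each rollout's gradient weight is sigmoid(logit) - label, which stays bounded, in the spirit of the "monotonic and bounded gradient weighting" the abstract describes.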

Originally published on March 05, 2026. Curated by AI News.

