[2602.05630] Rewards as Labels: Revisiting RLVR from a Classification Perspective
Computer Science > Machine Learning

arXiv:2602.05630 (cs)

[Submitted on 5 Feb 2026 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: Rewards as Labels: Revisiting RLVR from a Classification Perspective

Authors: Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that RE...
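The contrast the abstract draws, between scalar reward weights (as in GRPO-style advantage estimation) and rewards treated as classification labels, can be sketched as follows. This is a minimal illustration based only on the abstract: the function names are made up, and the sigmoid cross-entropy formulation stands in for the paper's actual REAL objective, which additionally involves anchor logits not shown here. It does illustrate the "monotonic and bounded" gradient property, since the gradient of a sigmoid cross-entropy with respect to its logit is sigmoid(z) - y, which lies in (-1, 1).

```python
import numpy as np

def grpo_weights(rewards):
    # GRPO-style view: rewards become scalar weights by standardizing
    # them within the rollout group; these weights multiply log-prob
    # gradients and are unbounded as the group std shrinks.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def label_style_loss(logits, rewards):
    # "Rewards as labels" view (illustrative): a binary verifiable
    # reward (1 = correct, 0 = incorrect) is treated as a class label,
    # and each rollout's score is trained with sigmoid cross-entropy.
    z = np.asarray(logits, dtype=float)
    y = np.asarray(rewards, dtype=float)
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    grad = p - y  # per-rollout gradient w.r.t. the logit, bounded in (-1, 1)
    return loss, grad
```

A quick comparison shows the difference in gradient scale: with rewards [1, 0, 0, 1], `grpo_weights` yields weights of magnitude 1 that grow without bound as rewards within a group become nearly identical, whereas the cross-entropy gradients from `label_style_loss` stay inside (-1, 1) regardless of the reward distribution.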