[2602.05630] Rewards as Labels: Revisiting RLVR from a Classification Perspective
Computer Science > Machine Learning

arXiv:2602.05630 (cs)

[Submitted on 5 Feb 2026 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: Rewards as Labels: Revisiting RLVR from a Classification Perspective

Authors: Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that RE...
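The contrast the abstract draws, between scalar reward weights (as in GRPO-style advantage estimation) and rewards treated as classification labels, can be sketched as follows. This is a minimal illustration based only on the abstract: the function names are made up, and the sigmoid cross-entropy formulation stands in for the paper's actual REAL objective, which additionally involves anchor logits not shown here. It does illustrate the "monotonic and bounded" gradient property, since the gradient of a sigmoid cross-entropy with respect to its logit is sigmoid(z) - y, which lies in (-1, 1).

```python
import numpy as np

def grpo_weights(rewards):
    # GRPO-style view: rewards become scalar weights by standardizing
    # them within the rollout group; these weights multiply log-prob
    # gradients and are unbounded as the group std shrinks.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def label_style_loss(logits, rewards):
    # "Rewards as labels" view (illustrative): a binary verifiable
    # reward (1 = correct, 0 = incorrect) is treated as a class label,
    # and each rollout's score is trained with sigmoid cross-entropy.
    z = np.asarray(logits, dtype=float)
    y = np.asarray(rewards, dtype=float)
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    grad = p - y  # per-rollout gradient w.r.t. the logit, bounded in (-1, 1)
    return loss, grad
```

A quick comparison shows the difference in gradient scale: with rewards [1, 0, 0, 1], `grpo_weights` yields weights of magnitude 1 that grow without bound as rewards within a group become nearly identical, whereas the cross-entropy gradients from `label_style_loss` stay inside (-1, 1) regardless of the reward distribution.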