[2511.04439] CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
Computer Science > Artificial Intelligence
arXiv:2511.04439 (cs)
[Submitted on 6 Nov 2025 (v1), last revised 4 Mar 2026 (this version, v3)]

Title: CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
Authors: Anisha Garg, Claire Zhang, Nishit Neema, David Bick, Ganesh Venkatesh, Joel Hestness

Abstract: Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned critic, GRPO has enabled efficient scaling of reinforcement learning from verifiable rewards (RLVR). However, we identify a fundamental limitation: GRPO's mean baseline can assign positive advantages to incorrect solutions simply because they outperform a poorly performing group average, which leads to overestimated advantages and reinforcement of incorrect behaviours. To address this, we propose Correctness-Relative Policy Optimization (CoRPO), a simple modification to the GRPO objective that clips the minimum baseline to a fixed correctness threshold. We show that baseline clipping introduces a protective bias to advantage estimation that mitigates overfitting while preserving effective exploration. Empirically, CoRPO-trained models improve cross-domain reasoning, generalizi…
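The baseline clipping described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold value of 0.5 and the omission of GRPO's variance normalization are assumptions made for clarity.

```python
from statistics import mean

def grpo_advantages(rewards):
    """GRPO-style advantages: each reward relative to the group mean
    (variance normalization omitted for clarity)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def corpo_advantages(rewards, threshold):
    """CoRPO sketch: the baseline is clipped from below at a fixed
    correctness threshold, so a reward below the threshold (an
    incorrect solution) can never receive a positive advantage
    merely because the group average is poor."""
    baseline = max(mean(rewards), threshold)
    return [r - baseline for r in rewards]

# A group where every sample is incorrect (all rewards below the
# hypothetical correctness threshold of 0.5) but rewards vary:
group = [0.1, 0.2, 0.3, 0.4]
print(grpo_advantages(group))        # the two best wrong answers still get positive advantage
print(corpo_advantages(group, 0.5))  # every advantage is non-positive
```

With GRPO, the group mean of 0.25 makes the 0.3 and 0.4 samples look good and reinforces them; clipping the baseline at the correctness threshold removes that positive signal while leaving the relative ordering of correct solutions unchanged.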