[2604.02986] Mitigating Reward Hacking in RLHF via Advantage Sign Robustness


Computer Science > Machine Learning

arXiv:2604.02986 (cs) [Submitted on 3 Apr 2026]

Title: Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Authors: Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama

Abstract: Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius: the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better ...
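
The abstract names the ingredients of SignCert-PO but gives no formulas, so the sketch below is a minimal reconstruction of the idea under explicit assumptions: a linear RM head (which makes the first-order certificate exact rather than approximate), an L2 perturbation model on the RM parameters, a GRPO-style group-mean baseline for the advantage, and a linear ramp as the down-weighting rule. The names PHI_DIM, GROUP, and TAU are hypothetical and not from the paper.

```python
# Illustrative reconstruction of the SignCert-PO idea from the abstract;
# the exact radius definition and weighting schedule here are assumptions.
import torch

torch.manual_seed(0)

PHI_DIM = 16   # feature dimension of the toy linear reward model (hypothetical)
GROUP = 8      # on-policy completions sampled for one prompt (hypothetical)
TAU = 0.5      # robustness threshold for down-weighting (hypothetical)

# Toy linear RM head: r(y) = theta . phi(y). Linearity makes the
# first-order certificate below exact.
theta = torch.randn(PHI_DIM, requires_grad=True)
phi = torch.randn(GROUP, PHI_DIM)              # completion features
logp = torch.randn(GROUP, requires_grad=True)  # stand-in policy log-probs

rewards = phi @ theta
advantages = rewards - rewards.mean()  # group-mean baseline (GRPO-style assumption)

# Certified sign-preservation radius under an L2 perturbation delta of the
# RM parameters theta: the sign of A_i cannot flip while
#   ||grad_theta A_i|| * ||delta|| < |A_i|,
# so rho_i = |A_i| / ||grad_theta A_i|| is the smallest ||delta|| that can
# flip it (exact here because each A_i is linear in theta; the gradient
# accounts for the perturbation's effect on the baseline as well).
radii = []
for a in advantages:
    (g,) = torch.autograd.grad(a, theta, retain_graph=True)
    radii.append(a.abs() / (g.norm() + 1e-8))
radii = torch.stack(radii).detach()

# Down-weight non-robust completions: weight ramps linearly from 0 to 1
# as the certified radius approaches TAU (one plausible schedule).
weights = (radii / TAU).clamp(max=1.0)

# Weighted REINFORCE-style surrogate loss over the group.
loss = -(weights * advantages.detach() * logp).mean()
loss.backward()  # gradients flow only to the policy (here, the logp stand-in)

print("certified radii:", radii.round(decimals=3))
print("sample weights :", weights.round(decimals=3))
```

For a real, nonlinear RM, the ratio |A_i| / ||grad_theta A_i|| is only a first-order estimate of the sign-preservation radius; a genuine certificate would also have to control higher-order terms, which is presumably what the paper's derivation in RM parameter space supplies.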

Originally published on April 06, 2026. Curated by AI News.

