[2604.02986] Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
Computer Science > Machine Learning
arXiv:2604.02986 (cs)
[Submitted on 3 Apr 2026]

Title: Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
Authors: Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama

Abstract: Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of decreasing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius, i.e., the magnitude of the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy-gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better ...