[2604.02652] Generalization Limits of Reinforcement Learning Alignment

arXiv - AI April 06, 2026 3 min read

About this article

Abstract page for arXiv paper 2604.02652: Generalization Limits of Reinforcement Learning Alignment

Computer Science > Machine Learning arXiv:2604.02652 (cs) [Submitted on 3 Apr 2026] Title:Generalization Limits of Reinforcement Learning Alignment Authors:Haruhi Shida, Koo Imai, Keigo Kansa View a PDF of the paper titled Generalization Limits of Reinforcement Learning Alignment, by Haruhi Shida and 2 other authors View PDF HTML (experimental) Abstract:The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose ``compound jailbreaks'' targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques -- each individually defended against -- to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3\% with individual methods to 71.4\% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios. Comments: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.02652 [cs.LG] (or arXiv:2604.02652v1 [cs.LG] for this version...

Originally published on April 06, 2026. Curated by AI News.

Llms

ChatGPT finally offers $100/month Pro plan | TechCrunch

OpenAI announced on Thursday something that power users have been asking for: a $100/month plan. Previously, subscriptions jumped from $2...

TechCrunch - AI · 4 min · 24 minutes ago

Llms

Florida AG announces investigation into OpenAI over shooting that allegedly involved ChatGPT | TechCrunch

ChatGPT had reportedly been used to plan the attack that killed two and injured five at Florida State University last April. The family o...

TechCrunch - AI · 4 min · about 2 hours ago

Llms

We’re open-sourcing a 33-benchmark diagnostic for AI alignment gaps, launches April 27

On April 27 we’re open-sourcing a free diagnostic tool called iFixAi. You run it against your AI system (agent, copilot, LLM integration,...

Reddit - Artificial Intelligence · 1 min · about 3 hours ago

Llms

Google’s Gemini AI can answer your questions with 3D models and simulations

submitted by /u/tekz [link] [comments]

Reddit - Artificial Intelligence · 1 min · about 3 hours ago

[2604.02652] Generalization Limits of Reinforcement Learning Alignment

About this article

Related Articles

ChatGPT finally offers $100/month Pro plan | TechCrunch

Florida AG announces investigation into OpenAI over shooting that allegedly involved ChatGPT | TechCrunch

We’re open-sourcing a 33-benchmark diagnostic for AI alignment gaps, launches April 27

Google’s Gemini AI can answer your questions with 3D models and simulations

No comments

Stay updated with AI News