[2604.00324] The Persistent Vulnerability of Aligned AI Systems

[2604.00324] The Persistent Vulnerability of Aligned AI Systems

arXiv - AI 4 min read

About this article

Abstract page for arXiv paper 2604.00324: The Persistent Vulnerability of Aligned AI Systems

Computer Science > Machine Learning arXiv:2604.00324 (cs) [Submitted on 31 Mar 2026] Title:The Persistent Vulnerability of Aligned AI Systems Authors:Aengus Lynch View a PDF of the paper titled The Persistent Vulnerability of Aligned AI Systems, by Aengus Lynch View PDF HTML (experimental) Abstract:Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose ha...

Originally published on April 02, 2026. Curated by AI News.

Related Articles

Anthropic Teams Up With Its Rivals to Keep AI From Hacking Everything | WIRED
Llms

Anthropic Teams Up With Its Rivals to Keep AI From Hacking Everything | WIRED

The AI lab's Project Glasswing will bring together Apple, Google, and more than 45 other organizations. They'll use the new Claude Mythos...

Wired - AI · 7 min ·
Machine Learning

[for hire] Open for contracts – Veteran Data Scientist (AI / ML / OR) focused on delivering real‑world solutions.

Hi Reddit, I've spent 20 years working with data, and I've learned how to crack problems that AI systems struggle with. I've got a knack ...

Reddit - ML Jobs · 1 min ·
Llms

The public needs to control AI-run infrastructure, labor, education, and governance— NOT private actors

A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

[D] ICML final justification

Do we get notified if any reviewer put their final justification into their original review comment? submitted by /u/tuejan11 [link] [com...

Reddit - Machine Learning · 1 min ·
More in Machine Learning: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime