[2603.01297] I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

arXiv - Machine Learning

arXiv:2603.01297 (cs), submitted on 1 Mar 2026

Authors: Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

Abstract: Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Cite as: arXiv:2603.0...
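
The link the abstract draws between a perturbation of magnitude $\sigma=0.02$ and roughly $1^\circ$ of angular drift can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a unit-norm embedding, isotropic noise rescaled so that its total norm equals sigma, and a hypothetical dimension of 768. The function name `angular_drift_degrees` is likewise made up for this example. Under those assumptions the noise is nearly orthogonal to the embedding, so the angle is close to arctan(0.02), a little over one degree.

```python
import math
import random


def angular_drift_degrees(dim=768, sigma=0.02, seed=0):
    """Angle (degrees) between a unit vector and its noise-perturbed copy.

    Assumptions (not from the paper): the embedding has unit norm and the
    noise is isotropic with total Euclidean norm exactly `sigma`.
    """
    rng = random.Random(seed)

    # Random unit-norm vector standing in for an embedding.
    e = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    e_norm = math.sqrt(sum(x * x for x in e))
    e = [x / e_norm for x in e]

    # Isotropic noise direction, rescaled to norm sigma.
    n = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n_norm = math.sqrt(sum(x * x for x in n))
    n = [sigma * x / n_norm for x in n]

    # Perturbed vector and its cosine similarity with the original.
    p = [a + b for a, b in zip(e, n)]
    p_norm = math.sqrt(sum(x * x for x in p))
    cos_sim = sum(a * b for a, b in zip(e, p)) / p_norm
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding

    return math.degrees(math.acos(cos_sim))


if __name__ == "__main__":
    print("simulated drift (deg):", angular_drift_degrees())
    print("arctan(sigma) (deg):  ", math.degrees(math.atan(0.02)))
```

If the paper instead applies noise with per-coordinate standard deviation $\sigma$, the resulting angle would scale with the square root of the dimension and be much larger, so the reading above is only one plausible interpretation of "normalized perturbations".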

Originally published on March 03, 2026. Curated by AI News.
