[2603.01297] I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Computer Science > Machine Learning
arXiv:2603.01297 (cs)
[Submitted on 1 Mar 2026]

Title: I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Authors: Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

Abstract: Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically test this assumption and find that it fails: normalized perturbations of magnitude $\sigma = 0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from 85% to 50% ROC-AUC. Critically, mean confidence drops by only 14%, producing dangerous silent failures in which 72% of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20% worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2603.01297 [cs.LG]
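As a rough illustration of the correspondence the abstract draws between perturbation magnitude and angular drift, the following minimal sketch perturbs a unit-normalized embedding with noise of relative norm $\sigma = 0.02$ and measures the resulting angle. The embedding dimension, noise model (isotropic Gaussian rescaled to a fixed norm), and seed are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768        # hypothetical embedding dimension
sigma = 0.02   # perturbation magnitude quoted in the abstract

# Unit-normalized embedding, as assumed by the "embedding sphere" picture.
x = rng.normal(size=d)
x /= np.linalg.norm(x)

# Isotropic Gaussian noise rescaled so the perturbation has norm sigma.
eps = rng.normal(size=d)
eps *= sigma / np.linalg.norm(eps)

# Re-project the perturbed embedding onto the sphere and measure the drift.
x_drifted = (x + eps) / np.linalg.norm(x + eps)
angle_deg = np.degrees(np.arccos(np.clip(x_drifted @ x, -1.0, 1.0)))
print(f"angular drift: {angle_deg:.2f} degrees")  # ~1.1 degrees
```

Because random noise is nearly orthogonal to $x$ in high dimensions, the resulting angle is approximately $\arctan(\sigma) \approx 1.15^\circ$, consistent with the $\approx 1^\circ$ figure quoted above.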