[2603.01297] I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Computer Science > Machine Learning
arXiv:2603.01297 (cs)
[Submitted on 1 Mar 2026]

Title: I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Authors: Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

Abstract: Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically test this assumption and find that it fails: normalized perturbations of magnitude $\sigma = 0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from 85% to 50% ROC-AUC. Critically, mean confidence drops by only 14%, producing dangerous silent failures in which 72% of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20% worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2603.01297 [cs.LG]
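As a rough illustration of the correspondence the abstract draws between perturbation magnitude and angular drift, the following minimal sketch perturbs a unit-normalized embedding with noise of relative norm $\sigma = 0.02$ and measures the resulting angle. The embedding dimension, noise model (isotropic Gaussian rescaled to a fixed norm), and seed are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768        # hypothetical embedding dimension
sigma = 0.02   # perturbation magnitude quoted in the abstract

# Unit-normalized embedding, as assumed by the "embedding sphere" picture.
x = rng.normal(size=d)
x /= np.linalg.norm(x)

# Isotropic Gaussian noise rescaled so the perturbation has norm sigma.
eps = rng.normal(size=d)
eps *= sigma / np.linalg.norm(eps)

# Re-project the perturbed embedding onto the sphere and measure the drift.
x_drifted = (x + eps) / np.linalg.norm(x + eps)
angle_deg = np.degrees(np.arccos(np.clip(x_drifted @ x, -1.0, 1.0)))
print(f"angular drift: {angle_deg:.2f} degrees")  # ~1.1 degrees
```

Because random noise is nearly orthogonal to $x$ in high dimensions, the resulting angle is approximately $\arctan(\sigma) \approx 1.15^\circ$, consistent with the $\approx 1^\circ$ figure quoted above.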