[2604.00072] Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates
Computer Science > Machine Learning

arXiv:2604.00072 (cs)

[Submitted on 31 Mar 2026]

Title: Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

Authors: Arsenios Scrivens

Abstract: Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations -- spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks -- all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4, d=496; Swimmer-v4, d=1408; HalfCheetah-v4, d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail -- including the NP-optimal test and MLPs with 100% training accuracy -- demonstrating structural impossibility. We then show that the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts across dimensions d in {84, 240, 768, 2688, 5760, 9984, 17408} using provable analytical bounds (unconditional delta=0). Ball chaining enables unbounded parameter-space traversal: on MuJoCo Reacher...
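The abstract's Lipschitz ball verifier and ball chaining can be illustrated with a minimal sketch. This is not the paper's implementation; all function names are hypothetical, and it assumes a safety function s(theta) >= 0 on safe parameters with a known global Lipschitz constant L, so that if s(theta0) = m > 0, every theta with ||theta - theta0|| <= m/L is provably safe (a zero-false-accept gate, delta = 0):

```python
import numpy as np

def ball_radius(margin, lipschitz):
    """Provably safe radius around a verified center: r = m / L."""
    return margin / lipschitz

def inside_ball(theta_new, center, radius):
    """Accept a parameter update only if it stays inside the current safe ball."""
    return np.linalg.norm(theta_new - center) <= radius

def chain(safety_margin, lipschitz, center, updates):
    """Ball chaining (illustrative): re-center after each accepted update,
    so a sequence of small verified steps can traverse arbitrarily far in
    parameter space while every accepted point remains provably safe."""
    accepted = []
    for theta in updates:
        r = ball_radius(safety_margin(center), lipschitz)
        if inside_ball(theta, center, r):
            center = theta  # new verified center; recompute margin next step
            accepted.append(theta)
    return accepted

# Toy example: the unsafe region is the half-space theta[0] >= 1, so
# s(theta) = 1 - theta[0] has Lipschitz constant 1.
margin = lambda t: 1.0 - t[0]
steps = [np.array([0.3, 0.0]), np.array([0.6, 0.1]), np.array([2.0, 0.0])]
ok = chain(margin, 1.0, np.zeros(2), steps)
# The first two small steps are verified and chained; the large jump into
# the unsafe half-space is rejected.
```

The key contrast with a classifier gate is that acceptance here follows from an analytical bound rather than a learned decision boundary, so no distribution shift between iterations can induce a false accept.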