[R] Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails (arXiv 2603.18280)
Paper: https://arxiv.org/abs/2603.18280

TL;DR: Current alignment evaluation measures concept detection (probing) and refusal (benchmarking), but alignment primarily operates through a learned routing mechanism between the two, and that routing is lab-specific, fragile, and invisible to refusal-based benchmarks. We use political censorship in Chinese-origin LLMs as a natural experiment because it gives us known ground truth and wide behavioral variation across labs.

Setup: Nine open-weight models …