[R] Reference-model-free behavioral discovery of AuditBench model organisms via Probe-Mediated Adaptive Auditing
Anthropic's AuditBench is a set of 56 Llama 3.3 70B models with planted hidden behaviors. Their best agent detects the behaviors 10-13% of the time (42% with a super-agent aggregating many parallel runs). A central finding is the "tool-to-agent gap": white-box interpretability tools that work in standalone evaluation fail to help the agent in practice. Most auditing work uses the base model as a reference to compare against. I wanted to know if you can detect these modifications blind - no reference ...