[R] Reference model free behavioral discovery of AudiBench model organisms via Probe-Mediated Adaptive Auditing
Anthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviros 10-13% of the tim...