[2603.02202] Frontier Models Can Take Actions at Low Probabilities
Computer Science > Machine Learning
arXiv:2603.02202 (cs)
[Submitted on 2 Mar 2026]

Title: Frontier Models Can Take Actions at Low Probabilities
Authors: Alex Serrano, Wen Xing, David Lindner, Erik Jenner

Abstract: Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. This requires taking actions at very low rates while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5, and Qwen-3 families to take a target action at a low probability (e.g., 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e., whether they perform the target action roughly 1 in 10,000 times when resampled). We find that frontier models are surprisingly good at this task. If there is a source of entropy in context (such as a UUID), they maintain high calibration at rates below 1 in 100,000 actions. Without external entropy, some models can still reach rates below 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate ...
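The calibration check described in the abstract can be illustrated with a minimal sketch (not taken from the paper): simulate a perfectly calibrated agent that "defects" with some target probability on each action, then resample many times and compare the empirical defection rate against the target. The function name and sample sizes here are illustrative assumptions, not the paper's actual protocol.

```python
import random

def estimate_action_rate(target_p: float, n_trials: int, seed: int = 0) -> float:
    """Simulate an idealized, perfectly calibrated agent that defects with
    probability target_p on each independent action, and return the
    empirical defection rate over n_trials resamples."""
    rng = random.Random(seed)
    defections = sum(rng.random() < target_p for _ in range(n_trials))
    return defections / n_trials

# Target rate from the abstract's example: 0.01% = 1 in 10,000.
target = 1e-4
rate = estimate_action_rate(target, n_trials=1_000_000)
# With 1e6 trials the expected count is ~100 defections (standard
# deviation ~10), so the empirical rate should land near the target.
print(f"empirical rate: {rate:.6f} (target {target})")
```

Note the sample-size requirement this implies: to distinguish a rate of 1 in 100,000 from zero, an evaluator needs on the order of several hundred thousand resamples, which is exactly why such low defection rates are hard to catch in a limited pre-deployment evaluation.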