[2603.00829] Constitutional Black-Box Monitoring for Scheming in LLM Agents
About this article
Abstract page for arXiv paper 2603.00829: Constitutional Black-Box Monitoring for Scheming in LLM Agents
Computer Science > Computation and Language arXiv:2603.00829 (cs) [Submitted on 28 Feb 2026] Title:Constitutional Black-Box Monitoring for Scheming in LLM Agents Authors:Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn View a PDF of the paper titled Constitutional Black-Box Monitoring for Scheming in LLM Agents, by Simon Storf and 3 other authors View PDF HTML (experimental) Abstract:Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on ...