[2603.00829] Constitutional Black-Box Monitoring for Scheming in LLM

[2603.00829] Constitutional Black-Box Monitoring for Scheming in LLM Agents

arXiv - Machine Learning March 03, 2026 4 min read

About this article

Abstract page for arXiv paper 2603.00829: Constitutional Black-Box Monitoring for Scheming in LLM Agents

Computer Science > Computation and Language arXiv:2603.00829 (cs) [Submitted on 28 Feb 2026] Title:Constitutional Black-Box Monitoring for Scheming in LLM Agents Authors:Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn View a PDF of the paper titled Constitutional Black-Box Monitoring for Scheming in LLM Agents, by Simon Storf and 3 other authors View PDF HTML (experimental) Abstract:Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on ...

Originally published on March 03, 2026. Curated by AI News.

Llms

Why are we blindly trusting AI companies with our data?

Lately I’ve been seeing a story floating around that really made me pause. Apparently, there were claims that the US government asked Ant...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Llms

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV

Artificial intelligence is transforming every corner of industry, and television is no exception. Major networks in Korea have recently a...

AI Tools & Products · 4 min · about 2 hours ago

Llms

[2603.16629] MLLM-based Textual Explanations for Face Comparison

Abstract page for arXiv paper 2603.16629: MLLM-based Textual Explanations for Face Comparison

arXiv - AI · 4 min · about 3 hours ago

Llms

[2603.15159] To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Abstract page for arXiv paper 2603.15159: To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

arXiv - AI · 4 min · about 3 hours ago

[2603.00829] Constitutional Black-Box Monitoring for Scheming in LLM Agents

About this article

Related Articles

Why are we blindly trusting AI companies with our data?

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV

[2603.16629] MLLM-based Textual Explanations for Face Comparison

[2603.15159] To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

No comments

Stay updated with AI News