[2603.28590] MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
Computer Science > Artificial Intelligence
arXiv:2603.28590 (cs)
[Submitted on 30 Mar 2026]

Title: MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Authors: Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang

Abstract: Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT m...