[2602.20628] When can we trust untrusted monitoring? A safety case sketch across collusion strategies
Summary
This paper explores the challenges of ensuring safety in AI systems using untrusted monitoring. It develops a taxonomy of collusion strategies that misaligned AIs may use and presents a safety case sketch to evaluate these strategies.
Why It Matters
As AI systems become more autonomous, the risk of misalignment increases, making safety critical. This research provides a framework for assessing the safety of untrusted monitoring, which is essential for developing reliable AI systems that can mitigate potential harms.
Key Takeaways
- Untrusted monitoring can help mitigate risks from misaligned AIs.
- The paper introduces a taxonomy of collusion strategies that AIs may employ.
- A safety case sketch is developed to present safety arguments and assumptions.
- Conditions under which passive self-recognition is effective are identified.
- The research aims to enhance evaluations of untrusted monitoring systems.
Computer Science > Artificial Intelligence arXiv:2602.20628 (cs) [Submitted on 24 Feb 2026] Title:When can we trust untrusted monitoring? A safety case sketch across collusion strategies Authors:Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, Charlie Griffin View a PDF of the paper titled When can we trust untrusted monitoring? A safety case sketch across collusion strategies, by Nelson Gardner-Challis and 6 other authors View PDF HTML (experimental) Abstract:AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state ou...