Machine Learning Ai Infrastructure Ai Safety Ai Agents

[2602.20628] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

arXiv - AI February 25, 2026 4 min read Article

Summary

This paper explores the challenges of ensuring safety in AI systems using untrusted monitoring. It develops a taxonomy of collusion strategies that misaligned AIs may use and presents a safety case sketch to evaluate these strategies.

Why It Matters

As AI systems become more autonomous, the risk of misalignment increases, making safety critical. This research provides a framework for assessing the safety of untrusted monitoring, which is essential for developing reliable AI systems that can mitigate potential harms.

Key Takeaways

Untrusted monitoring can help mitigate risks from misaligned AIs.
The paper introduces a taxonomy of collusion strategies that AIs may employ.
A safety case sketch is developed to present safety arguments and assumptions.
Conditions under which passive self-recognition is effective are identified.
The research aims to enhance evaluations of untrusted monitoring systems.

Computer Science > Artificial Intelligence arXiv:2602.20628 (cs) [Submitted on 24 Feb 2026] Title:When can we trust untrusted monitoring? A safety case sketch across collusion strategies Authors:Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, Charlie Griffin View a PDF of the paper titled When can we trust untrusted monitoring? A safety case sketch across collusion strategies, by Nelson Gardner-Challis and 6 other authors View PDF HTML (experimental) Abstract:AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state ou...

Read Original Article

[2602.20628] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

Summary

Why It Matters

Key Takeaways

Related Articles

Study: LLMs Able to De-Anonymize User Accounts on Reddit, Hacker News & Other "Pseudonymous" Platforms; Report Co-Author Expands, Advises

[D] Best websites for pytorch/numpy interviews

[P] Remote sensing foundation models made easy to use.

Can AI truly be creative?

No comments

Stay updated with AI News