[2602.20628] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

[2602.20628] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

arXiv - AI 4 min read Article

Summary

This paper explores the challenges of ensuring safety in AI systems using untrusted monitoring. It develops a taxonomy of collusion strategies that misaligned AIs may use and presents a safety case sketch to evaluate these strategies.

Why It Matters

As AI systems become more autonomous, the risk of misalignment increases, making safety critical. This research provides a framework for assessing the safety of untrusted monitoring, which is essential for developing reliable AI systems that can mitigate potential harms.

Key Takeaways

  • Untrusted monitoring can help mitigate risks from misaligned AIs.
  • The paper introduces a taxonomy of collusion strategies that AIs may employ.
  • A safety case sketch is developed to present safety arguments and assumptions.
  • Conditions under which passive self-recognition is effective are identified.
  • The research aims to enhance evaluations of untrusted monitoring systems.

Computer Science > Artificial Intelligence arXiv:2602.20628 (cs) [Submitted on 24 Feb 2026] Title:When can we trust untrusted monitoring? A safety case sketch across collusion strategies Authors:Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, Charlie Griffin View a PDF of the paper titled When can we trust untrusted monitoring? A safety case sketch across collusion strategies, by Nelson Gardner-Challis and 6 other authors View PDF HTML (experimental) Abstract:AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state ou...

Related Articles

Llms

Study: LLMs Able to De-Anonymize User Accounts on Reddit, Hacker News & Other "Pseudonymous" Platforms; Report Co-Author Expands, Advises

Advice from the study's co-author: "Be aware that it’s not any single post that identifies you, but the combination of small details acro...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

[D] Best websites for pytorch/numpy interviews

Hello, I’m at the last year of my PHD and I’m starting to prepare interviews. I’m mainly aiming at applied scientist/research engineer or...

Reddit - Machine Learning · 1 min ·
Llms

[P] Remote sensing foundation models made easy to use.

This project enables the idea of tasking remote sensing models to acquire embeddings like we task satellites to acquire data! https://git...

Reddit - Machine Learning · 1 min ·
Machine Learning

Can AI truly be creative?

AI has no imagination. “Creativity is the ability to generate novel and valuable ideas or works through the exercise of imagination” http...

Reddit - Artificial Intelligence · 1 min ·
More in Machine Learning: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime