[2602.16520] Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Summary
The paper presents RLM-JB, a framework that uses Recursive Language Models (RLMs) to detect jailbreak prompts targeting large language models, strengthening defenses against long-context, camouflaged, and obfuscated attacks.
Why It Matters
As large language models are deployed more widely, especially in agentic systems that execute tools over untrusted content, jailbreak prompts pose a practical and evolving threat. RLM-JB offers a procedural defense that improves detection of attacks evading single-pass guardrails, making it relevant for developers and researchers focused on AI safety and security.
Key Takeaways
- RLM-JB framework effectively detects jailbreak prompts in LLMs.
- Utilizes a procedural approach for enhanced analysis and decision-making.
- Achieves high detection effectiveness (ASR/Recall 92.5–98.0%) with very high precision (98.99–100%) and low false positive rates.
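The procedural, root-and-worker structure described above can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the names (`Verdict`, `worker_screen`, `root_analyze`), the keyword-based scorer, and the threshold are all hypothetical stand-ins for the LLM-backed components the paper describes.

```python
# Hypothetical sketch of RLM-style orchestration: a root routine normalizes
# the input, splits it into segments, queries worker screeners, and
# aggregates their evidence into an auditable decision.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    flagged: bool
    evidence: list = field(default_factory=list)  # (chunk index, score) pairs

def worker_screen(segment: str) -> float:
    """Stand-in worker: returns a suspicion score for one segment.
    A real system would query an LLM backend here."""
    suspicious_markers = ("ignore previous", "system prompt", "jailbreak")
    return float(any(m in segment.lower() for m in suspicious_markers))

def root_analyze(text: str, chunk_size: int = 200, threshold: float = 0.5) -> Verdict:
    """Root loop: normalize, chunk for coverage, screen each chunk
    (conceptually in parallel), and aggregate per-chunk evidence."""
    normalized = " ".join(text.split())  # trivial normalization stand-in
    chunks = [normalized[i:i + chunk_size]
              for i in range(0, len(normalized), chunk_size)] or [""]
    scores = [worker_screen(c) for c in chunks]
    evidence = [(i, s) for i, s in enumerate(scores) if s >= threshold]
    return Verdict(flagged=bool(evidence), evidence=evidence)
```

The returned evidence list is what makes the decision auditable: each flagged chunk can be traced back to the worker signal that triggered it.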
Computer Science > Cryptography and Security — arXiv:2602.16520 (cs) [Submitted on 18 Feb 2026]
Title: Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Authors: Doron Shavit
Abstract: Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5–98.0%) while maintaining very high precision (98.99–100%) and low false positive rates (0…
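The abstract's cross-chunk composition step, recovering payloads split across chunk boundaries, can be illustrated with a toy sketch. Everything here is hypothetical (the paper does not publish code): `screen` stands in for a per-chunk LLM screener, and the "composition" is simply re-screening the underlying text span covering each adjacent chunk pair.

```python
# Illustrative sketch: chunking guarantees coverage, but a payload split
# across a chunk boundary can evade per-chunk screening. Re-screening the
# span covering each adjacent pair of chunks recovers such splits
# (for payloads up to roughly two chunk lengths).

def chunks_with_spans(text: str, size: int):
    """Return (start offset, chunk text) pairs covering the whole input."""
    return [(i, text[i:i + size]) for i in range(0, max(len(text), 1), size)]

def screen(segment: str, payload: str) -> bool:
    """Stand-in single-segment screener: flags a segment containing the payload."""
    return payload in segment

def detect_split_payload(text: str, payload: str, size: int = 8) -> bool:
    """Per-chunk screening first, then cross-chunk composition over the
    original text span of each adjacent chunk pair."""
    spans = chunks_with_spans(text, size)
    if any(screen(chunk, payload) for _, chunk in spans):
        return True
    # Cross-chunk signal: screen each adjacent pair as one contiguous window.
    return any(screen(text[start:start + 2 * size], payload)
               for (start, _), _nxt in zip(spans, spans[1:]))
```

With `size=8` and the payload `"jailbreak"` split as `"xxxxxjai" + "lbreakyy"`, per-chunk screening misses it, while the paired window catches it.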