[2602.16520] Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

arXiv - AI · 3 min read

Summary

The paper presents RLM-JB, a framework that uses Recursive Language Models (RLMs) to detect jailbreak prompts aimed at large language models, strengthening defenses against sophisticated, obfuscated attacks.

Why It Matters

As large language models become more prevalent, the threat of jailbreak prompts poses significant risks to their integrity. RLM-JB offers a procedural defense that enhances detection capabilities, making it crucial for developers and researchers focused on AI safety and security.

Key Takeaways

  • RLM-JB framework effectively detects jailbreak prompts in LLMs.
  • Treats detection as a multi-step procedure (normalization, chunking, parallel screening, aggregation) rather than a one-shot classification.
  • Achieves high detection effectiveness with low false positive rates.

Computer Science > Cryptography and Security · arXiv:2602.16520 (cs) · Submitted on 18 Feb 2026

Title: Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Author: Doron Shavit

Abstract: Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0....
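The abstract's procedural pipeline (normalize and de-obfuscate, chunk for coverage, screen chunks in parallel, aggregate cross-chunk signals) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `worker_score` stands in for a call to a worker LLM and here uses a simple keyword heuristic, and all thresholds and chunk sizes are assumptions.

```python
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor

def normalize(text: str) -> str:
    """De-obfuscate: Unicode-normalize, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)  # strip zero-width chars
    return re.sub(r"\s+", " ", text).strip().lower()

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Overlapping chunks guarantee full coverage and limit context dilution."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def worker_score(segment: str) -> float:
    """Stand-in for a worker-model query; a keyword heuristic for illustration only."""
    markers = ["ignore previous instructions", "jailbreak", "developer mode"]
    return 1.0 if any(m in segment for m in markers) else 0.0

def detect(prompt: str, threshold: float = 0.5) -> dict:
    """Root procedure: normalize -> chunk -> parallel screening -> aggregation."""
    norm = normalize(prompt)
    segments = chunk(norm)
    with ThreadPoolExecutor() as pool:
        chunk_scores = list(pool.map(worker_score, segments))
    # Cross-chunk signal: a split payload may only trigger on the recombined text.
    joined_score = worker_score(norm)
    verdict = max(max(chunk_scores, default=0.0), joined_score) >= threshold
    return {"jailbreak": verdict, "chunk_scores": chunk_scores}
```

The cross-chunk aggregation step is what the abstract credits with recovering split-payload attacks: a payload distributed across segments (here simulated by a zero-width character splitting a trigger phrase) scores low per chunk but is caught once the normalized text is evaluated as a whole.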
