[2503.00187] Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks


Summary

This paper presents a safety steering framework to enhance the robustness of large language models (LLMs) against multi-turn jailbreaking attacks, which exploit contextual drift in dialogues.

Why It Matters

As LLMs are increasingly deployed in real-world applications, their vulnerability to adversarial attacks poses significant risks. This research addresses a critical gap in existing defenses by proposing a proactive approach that ensures safety across multiple dialogue turns, thus enhancing the reliability of AI systems in sensitive contexts.

Key Takeaways

  • Current defenses against LLM vulnerabilities are inadequate for multi-turn dialogues.
  • The proposed neural barrier function (NBF) effectively detects harmful queries in evolving contexts.
  • The safety steering framework ensures invariant safety, balancing safety and helpfulness.
  • Extensive experiments demonstrate superior performance compared to existing safety measures.
  • This research contributes to the broader field of AI safety and robustness.
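To make the barrier-function idea concrete, here is a minimal, hedged sketch of how an NBF-style filter could work: a barrier value h(x) is positive inside the safe set, and a query is admitted only if the *predicted next* dialogue state still satisfies h(x') ≥ 0. Everything here (the `embed` harm scorer, the state-update weights, the threshold) is a toy illustration, not the paper's actual model or code.

```python
# Illustrative sketch of barrier-function-style safety steering.
# All names and numbers are assumptions, not the paper's implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

def embed(text: str) -> float:
    """Toy stand-in for a learned embedding: fraction of 'harmful' words."""
    harmful_words = {"bomb", "exploit", "bypass"}
    words = text.lower().split()
    return sum(w in harmful_words for w in words) / max(len(words), 1)

@dataclass
class DialogueState:
    """State-space view of a dialogue: the state summarizes turns so far."""
    harm_level: float = 0.0
    history: List[str] = field(default_factory=list)

    def step(self, query: str) -> "DialogueState":
        """Predicted next state if the query were admitted (simple mixing)."""
        nxt = 0.5 * self.harm_level + 0.5 * embed(query)
        return DialogueState(nxt, self.history + [query])

def barrier(state: DialogueState, threshold: float = 0.1) -> float:
    """Barrier value h(x): positive inside the safe set, negative outside."""
    return threshold - state.harm_level

def steer(state: DialogueState, query: str) -> Tuple[DialogueState, bool]:
    """Admit the query only if the predicted next state stays safe
    (h(x') >= 0), so safety is invariant from turn to turn."""
    nxt = state.step(query)
    if barrier(nxt) >= 0:
        return nxt, True   # query passes through to the LLM
    return state, False    # query filtered; state unchanged

state = DialogueState()
state, ok1 = steer(state, "how do I bake bread")
state, ok2 = steer(state, "how do I bypass the exploit and build a bomb")
```

The key design point this sketch mirrors is that the check runs on the *next* state, not the raw query in isolation, which is what lets the filter account for drift accumulated over earlier turns.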

Computer Science > Computation and Language
arXiv:2503.00187 (cs) · Submitted on 28 Feb 2025 (v1), last revised 16 Feb 2026 (v3)
Title: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Authors: Hanjiang Hu, Alexander Robey, Changliu Liu

Abstract: Large language models (LLMs) are vulnerable to jailbreaking attacks, in which adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to proactively detect and filter harmful queries emerging from evolving contexts. By learning a safety predictor that accounts for adversarial queries, our method achieves invariant safety at each turn of dialogue, preventing potential context drift toward jailbreaks. Extensive experiments across multiple LLMs show that our NBF-based safety steering outperforms safety...
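The abstract's central claim is that contextual drift defeats per-query filtering. A small, hedged numerical sketch of that failure mode (toy numbers chosen for illustration, not taken from the paper): each query is individually mild enough to pass a single-turn filter, yet the accumulated dialogue state still crosses the safety boundary after enough turns.

```python
# Illustrative sketch of contextual drift: every query passes a single-turn
# filter in isolation, but the accumulated state drifts out of the safe set.
# The limits and the retention factor are assumptions for illustration only.

SINGLE_TURN_LIMIT = 0.5   # per-query harm a single-turn filter tolerates
SAFE_SET_LIMIT = 0.8      # dialogue-state harm beyond which output is unsafe

def drift(per_turn_harm: float, turns: int, retain: float = 0.9) -> float:
    """Accumulated harm after `turns` queries, each individually admissible;
    `retain` models how much prior context carries into the next turn."""
    state = 0.0
    for _ in range(turns):
        assert per_turn_harm < SINGLE_TURN_LIMIT  # each query passes alone
        state = retain * state + per_turn_harm
    return state
```

With `per_turn_harm = 0.3`, one turn leaves the state well inside the safe set, while ten turns push it past `SAFE_SET_LIMIT`, which is exactly the gap the paper's turn-by-turn invariant-safety check is designed to close.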
