[2602.13379] Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
Summary
This paper introduces MT-AgentRisk, a benchmark for evaluating safety risks in multi-turn interactions of tool-using agents. The evaluation reveals significant safety degradation in multi-turn settings, and the authors propose ToolShield, a novel defense mechanism.
Why It Matters
As AI agents become more capable, ensuring their safety during complex interactions is crucial. This research highlights the growing risks associated with multi-turn engagements and provides a framework for assessing and mitigating these risks, which is essential for the future of AI safety.
Key Takeaways
- Multi-turn interactions in AI agents introduce significant safety risks.
- The MT-AgentRisk benchmark reveals a 16% increase in attack success rates in multi-turn settings.
- ToolShield is a proposed defense mechanism that reduces attack success rates by 30% without requiring prior training.
- The study emphasizes the need for systematic safety evaluations in AI tools.
- This research contributes to the ongoing discourse on AI safety and responsible AI deployment.
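Beyond "training-free, tool-agnostic self-exploration," the summary gives no algorithmic detail on ToolShield. The sketch below only illustrates the general self-exploration idea, probing an unfamiliar tool in a dry-run mode before trusting it; every name and the risk heuristic are hypothetical, not the paper's method.

```python
# Hypothetical sketch of a self-exploration guard: before the agent uses an
# unfamiliar tool, it runs probe inputs through a sandboxed dry-run and
# refuses tools whose observed behavior matches simple risk patterns.
# This is NOT ToolShield's actual algorithm, only an illustration.

RISKY_PATTERNS = ("delete", "transfer", "exfiltrate", "rm -rf")

def explore_tool(tool_name, dry_run, probes):
    """Run probe inputs through the tool's dry-run mode and collect observations."""
    return [dry_run(tool_name, p) for p in probes]

def is_tool_safe(observations):
    """Flag the tool if any dry-run observation matches a risky pattern."""
    return not any(
        pat in obs.lower() for obs in observations for pat in RISKY_PATTERNS
    )

# Toy stub standing in for a sandboxed tool execution.
def fake_dry_run(tool_name, probe):
    return f"{tool_name} would {probe}"

obs = explore_tool("file_manager", fake_dry_run, ["list files", "delete all logs"])
print(is_tool_safe(obs))  # prints "False": the delete probe trips the heuristic
```

A real self-exploration defense would learn the tool's behavior rather than match fixed strings, but the control flow, probe first, then gate tool use on the result, is the relevant pattern.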
Computer Science > Cryptography and Security
arXiv:2602.13379 (cs)
[Submitted on 13 Feb 2026]
Title: Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
Authors: Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi
Abstract: LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously g...
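Attack Success Rate is simply the fraction of adversarial tasks on which the agent ends up executing the harmful action. A minimal sketch of how the single-turn vs. multi-turn comparison could be computed (function and field names are illustrative, not from the paper, and the numbers are toy values rather than the reported results):

```python
def attack_success_rate(results):
    """Fraction of attack attempts on which the harmful action was executed."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["harmful_action_executed"]) / len(results)

# Illustrative runs of the same tasks in single-turn vs. multi-turn form.
single_turn = [{"harmful_action_executed": s} for s in [True, False, False, False]]
multi_turn = [{"harmful_action_executed": s} for s in [True, True, False, False]]

degradation = attack_success_rate(multi_turn) - attack_success_rate(single_turn)
print(f"ASR increase in multi-turn setting: {degradation:.0%}")  # prints "25%"
```

The benchmark's reported 16% average increase is this same difference, averaged over open and closed models.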