[2602.13379] Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
Summary
This paper introduces MT-AgentRisk, a benchmark for evaluating safety risks in multi-turn interactions of tool-using agents. The evaluation reveals significant safety degradation in multi-turn settings, and the authors propose ToolShield, a novel defense mechanism.
Why It Matters
As AI agents become more capable, ensuring their safety during complex interactions is crucial. This research highlights the growing risks associated with multi-turn engagements and provides a framework for assessing and mitigating these risks, which is essential for the future of AI safety.
Key Takeaways
- Multi-turn interactions in AI agents introduce significant safety risks.
- The MT-AgentRisk benchmark reveals a 16% increase in attack success rates in multi-turn settings.
- ToolShield is a proposed defense mechanism that reduces attack success rates by 30% without requiring prior training.
- The study emphasizes the need for systematic safety evaluations in AI tools.
- This research contributes to the ongoing discourse on AI safety and responsible AI deployment.
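Beyond "training-free, tool-agnostic self-exploration," the summary gives no algorithmic detail on ToolShield. The sketch below only illustrates the general self-exploration idea, probing an unfamiliar tool in a dry-run mode before trusting it; every name and the risk heuristic are hypothetical, not the paper's method.

```python
# Hypothetical sketch of a self-exploration guard: before the agent uses an
# unfamiliar tool, it runs probe inputs through a sandboxed dry-run and
# refuses tools whose observed behavior matches simple risk patterns.
# This is NOT ToolShield's actual algorithm, only an illustration.

RISKY_PATTERNS = ("delete", "transfer", "exfiltrate", "rm -rf")

def explore_tool(tool_name, dry_run, probes):
    """Run probe inputs through the tool's dry-run mode and collect observations."""
    return [dry_run(tool_name, p) for p in probes]

def is_tool_safe(observations):
    """Flag the tool if any dry-run observation matches a risky pattern."""
    return not any(
        pat in obs.lower() for obs in observations for pat in RISKY_PATTERNS
    )

# Toy stub standing in for a sandboxed tool execution.
def fake_dry_run(tool_name, probe):
    return f"{tool_name} would {probe}"

obs = explore_tool("file_manager", fake_dry_run, ["list files", "delete all logs"])
print(is_tool_safe(obs))  # prints "False": the delete probe trips the heuristic
```

A real self-exploration defense would learn the tool's behavior rather than match fixed strings, but the control flow, probe first, then gate tool use on the result, is the relevant pattern.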
Computer Science > Cryptography and Security
arXiv:2602.13379 (cs)
[Submitted on 13 Feb 2026]
Title: Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
Authors: Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi
Abstract: LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously g...
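Attack Success Rate is simply the fraction of adversarial tasks on which the agent ends up executing the harmful action. A minimal sketch of how the single-turn vs. multi-turn comparison could be computed (function and field names are illustrative, not from the paper, and the numbers are toy values rather than the reported results):

```python
def attack_success_rate(results):
    """Fraction of attack attempts on which the harmful action was executed."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["harmful_action_executed"]) / len(results)

# Illustrative runs of the same tasks in single-turn vs. multi-turn form.
single_turn = [{"harmful_action_executed": s} for s in [True, False, False, False]]
multi_turn = [{"harmful_action_executed": s} for s in [True, True, False, False]]

degradation = attack_success_rate(multi_turn) - attack_success_rate(single_turn)
print(f"ASR increase in multi-turn setting: {degradation:.0%}")  # prints "25%"
```

The benchmark's reported 16% average increase is this same difference, averaged over open and closed models.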