[2602.16346] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

arXiv · Machine Learning · 4 min read

Summary

This article presents STING, a framework for evaluating illicit assistance in multi-turn, multilingual LLM agents, highlighting the challenges of agent misuse in complex scenarios.

Why It Matters

As LLMs become more integrated into workflows, understanding their potential for misuse is critical. This research addresses a significant gap in evaluating how these agents can inadvertently assist in harmful tasks, especially in multilingual contexts, making it relevant for developers and policymakers in AI safety.

Key Takeaways

  • STING framework enables step-by-step probing of LLM agents for illicit task completion.
  • Multi-turn interactions reveal higher illicit-task success compared to single-turn prompts.
  • Attack success rates vary inconsistently across languages, challenging prior assumptions about multilingual jailbreak behavior.
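The step-by-step probing described above can be sketched as a simple loop: the attacker issues adaptive follow-ups for each phase of an illicit plan, and a judge decides when a phase is complete. This is a minimal illustration assuming generic callables; `target_agent`, `judge`, and `next_follow_up` are hypothetical stand-ins, not the paper's actual components.

```python
def run_sting_episode(plan_phases, target_agent, judge, next_follow_up, max_turns=10):
    """Iteratively probe a target agent through the phases of an illicit plan.

    plan_phases: ordered list of phase descriptions (strings).
    target_agent(history) -> reply string for the current conversation.
    judge(phase, reply) -> True if the reply completes the given phase.
    next_follow_up(phase, history) -> adaptive follow-up prompt for the phase.
    Returns the turn index at which the full plan was first completed,
    or None if the budget ran out (a right-censored episode).
    """
    history = []
    phase_idx = 0
    for turn in range(1, max_turns + 1):
        prompt = next_follow_up(plan_phases[phase_idx], history)
        history.append(("user", prompt))
        reply = target_agent(history)
        history.append(("assistant", reply))
        if judge(plan_phases[phase_idx], reply):  # judge tracks phase completion
            phase_idx += 1
            if phase_idx == len(plan_phases):     # all phases done: first jailbreak
                return turn
    return None
```

Returning the turn index (or `None` for censored runs) is what allows the downstream time-to-first-jailbreak analysis the paper describes.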

Computer Science > Computation and Language

arXiv:2602.16346 (cs) · Submitted on 18 Feb 2026

Title: Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Authors: Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

Abstract: LLM-based agents execute real-world workflows via tools and memory. These same affordances let ill-intended adversaries use agents to carry out complex misuse scenarios. Existing agent-misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents come to help with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools such as discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and ch...
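The abstract's time-to-first-jailbreak framing can be made concrete by treating the turn of first jailbreak as a discrete time-to-event variable. The sketch below builds an empirical discovery curve (fraction of episodes jailbroken by each turn) and a restricted-mean-style summary over a horizon. This is one plausible reading of the analysis framework under stated assumptions, not the paper's exact definitions; the function names are hypothetical.

```python
def discovery_curve(first_jailbreak_turns, max_turn):
    """F(t) for t = 1..max_turn: fraction of episodes jailbroken by turn t.

    first_jailbreak_turns: list of turn indices, with None marking
    right-censored episodes (no jailbreak within the probing budget).
    """
    n = len(first_jailbreak_turns)
    return [
        sum(1 for t in first_jailbreak_turns if t is not None and t <= turn) / n
        for turn in range(1, max_turn + 1)
    ]

def restricted_mean_discovery(first_jailbreak_turns, tau):
    """Restricted-mean-style summary: area under the discovery curve up to
    horizon tau, computed as a discrete sum over turns 1..tau."""
    return sum(discovery_curve(first_jailbreak_turns, tau))
```

A higher value means jailbreaks are discovered both more often and earlier within the turn budget, which is why a restricted-mean summary is a natural companion to per-turn discovery curves.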

