[2602.16346] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

arXiv · Machine Learning · 4 min read

Summary

This article presents STING, a framework for evaluating illicit assistance in multi-turn, multilingual LLM agents, highlighting the challenges of agent misuse in complex scenarios.

Why It Matters

As LLMs become more integrated into workflows, understanding their potential for misuse is critical. This research addresses a significant gap in evaluating how these agents can inadvertently assist in harmful tasks, especially in multilingual contexts, making it relevant for developers and policymakers in AI safety.

Key Takeaways

  • STING framework enables step-by-step probing of LLM agents for illicit task completion.
  • Multi-turn interactions reveal higher illicit-task success compared to single-turn prompts.
  • Attack success rates vary inconsistently across languages, challenging prior assumptions about multilingual jailbreak behavior.
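The step-by-step probing described above can be sketched as a simple loop: the attacker issues adaptive follow-ups for each phase of an illicit plan, and a judge decides when a phase is complete. This is a minimal illustration assuming generic callables; `target_agent`, `judge`, and `next_follow_up` are hypothetical stand-ins, not the paper's actual components.

```python
def run_sting_episode(plan_phases, target_agent, judge, next_follow_up, max_turns=10):
    """Iteratively probe a target agent through the phases of an illicit plan.

    plan_phases: ordered list of phase descriptions (strings).
    target_agent(history) -> reply string for the current conversation.
    judge(phase, reply) -> True if the reply completes the given phase.
    next_follow_up(phase, history) -> adaptive follow-up prompt for the phase.
    Returns the turn index at which the full plan was first completed,
    or None if the budget ran out (a right-censored episode).
    """
    history = []
    phase_idx = 0
    for turn in range(1, max_turns + 1):
        prompt = next_follow_up(plan_phases[phase_idx], history)
        history.append(("user", prompt))
        reply = target_agent(history)
        history.append(("assistant", reply))
        if judge(plan_phases[phase_idx], reply):  # judge tracks phase completion
            phase_idx += 1
            if phase_idx == len(plan_phases):     # all phases done: first jailbreak
                return turn
    return None
```

Returning the turn index (or `None` for censored runs) is what allows the downstream time-to-first-jailbreak analysis the paper describes.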

Computer Science > Computation and Language

arXiv:2602.16346 (cs) · Submitted on 18 Feb 2026

Title: Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Authors: Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

Abstract: LLM-based agents execute real-world workflows via tools and memory. These same affordances let ill-intended adversaries use agents to carry out complex misuse scenarios. Existing agent-misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents come to help with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools such as discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and ch...
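The abstract's time-to-first-jailbreak framing can be made concrete by treating the turn of first jailbreak as a discrete time-to-event variable. The sketch below builds an empirical discovery curve (fraction of episodes jailbroken by each turn) and a restricted-mean-style summary over a horizon. This is one plausible reading of the analysis framework under stated assumptions, not the paper's exact definitions; the function names are hypothetical.

```python
def discovery_curve(first_jailbreak_turns, max_turn):
    """F(t) for t = 1..max_turn: fraction of episodes jailbroken by turn t.

    first_jailbreak_turns: list of turn indices, with None marking
    right-censored episodes (no jailbreak within the probing budget).
    """
    n = len(first_jailbreak_turns)
    return [
        sum(1 for t in first_jailbreak_turns if t is not None and t <= turn) / n
        for turn in range(1, max_turn + 1)
    ]

def restricted_mean_discovery(first_jailbreak_turns, tau):
    """Restricted-mean-style summary: area under the discovery curve up to
    horizon tau, computed as a discrete sum over turns 1..tau."""
    return sum(discovery_curve(first_jailbreak_turns, tau))
```

A higher value means jailbreaks are discovered both more often and earlier within the turn budget, which is why a restricted-mean summary is a natural companion to per-turn discovery curves.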

