[2604.01438] ClawSafety: "Safe" LLMs, Unsafe Agents
Computer Science > Artificial Intelligence

arXiv:2604.01438 (cs)

[Submitted on 1 Apr 2026 (v1), last revised 4 Apr 2026 (this version, v2)]

Title: ClawSafety: "Safe" LLMs, Unsafe Agents

Authors: Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge

Abstract: Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR...
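To make the abstract's taxonomy concrete, below is a minimal sketch of how such a benchmark's test cases might be indexed. Only the three injection channels and the five professional workspaces are taken from the abstract; the example dimension values, backbone model names, and all identifiers are hypothetical placeholders, not the paper's actual configuration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product

# The three injection channels named in the abstract.
class Channel(Enum):
    SKILL_FILE = "workspace skill file"
    EMAIL = "email from trusted sender"
    WEB_PAGE = "web page"

# The five professional workspaces named in the abstract.
WORKSPACES = ["software engineering", "finance", "healthcare", "law", "devops"]

@dataclass(frozen=True)
class Scenario:
    """One adversarial test case, indexed by the paper's three dimensions
    plus the channel carrying the injected content and the workspace."""
    harm_domain: str    # e.g. "credential leakage" (hypothetical value)
    attack_vector: str  # e.g. "indirect prompt injection" (hypothetical value)
    action_type: str    # e.g. "destructive file operation" (hypothetical value)
    channel: Channel
    workspace: str

# A sandboxed trial pairs a scenario with one of five backbone LLMs;
# these names are placeholders, not the models evaluated in the paper.
BACKBONES = ["model-a", "model-b", "model-c", "model-d", "model-e"]

def trial_grid(scenarios: list[Scenario]) -> list[tuple[Scenario, str]]:
    """Enumerate every (scenario, backbone) configuration to run in the sandbox."""
    return [(s, m) for s, m in product(scenarios, BACKBONES)]
```

Under this framing, an attack success rate (ASR) would be computed per cell of the grid, e.g. the fraction of trials for a given (channel, backbone) pair in which the agent executed the injected harmful action.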