[2605.04785] AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
Computer Science > Artificial Intelligence

arXiv:2605.04785 (cs) [Submitted on 6 May 2026]

Title: AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
Authors: Chenglin Yang

Abstract: Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level acc...
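The abstract describes an interception layer that normalizes a tool call, applies rules, and returns one of four structured verdicts before execution. A minimal sketch of that flow is below; the paper's actual ruleset, component interfaces, and names other than the four verdicts are not given in the abstract, so `normalize`, `evaluate`, and `intercept` here are purely illustrative toy stand-ins, not AgentTrust's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Verdict(Enum):
    # The four structured verdicts named in the abstract.
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"

@dataclass
class Decision:
    verdict: Verdict
    reason: str
    safe_fix: Optional[str] = None  # safer alternative, in the spirit of SafeFix

def normalize(command: str) -> str:
    """Toy stand-in for the shell deobfuscation normalizer:
    collapse whitespace and lowercase so rules match trivially
    obfuscated variants of the same command."""
    return " ".join(command.split()).lower()

def evaluate(tool: str, command: str) -> Decision:
    """Hypothetical ruleset illustrating pre-execution classification."""
    cmd = normalize(command)
    if tool == "shell" and "rm -rf /" in cmd:
        return Decision(Verdict.BLOCK, "destructive filesystem operation")
    if tool == "shell" and cmd.startswith("rm "):
        return Decision(Verdict.WARN, "file deletion",
                        safe_fix=command.replace("rm ", "rm -i ", 1))
    if tool == "http" and "api_key" in cmd:
        return Decision(Verdict.REVIEW, "possible credential exposure")
    return Decision(Verdict.ALLOW, "no rule matched")

def intercept(tool: str, command: str,
              execute: Callable[[str], str]) -> str:
    """Run the tool call only if the verdict permits it; a blocked
    call never reaches the executor."""
    decision = evaluate(tool, command)
    if decision.verdict is Verdict.BLOCK:
        raise PermissionError(f"blocked: {decision.reason}")
    return execute(command)
```

Note that the normalizer runs before rule matching, so an obfuscated `rm   -RF  /` is classified the same as the canonical form; this ordering is the point of pairing deobfuscation with a static ruleset.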