[2602.16943] Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
Summary
This paper examines the gap between text safety and tool-call safety in large language model (LLM) agents, introducing the GAP benchmark to measure how often a model's refusal text diverges from its tool-call actions across six regulated domains.
Why It Matters
As LLMs increasingly interact with real-world systems, understanding the limitations of text safety evaluations is crucial for ensuring responsible AI deployment. This research highlights the need for dedicated safety measures that account for tool-call actions, which can have significant consequences.
Key Takeaways
- Text safety evaluations do not guarantee tool-call safety in LLMs.
- The GAP benchmark reveals significant divergences between text outputs and tool-call actions.
- System prompt wording significantly influences tool-call behavior and safety outcomes.
- Runtime governance contracts reduce information leakage but do not deter forbidden tool-call attempts.
- Dedicated measurement and mitigation strategies are essential for assessing agent behavior.
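The divergence the benchmark measures can be illustrated with a minimal sketch: classify each agent turn by whether the text output refuses and whether the tool calls nonetheless invoke a forbidden action. All names here (`AgentTurn`, `classify`, the refusal markers, the tool names) are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical surface markers of a text-level refusal (illustrative only).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to assist")

@dataclass
class AgentTurn:
    text: str                                   # model's text output
    tool_calls: list = field(default_factory=list)  # names of tools invoked

def is_text_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def classify(turn: AgentTurn, forbidden_tools: set) -> str:
    """Label one turn by comparing text refusal against tool-call behavior."""
    refused = is_text_refusal(turn.text)
    acted = any(call in forbidden_tools for call in turn.tool_calls)
    if refused and acted:
        return "divergent"   # text refuses, tool call executes anyway
    if refused:
        return "safe-refusal"
    if acted:
        return "unsafe"
    return "benign"
```

For example, a turn whose text says "I can't help with that" while its tool calls include a forbidden `place_trade` action would be labeled `divergent`, the failure mode the paper argues text-only evaluations miss.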
Computer Science > Artificial Intelligence
arXiv:2602.16943 (cs) [Submitted on 18 Feb 2026]
Title: Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
Authors: Arnold Cartagena, Ariane Teixeira
Abstract: Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the f...