[2602.15689] A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Summary
This paper presents a content-based framework for cybersecurity refusal decisions in large language models, arguing that a refusal policy should explicitly weigh the offensive risk of a response against its defensive benefit.
Why It Matters
Large language models are increasingly used for cybersecurity tasks, many of which are dual-use, so robust refusal policies are essential. The framework addresses inconsistencies in current systems by basing decisions on the technical content of a request rather than on its broad topic or inferred intent. This matters most in dual-use scenarios, where the same information can serve attackers and defenders.
Key Takeaways
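- Current refusal policies, which often rely on broad topic-based bans or offensive-focused taxonomies, can yield inconsistent decisions and over-restrict legitimate defenders.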
- A content-based approach can improve the reliability of refusal decisions.
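- The framework evaluates requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users.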
- Explicit modeling of offense-defense trade-offs enhances policy effectiveness.
- Organizations can create tunable, risk-aware refusal policies using this framework.
Computer Science > Computation and Language
arXiv:2602.15689 (cs) [Submitted on 17 Feb 2026]
Title: A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Authors: Meirav Segal, Noa Linder, Omer Antverg, Gil Gekker, Tomer Fichman, Omri Bodenheimer, Edan Maor, Omer Nevo
Abstract: Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical sub...
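The abstract names the five dimensions but, in this excerpt, does not say how they are scored or combined. The sketch below is a hypothetical illustration of what a tunable, risk-aware policy over these dimensions could look like. The 0 to 4 ordinal scale, the linear weighted score, the default weights, the threshold, the example ratings, and every identifier (`RequestProfile`, `RefusalPolicy`) are assumptions for illustration, not the authors' method. In particular, treating higher Technical Complexity as adding offensive uplift is an assumption; flip the sign of `complexity_weight` if your reading of the paper differs.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale for every dimension: 0 (none) .. 4 (very high).
SCALE_MIN, SCALE_MAX = 0, 4


@dataclass(frozen=True)
class RequestProfile:
    """Ratings for the five dimensions named in the abstract (scale is an assumption)."""
    offensive_action_contribution: int
    offensive_risk: int
    technical_complexity: int
    defensive_benefit: int
    legit_user_frequency: int

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed ordinal scale.
        for name, value in vars(self).items():
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name}={value} outside {SCALE_MIN}..{SCALE_MAX}")


@dataclass(frozen=True)
class RefusalPolicy:
    """Hypothetical linear policy: weights and threshold are tuning knobs, not paper values."""
    oac_weight: float = 1.0
    risk_weight: float = 1.0
    complexity_weight: float = 1.0  # assumption: more complex content means more uplift
    benefit_weight: float = 1.0
    frequency_weight: float = 1.0
    threshold: float = 0.0

    def score(self, p: RequestProfile) -> float:
        # Offense terms push toward refusal; defense terms push toward answering.
        offense = (self.oac_weight * p.offensive_action_contribution
                   + self.risk_weight * p.offensive_risk
                   + self.complexity_weight * p.technical_complexity)
        defense = (self.benefit_weight * p.defensive_benefit
                   + self.frequency_weight * p.legit_user_frequency)
        return offense - defense

    def decide(self, p: RequestProfile) -> str:
        # Refuse only when net offensive pressure exceeds the tunable threshold.
        return "refuse" if self.score(p) > self.threshold else "answer"


if __name__ == "__main__":
    # Illustrative profiles; the ratings are invented for this example.
    log_triage = RequestProfile(0, 1, 1, 3, 4)     # e.g. explaining suspicious auth logs
    exploit_chain = RequestProfile(4, 4, 3, 1, 0)  # e.g. a working end-to-end exploit
    default = RefusalPolicy()
    strict = RefusalPolicy(threshold=-6.0)          # a more conservative setting
    for name, p in [("log_triage", log_triage), ("exploit_chain", exploit_chain)]:
        print(name, default.score(p), default.decide(p), strict.decide(p))
```

Under the default policy the log-triage request scores -5.0 and is answered, while the exploit-chain request scores 10.0 and is refused; lowering the threshold to -6.0 also refuses the log-triage request. A linear score keeps each trade-off visible as a single weight, which helps with auditing, but a real policy might need interaction terms (for example, counting Offensive Risk only when Offensive Action Contribution is nonzero) that a weighted sum cannot express.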