[2602.21496] Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
Summary
The paper probes the limits of agentic self-correction in Large Language Models (LLMs) for Semantic Sensitive Information (SemSI), introducing SemSIEdit, an inference-time framework that reduces sensitive-information leakage while preserving utility.
Why It Matters
As LLMs become more prevalent, understanding their ability to handle sensitive information is crucial for privacy and ethical AI deployment. This research addresses the balance between reducing sensitive information leakage and preserving the model's utility, which is vital for developers and researchers in AI safety.
Key Takeaways
- Introduces SemSIEdit, an inference-time framework in which an agentic "Editor" critiques and rewrites sensitive spans rather than refusing to answer.
- Achieves a 34.6% reduction in sensitive information leakage with a marginal utility loss of 9.8%.
- Identifies a Privacy-Utility Pareto Frontier for balancing safety and model performance.
- Reveals a Scale-Dependent Safety Divergence in LLMs based on their capacity.
- Highlights a Reasoning Paradox where deeper inferences can both increase risk and enable safer rewrites.
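The Privacy-Utility Pareto Frontier named above is the standard multi-objective notion: an operating point is on the frontier if no other point is at least as good on both leakage and utility loss and strictly better on one. A minimal sketch of that computation in Python, using hypothetical (leakage %, utility-loss %) operating points rather than the paper's data:

```python
def pareto_frontier(points):
    """Return the points not dominated on (leakage, utility_loss).

    A point q dominates p if q is <= p in both coordinates; for distinct
    tuples this implies q is strictly better in at least one coordinate.
    Lower is better on both axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical defense configurations, NOT results from the paper:
# (residual leakage %, utility loss %)
configs = [(65.4, 9.8), (80.0, 2.0), (60.0, 25.0), (85.0, 12.0), (70.0, 8.0)]
print(pareto_frontier(configs))
```

Here (85.0, 12.0) is dominated by (80.0, 2.0) and drops out; the remaining points each trade leakage against utility loss and survive, which is exactly the trade-off curve the takeaway describes.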
Paper Details
Computer Science > Artificial Intelligence, arXiv:2602.21496 (cs)
Submitted on 25 Feb 2026
Title: Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
Authors: Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu
Abstract
While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained model...
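The abstract describes an agentic "Editor" that iteratively critiques and rewrites sensitive spans until the text passes review. A minimal sketch of that control loop, assuming a toy regex-based critic and rewriter standing in for the paper's LLM components (the pattern list and function names are illustrative, not the authors' implementation):

```python
import re

# Hypothetical sensitive patterns; in SemSIEdit the critic is an LLM,
# not a regex, and flags semantically inferred attributes, not just PII.
SENSITIVE_PATTERNS = {
    r"\b\d{3}-\d{2}-\d{4}\b": "[REDACTED-ID]",  # SSN-like identifier
    r"\blives at [^.,]+": "lives in the area",  # overly specific location
}

def critique(text):
    """Critic pass: return (pattern, replacement) pairs for flagged spans."""
    return [(p, r) for p, r in SENSITIVE_PATTERNS.items() if re.search(p, text)]

def rewrite(text, findings):
    """Editor pass: rewrite flagged spans in place, keeping narrative flow."""
    for pattern, replacement in findings:
        text = re.sub(pattern, replacement, text)
    return text

def semsi_edit(text, max_rounds=3):
    """Iterate critique -> rewrite until no spans are flagged or budget runs out."""
    for _ in range(max_rounds):
        findings = critique(text)
        if not findings:
            break
        text = rewrite(text, findings)
    return text

draft = "The subject lives at 12 Oak Street, and his ID is 123-45-6789."
print(semsi_edit(draft))
# -> The subject lives in the area, and his ID is [REDACTED-ID].
```

The key design point, per the abstract, is that the loop rewrites rather than refuses: the surrounding sentence survives, so utility degrades only marginally while the flagged spans are neutralized.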