[2602.21496] Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

arXiv - AI · 3 min read · Article

Summary

The paper explores the limits of self-correction in Large Language Models (LLMs) for Semantic Sensitive Information (SemSI), introducing SemSIEdit, a framework that mitigates leakage risk while preserving model utility.

Why It Matters

As LLMs become more prevalent, understanding how well they handle sensitive information is crucial for privacy and ethical AI deployment. This research quantifies the trade-off between reducing sensitive-information leakage and preserving the model's utility, a question central to AI-safety developers and researchers.

Key Takeaways

  • Introduces SemSIEdit, a framework for agentic self-correction in LLMs.
  • Achieves a 34.6% reduction in sensitive information leakage with a marginal utility loss of 9.8%.
  • Identifies a Privacy-Utility Pareto Frontier for balancing safety and model performance.
  • Reveals a Scale-Dependent Safety Divergence in LLMs based on their capacity.
  • Highlights a Reasoning Paradox where deeper inferences can both increase risk and enable safer rewrites.

Computer Science > Artificial Intelligence · arXiv:2602.21496 (cs) · Submitted on 25 Feb 2026

Title: Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Authors: Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu

Abstract: While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained model...
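The abstract describes an agentic "Editor" that iteratively critiques and rewrites sensitive spans rather than refusing outright. A minimal sketch of such a critique-rewrite loop, with hypothetical regex-based stand-ins for what would be LLM calls in the actual SemSIEdit framework:

```python
import re

# Hypothetical stand-ins for LLM calls: a critic that flags sensitive
# spans and a rewriter that generalizes them. The real SemSIEdit editor
# would back both steps with model inference, not regexes.

def critique(text):
    """Return a list of sensitive spans found in the text (toy heuristic)."""
    patterns = [
        r"\b\d{3}-\d{2}-\d{4}\b",   # SSN-like identifier
        r"\blives at [^.,]+",       # inferred home address
    ]
    spans = []
    for pat in patterns:
        spans.extend(m.group(0) for m in re.finditer(pat, text))
    return spans

def rewrite(text, span):
    """Replace a flagged span with a neutral paraphrase, keeping narrative flow."""
    return text.replace(span, "[generalized detail]")

def agentic_edit(text, max_rounds=3):
    """Iteratively critique and rewrite until no spans remain or the budget runs out."""
    for _ in range(max_rounds):
        spans = critique(text)
        if not spans:
            break
        for span in spans:
            text = rewrite(text, span)
    return text

draft = "The user lives at 12 Oak Street, and his SSN is 123-45-6789."
print(agentic_edit(draft))
```

The loop structure (critique, rewrite, re-check) is what distinguishes this from one-shot redaction: later rounds can catch leaks that earlier rewrites introduce or miss, which is where the paper's Reasoning Paradox would surface in practice.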

Related Articles

Llms

AI Has Broken the Internet

So the web has been breaking a lot lately. Vercel is down. GitHub is down. Claude is down. Cloudflare is down. AWS is down. Everything is...

Reddit - Artificial Intelligence · 1 min ·
Llms

LLM agents can trigger real actions now. But what actually stops them from executing?

We ran into a simple but important issue while building agents with tool calling: the model can propose actions but nothing actually enfo...

Reddit - Artificial Intelligence · 1 min ·
Llms

Are LLMs a Dead End? (Investors Just Bet $1 Billion on “Yes”)

| AI Reality Check | Cal Newport Chapters 0:00 What is Yann LeCun Up To? 14:55 How is it possible that LeCun could be right about LLMs be...

Reddit - Artificial Intelligence · 1 min ·
Llms

Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project | TechCrunch

The AI recruiting startup confirmed a security incident after an extortion hacking crew took credit for stealing data from the company's ...

TechCrunch - AI · 4 min ·
