[2602.06771] AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

arXiv - Machine Learning · 4 min read

Summary

The paper presents AEGIS, a novel framework for robust concept erasure in diffusion models, addressing the trade-off between robustness and retention without requiring additional data.

Why It Matters

As AI systems increasingly generate content, ensuring that harmful concepts can be effectively erased while maintaining model utility is crucial. AEGIS offers a solution that balances these needs, enhancing the safety and effectiveness of generative models.

Key Takeaways

  • AEGIS introduces a retention-data-free approach to concept erasure.
  • The framework improves both robustness against reactivation and retention of unrelated concepts.
  • It addresses limitations of previous methods that compromised one aspect for the other.

Computer Science > Machine Learning
arXiv:2602.06771 (cs) [Submitted on 6 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)]

Title: AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
Authors: Fengpeng Li, Kemou Li, Qizhou Wang, Bo Han, Jiantao Zhou

Abstract: Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. Robustness means the model fine-tuned by a concept-erasure method resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model's overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging: prior work typically strengthens one while degrading the other. For example, mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both r...
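The abstract's mention of "mapping a single erased prompt to a fixed safe target" refers to a common target-anchored erasure objective: fine-tune the model so its noise prediction for the erased prompt matches a frozen copy's prediction for a safe anchor prompt. A minimal toy sketch of that loss follows; the linear noise predictor, the embeddings, and `erasure_loss` are illustrative assumptions for exposition, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a diffusion model's noise predictor eps(x_t, c):
# a fixed random linear map plus an additive prompt embedding.
# (Illustrative only; real DMs use a conditioned U-Net or transformer.)
W = rng.standard_normal((8, 8)) * 0.1

def eps(x_t, cond, w):
    return w @ x_t + cond

x_t = rng.standard_normal(8)       # noisy latent at some timestep t
c_erase = rng.standard_normal(8)   # embedding of the concept to erase
c_target = np.zeros(8)             # embedding of a "safe" anchor concept

def erasure_loss(w_finetuned):
    # Push the fine-tuned model's prediction on the erased prompt
    # toward the FROZEN original model's prediction on the safe target.
    pred = eps(x_t, c_erase, w_finetuned)
    anchor = eps(x_t, c_target, W)  # W stays frozen as the reference
    return float(np.mean((pred - anchor) ** 2))

# Before any fine-tuning, the loss is exactly the gap between the two
# prompt conditions, since the shared linear term cancels.
print(erasure_loss(W))
```

The abstract's critique applies directly to this sketch: driving the loss to zero for one erased prompt says nothing about semantically related prompts (robustness) or about predictions on unrelated prompts (retention), which is the gap AEGIS targets.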

