[2602.06771] AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
Summary
The paper presents AEGIS, a framework for robust concept erasure in diffusion models that targets the trade-off between robustness and retention without requiring retention data.
Why It Matters
As AI systems increasingly generate content, ensuring that harmful concepts can be effectively erased while maintaining model utility is crucial. AEGIS offers a solution that balances these needs, enhancing the safety and effectiveness of generative models.
Key Takeaways
- AEGIS introduces a retention-data-free approach to concept erasure.
- The framework improves both robustness against reactivation and retention of unrelated concepts.
- It addresses a limitation of previous methods, which typically improve either robustness or retention at the expense of the other.
arXiv:2602.06771 (cs), Computer Science > Machine Learning
Submitted on 6 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)
Authors: Fengpeng Li, Kemou Li, Qizhou Wang, Bo Han, Jiantao Zhou
Abstract: Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model's overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging. Prior work typically strengthens one while degrading the other: mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both r...
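The abstract's remark that mapping a single erased prompt to a fixed safe target can leave exploitable remnants can be illustrated with a deliberately simplified toy. The sketch below is not AEGIS and not any published erasure method: it assumes a linear map standing in for the model's conditional noise predictor, and every name and number in it (`erase_to_fixed_target`, `c_erase`, `c_target`, `c_related`, the matrix `W0`) is hypothetical. It fine-tunes a copy of the map so that the erased embedding produces the frozen model's output for the safe target, then checks a nearby prompt embedding.

```python
# Toy illustration only: a linear map stands in for a diffusion model's
# conditional noise predictor. All names and values are hypothetical.

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sq_norm(v):
    """Squared Euclidean norm of v."""
    return sum(x * x for x in v)

def erase_to_fixed_target(W0, c_erase, c_target, lr=0.25, steps=50):
    """Minimise ||W c_erase - W0 c_target||^2 over W by gradient descent."""
    target_out = matvec(W0, c_target)      # output of the frozen model
    W = [row[:] for row in W0]             # trainable copy of the weights
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(W, c_erase), target_out)]
        # Gradient of the squared error w.r.t. W is 2 * residual * c_erase^T.
        for i, r in enumerate(residual):
            for j, c in enumerate(c_erase):
                W[i][j] -= lr * 2.0 * r * c
    return W

W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, -1.0]]
c_erase = [1.0, 0.0, 0.0]     # embedding of the single erased prompt
c_target = [0.0, 1.0, 0.0]    # embedding of the fixed safe target
c_related = [1.0, 0.0, 0.5]   # a semantically related prompt variant

W = erase_to_fixed_target(W0, c_erase, c_target)
safe_out = matvec(W0, c_target)

def gap(weights, c):
    """Squared distance between the output for c and the safe output."""
    out = matvec(weights, c)
    return sq_norm([o - s for o, s in zip(out, safe_out)])

erased_gap = gap(W, c_erase)      # driven to (numerically) zero
related_gap = gap(W, c_related)   # stays clearly nonzero
print(erased_gap, related_gap)
```

In this toy the update only changes the weights along the direction of `c_erase`, so the erased prompt lands on the safe output while the related variant keeps a gap of about 1.25. Real diffusion models are nonlinear, so this only gives an intuition for why single-prompt, fixed-target erasure may be vulnerable to related prompts.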