[2601.03213] Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
Summary
The paper presents a novel reinforcement learning framework for unlearning targeted concepts in text-to-image diffusion models, enhancing stability and image quality.
Why It Matters
As machine learning models increasingly handle sensitive data, the ability to effectively 'unlearn' specific information is crucial for privacy and compliance. This research contributes to that goal by improving the efficiency and effectiveness of unlearning methods in generative models, which is relevant for developers and researchers in AI safety and ethics.
Key Takeaways
- Introduces a reinforcement learning framework for diffusion unlearning.
- Utilizes a timestep-aware critic to improve stability and performance.
- Achieves better forgetting of concepts while maintaining image quality.
- Simple to implement and supports off-policy sample reuse.
- Releases code for reproducibility, aiding future research.
Computer Science > Machine Learning
arXiv:2601.03213 (cs)
[Submitted on 6 Jan 2026 (v1), last revised 15 Feb 2026 (this version, v3)]
Title: Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
Authors: Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, Volodymyr Karpiv
Abstract: Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations...
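The abstract's core mechanism, per-step critic rewards turned into advantage estimates for a policy-gradient update, can be sketched in a minimal, framework-agnostic form. This is an illustrative assumption of how such an update might look, not the paper's implementation: `per_step_advantages` and `policy_gradient_loss` are hypothetical names, the rewards stand in for the CLIP-based critic's scores on noisy latents, and the log-probabilities stand in for the reverse diffusion kernel's step likelihoods.

```python
import numpy as np

def per_step_advantages(rewards, values, gamma=0.99):
    """One-step TD advantages A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: per-denoising-step rewards (hypothetically, a critic's
        scores on noisy latents at each reverse-diffusion timestep).
    values: critic value estimates for T+1 states (terminal value
        appended, typically 0).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def policy_gradient_loss(log_probs, advantages):
    """REINFORCE-style surrogate: -mean(A_t * log pi(a_t | s_t)).

    Advantages are treated as fixed weights (no gradient flows
    through the critic), so minimizing this pushes the policy
    toward steps with positive advantage.
    """
    return -np.mean(np.asarray(log_probs, dtype=float) * advantages)
```

Per-step advantages of this kind are one standard way to get denser credit assignment than a single end-of-trajectory reward, which is the variance problem the abstract attributes to prior RL approaches.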