[2501.03544] PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
Summary
PromptGuard introduces a novel method for moderating unsafe content in text-to-image models, enhancing safety without sacrificing image quality or efficiency.
Why It Matters
As text-to-image models become more prevalent, the potential for misuse, including the generation of NSFW content, poses significant ethical challenges. PromptGuard addresses these concerns by providing a robust moderation solution that ensures safe content generation while maintaining performance, which is crucial for developers and researchers in AI safety.
Key Takeaways
- PromptGuard utilizes a soft prompt mechanism for moderating NSFW content in text-to-image models.
- The method enhances safety without compromising the quality of generated images.
- It moderates content faster than existing methods while significantly reducing the ratio of unsafe generations.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2501.03544 (cs)
[Submitted on 7 Jan 2025 (v1), last revised 18 Feb 2026 (this version, v4)]
Title: PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
Authors: Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Xiaofeng Wang, Bo Li
Abstract: Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. […]
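The core mechanism the abstract describes, prepending a learned safety soft prompt P* to the user's text embeddings so that sequence length (and hence inference cost) is unchanged, can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the number of soft tokens `K`, the embedding dimension, and the stand-in encoder are all illustrative choices, not details taken from the paper, and the soft prompt here is random rather than optimized.

```python
import numpy as np

MAX_LEN, DIM, K = 77, 768, 8  # CLIP-like text length/dim; K soft tokens (assumed)

rng = np.random.default_rng(0)

# Learned safety soft prompt P* living in the text encoder's embedding space.
# In PromptGuard this is optimized; here it is random for illustration.
p_star = rng.normal(size=(K, DIM)).astype(np.float32)

def embed_tokens(n_tokens: int) -> np.ndarray:
    """Stand-in for the frozen text encoder's per-token embeddings."""
    return rng.normal(size=(n_tokens, DIM)).astype(np.float32)

def guard_embeddings(text_emb: np.ndarray, p: np.ndarray,
                     max_len: int = MAX_LEN) -> np.ndarray:
    """Prepend the soft prompt, then truncate back to the original sequence
    length so the downstream diffusion model sees the same-shaped input
    (i.e., no extra inference cost and no proxy model)."""
    guarded = np.concatenate([p, text_emb], axis=0)
    return guarded[:max_len]

user_emb = embed_tokens(MAX_LEN)          # embeddings of the user's prompt
guarded = guard_embeddings(user_emb, p_star)
assert guarded.shape == user_emb.shape    # same shape -> same inference cost
```

The guarded embedding tensor would then be fed to the diffusion backbone in place of the raw text embeddings; because P* occupies the leading token slots, it conditions every generation without any change to the model's weights or sampling procedure.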