Dynamic Prompt Jailbreak
Research Paper
GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization
Description: GhostPrompt demonstrates a vulnerability in the multimodal safety filters used with text-to-image generative models. An attacker can bypass these filters with a dynamic prompt optimization framework that iteratively generates adversarial prompts designed to evade both text-based and image-based safety checks while preserving the harmful intent of the original prompt. The bypass combines semantically aligned prompt rewriting with the injection of benign visual cues that confuse image-level filters.
Examples: See arXiv:2405.18540 for examples of adversarial prompts and the generated images. Specific examples are included in Appendix A of the linked paper and available via controlled access.
Impact: Successful exploitation of this vulnerability allows attackers to generate and distribute NSFW images, including those depicting violence, gore, hate speech, or other harmful content, despite the presence of safety filters designed to prevent the generation of such material. This undermines the effectiveness of current safety measures and could have severe implications for online safety and the spread of harmful content.
Affected Systems: Text-to-image generative models that employ large language model (LLM)-based text safety filters and CLIP-based or similar image safety filters, including but not limited to Stable Diffusion and DALL-E 3, as well as pipelines using ShieldLM-7B, GPT-4.1, DeepSeek-V3, or InternVL2-2B as safety filters.
Mitigation Steps:
- Implement dynamic defense strategies that incorporate semantic drift detection to identify anomalies between input prompts and generated images.
- Use multi-stage filtering that cascades LLM-based text filters with vision-language alignment models.
- Incorporate adversarial examples, generated through methods similar to GhostPrompt, into safety filter training data.
- Implement content provenance standards (e.g., C2PA watermarking) to track the origin and authenticity of AI-generated images.
- Develop real-time adversarial attack detection mechanisms.
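The first two mitigations above can be combined into a single moderation cascade: a text filter runs first, then the generated image is checked for semantic drift against the submitted prompt. The sketch below is illustrative only; the stub embedding and filter callables stand in for real components (e.g., CLIP text/image encoders and an LLM-based text classifier), and the drift threshold is a hypothetical value that would need tuning on validation data.

```python
import math

DRIFT_THRESHOLD = 0.25  # illustrative cutoff; tune against a validation set


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def moderate(prompt, image, text_filter, embed_text, embed_image):
    """Two-stage cascade: text filter, then prompt-image alignment check.

    A large semantic gap between the submitted prompt and the generated
    image is treated as evidence of adversarial rewriting (semantic drift),
    since an evasive prompt no longer describes what it actually produces.
    """
    # Stage 1: block prompts the text filter flags outright.
    if not text_filter(prompt):
        return "blocked: text filter"
    # Stage 2: compare prompt and image embeddings; low similarity
    # suggests the prompt was rewritten to evade the text filter.
    sim = cosine(embed_text(prompt), embed_image(image))
    if sim < DRIFT_THRESHOLD:
        return "blocked: semantic drift"
    return "allowed"
```

In a real deployment, `embed_text` and `embed_image` would be the paired encoders of a vision-language model such as CLIP, so that prompt and image land in a shared embedding space where cosine similarity is meaningful.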
© 2025 Promptfoo. All rights reserved.