Multi-Agent T2I Jailbreak
Research Paper
Jailbreaking Text-to-Image Models with LLM-Based Agents
Description: A vulnerability allows bypassing safety filters in text-to-image (T2I) models using a multi-agent framework ("Atlas") powered by Large Language Models (LLMs). Atlas iteratively generates and refines prompts, using a Vision-Language Model (VLM) to assess whether the safety filter was triggered and an LLM to select candidate prompts that evade the filter while remaining semantically similar to the original malicious prompt. This enables the generation of images containing unsafe content.
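The attack reduces to a simple search loop. Below is a minimal sketch of that loop, assuming hypothetical callables `mutate_prompt` (the LLM agent), `filter_triggered` (the target pipeline's safety filter), and `semantic_similarity` (the VLM/embedding scorer); these names are illustrative and do not come from the paper.

```python
def atlas_style_attack(original_prompt, mutate_prompt, filter_triggered,
                       semantic_similarity, max_iters=20, sim_threshold=0.8):
    """Iteratively rewrite a blocked prompt until it evades the filter
    while staying semantically close to the original (hypothetical sketch)."""
    candidates = [original_prompt]
    for _ in range(max_iters):
        # LLM agent proposes rewrites of the current candidates.
        proposals = [mutate_prompt(c) for c in candidates]
        # Keep only proposals that stay on-topic (VLM/embedding check).
        on_topic = [p for p in proposals
                    if semantic_similarity(p, original_prompt) >= sim_threshold]
        # Success: a proposal that no longer activates the safety filter.
        for p in on_topic:
            if not filter_triggered(p):
                return p
        # Otherwise, carry the surviving proposals into the next round.
        candidates = on_topic or candidates
    return None  # attack failed within the iteration budget
```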
Examples: Specific examples of successful jailbreak prompts and resulting images are available in the associated research paper. See arXiv:2405.18540.
Impact: Successful exploitation allows bypassing safety mechanisms in T2I models, leading to the generation of unsafe content such as NSFW images, violent imagery, or images that violate terms of service. This undermines the safety guarantees and ethical safeguards of deployed T2I models.
Affected Systems: Multiple state-of-the-art text-to-image models (Stable Diffusion v1.4, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3) with various safety filters are affected. The attack is demonstrated against multiple filter types (text-based, image-based, and text-image-based), indicating broad applicability.
Mitigation Steps:
- Enhance safety filter training with adversarial examples generated by similar LLM-based attack agents (a harvesting sketch follows this list).
- Implement robust prompt sanitization techniques designed to detect and mitigate iterative prompt-manipulation strategies (see the detector sketch below).
- Employ rate limiting or account suspension for users whose prompt patterns are indicative of jailbreaking attempts (see the rate-limiter sketch below).
- Explore and develop certification mechanisms for T2I models to verify robustness against prompt-based attacks.
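For the first mitigation, successful evasions are themselves useful training data. A minimal sketch, assuming an attack function like the hypothetical `atlas_style_attack` above:

```python
def harvest_adversarial_examples(blocked_prompts, attack_fn):
    """Return (evasive_prompt, label) pairs for filter retraining.

    attack_fn is a hypothetical callable, e.g. a partially applied
    atlas_style_attack, that returns an evasive rewrite or None.
    """
    dataset = []
    for prompt in blocked_prompts:
        evasive = attack_fn(prompt)
        if evasive is not None:  # attack succeeded -> hard training example
            dataset.append((evasive, "unsafe"))
    return dataset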
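For the prompt-sanitization mitigation, one way to detect iterative manipulation is to flag accounts that repeatedly submit near-variants of previously blocked prompts. In this sketch, `difflib` is a crude standard-library stand-in for the embedding-based similarity a production system would use, and the threshold and window size are illustrative assumptions:

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher

BLOCK_HISTORY = defaultdict(lambda: deque(maxlen=50))  # user -> recent blocked prompts
SIMILARITY_THRESHOLD = 0.85  # illustrative, not a tuned value
SUSPICIOUS_HITS = 3

def record_blocked(user_id: str, prompt: str) -> None:
    """Remember a prompt that the safety filter rejected."""
    BLOCK_HISTORY[user_id].append(prompt)

def looks_like_iterative_jailbreak(user_id: str, prompt: str) -> bool:
    """True if this prompt closely resembles several previously blocked ones."""
    hits = sum(
        SequenceMatcher(None, prompt.lower(), old.lower()).ratio() >= SIMILARITY_THRESHOLD
        for old in BLOCK_HISTORY[user_id]
    )
    return hits >= SUSPICIOUS_HITS
```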
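And for the rate-limiting mitigation, a sliding-window counter over safety-filter rejections per account; the window length and violation cap here are illustrative assumptions, not recommended values:

```python
import time
from collections import defaultdict, deque

VIOLATION_WINDOW_SECONDS = 3600  # illustrative window
MAX_VIOLATIONS = 5               # illustrative cap
violations = defaultdict(deque)  # user_id -> timestamps of filter rejections

def record_violation(user_id: str) -> None:
    """Log a safety-filter rejection for this account."""
    violations[user_id].append(time.monotonic())

def should_suspend(user_id: str) -> bool:
    """True once a user exceeds the allowed rejections within the window."""
    window = violations[user_id]
    cutoff = time.monotonic() - VIOLATION_WINDOW_SECONDS
    while window and window[0] < cutoff:  # drop entries outside the window
        window.popleft()
    return len(window) >= MAX_VIOLATIONS
```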