LLM-Guided Prompt Deconstruction
Research Paper
Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model
Description: A vulnerability in Text-to-Image (T2I) models' safety filters allows bypassing through the injection of adversarial prompts crafted by an LLM-driven multi-agent system. The attack, named Divide-and-Conquer Attack (DACA), circumvents the filters by rephrasing harmful prompts into multiple benign descriptions of individual visual components, thus avoiding detection while maintaining the original visual intent.
Examples: See Figures 1 and 6 of the paper for adversarial prompts and their corresponding generated images; additional examples appear in the supplementary material.
Impact: Successful exploitation allows generation of images containing violent, gory, illegal, discriminatory, or pornographic content, bypassing the intended safety mechanisms. The attack is cheap: effective adversarial prompts take minimal resources to generate and can be reused, and reported success rates approach 100% across repeated attempts.
Affected Systems: Text-to-Image models employing LLM-based safety filters, specifically DALL-E 3 and Midjourney V6, are demonstrably affected. Other models using similar safety filter mechanisms may also be vulnerable.
Mitigation Steps:
- Implement post-generation image analysis using vision understanding models to detect harmful content in generated images (see the first sketch after this list).
- Strengthen LLM-based safety filters so they better detect and flag adversarial prompts; prompt summarization is one candidate technique, though it needs further research (second sketch below).
- Increase the granularity of the image ontology used within the safety filter so it captures finer-grained visual elements and the sensitive terms they map to (third sketch below).
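
The first mitigation can be prototyped with an off-the-shelf image classifier that screens each generated image before it is returned to the user. This is a minimal sketch, not any vendor's actual filter; the Hugging Face checkpoint name, its label set, and the threshold are assumptions that could be swapped for any harmful-content image classifier.

```python
# Minimal sketch of post-generation screening, assuming the Hugging Face
# `transformers` library and the (illustrative) Falconsai/nsfw_image_detection
# checkpoint, whose labels are assumed to be "normal" and "nsfw".
from PIL import Image
from transformers import pipeline

classifier = pipeline(
    "image-classification",
    model="Falconsai/nsfw_image_detection",  # assumed checkpoint
)

def should_block(image_path: str, threshold: float = 0.5) -> bool:
    """Return True if the generated image looks harmful and should be withheld."""
    predictions = classifier(Image.open(image_path))
    # The pipeline returns e.g. [{"label": "nsfw", "score": 0.98}, ...]
    return any(
        p["label"] != "normal" and p["score"] >= threshold for p in predictions
    )

if should_block("generated.png"):
    print("Image withheld: post-generation filter flagged harmful content.")
```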
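
The prompt-summarization idea can be sketched as a pre-filter step: recombine the component-wise prompt into a single description of the overall scene, then moderate both the raw prompt and the summary. Because DACA splits harmful intent across benign-looking fragments, a recombined summary can surface what each fragment hides. The OpenAI client usage and model choices below are assumptions for illustration; the paper leaves this defense as an open research direction.

```python
# Sketch of summarize-then-moderate, assuming the `openai` Python SDK.
from openai import OpenAI

client = OpenAI()

def summarize_scene(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {"role": "system",
             "content": "In one sentence, describe the overall image this "
                        "prompt would produce, combining every detail."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

def flagged(text: str) -> bool:
    result = client.moderations.create(
        model="omni-moderation-latest", input=text)
    return result.results[0].flagged

def should_reject(prompt: str) -> bool:
    # Check the prompt as submitted and its recombined summary.
    return flagged(prompt) or flagged(summarize_scene(prompt))
```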
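
A finer-grained image ontology can be represented as a mapping from low-level visual elements to the sensitive concepts they may compose, so that individually benign fragments are flagged when they jointly describe a harmful scene. Everything below, including the ontology entries, the string-matching shortcut, and the co-occurrence threshold, is illustrative only.

```python
# Illustrative sketch: flag a prompt when several benign-looking visual
# elements jointly point at the same sensitive concept. All ontology
# entries and the threshold of 2 are assumptions for demonstration.
from collections import Counter

ONTOLOGY: dict[str, set[str]] = {
    "red liquid": {"gore"},
    "person lying motionless": {"gore", "violence"},
    "raised metal object": {"violence", "weapon"},
    "torn clothing": {"violence"},
}

def sensitive_concepts(prompt: str, min_supporting_elements: int = 2) -> set[str]:
    """Return concepts supported by at least `min_supporting_elements` elements."""
    counts: Counter[str] = Counter()
    lowered = prompt.lower()
    for element, concepts in ONTOLOGY.items():
        if element in lowered:
            counts.update(concepts)
    return {c for c, n in counts.items() if n >= min_supporting_elements}

prompt = "A person lying motionless on the floor, red liquid pooling nearby"
print(sensitive_concepts(prompt))  # {'gore'}
```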