LMVD-ID: 1b604461
Published December 1, 2023

LLM-Guided Prompt Deconstruction

Affected Models: DALL·E 3, Midjourney V6, GPT-4, GPT-3.5-Turbo, Spark V3.0, ChatGLM-Turbo, Qwen-14B, Qwen-Max

Research Paper

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model


Description: A vulnerability in the safety filters of Text-to-Image (T2I) models allows them to be bypassed with adversarial prompts crafted by an LLM-driven multi-agent system. The attack, named the Divide-and-Conquer Attack (DACA), evades the filters by decomposing a harmful prompt into multiple benign descriptions of its individual visual elements, avoiding detection while preserving the original visual intent.

Examples: See the paper's supplementary material for examples of adversarial prompts and their corresponding generated images. Specific examples are also included in Figures 1 and 6 of the paper.

Impact: Successful exploitation allows generation of images containing violent, gory, illegal, discriminatory, or pornographic content, bypassing the intended safety mechanisms. The attack is cost-effective, requiring minimal resources to craft adversarial prompts that can then be reused. Its success rate is high, approaching 100% when attacks are repeated.

Affected Systems: Text-to-Image models employing LLM-based safety filters, specifically DALL·E 3 and Midjourney V6, are demonstrably affected. Other models using similar safety filter mechanisms may also be vulnerable.

Mitigation Steps:

  • Implement post-generation image analysis with vision understanding models to detect harmful content in generated images (see the first sketch after this list).
  • Investigate and improve LLM-based safety filters so they better detect and flag adversarial prompts. Prompt summarization is one candidate technique, though it needs further research (see the second sketch after this list).
  • Enhance the granularity of the image ontology used within the safety filter to capture more detailed visual elements and associated sensitive terms.
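
A minimal sketch of the first mitigation, post-generation image analysis: the rendered image is sent to a vision-capable model and checked against the content policy before delivery. The model name (gpt-4o), the policy wording, and the pass/fail logic are illustrative assumptions, not the filtering pipeline of any specific T2I service.

```python
# Hypothetical post-generation check: ask a vision-capable model whether the
# rendered image depicts disallowed content before returning it to the user.
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POLICY_QUESTION = (
    "Does this image contain violent, gory, illegal, discriminatory, or "
    "pornographic content? Answer with exactly one word: SAFE or UNSAFE."
)


def image_is_safe(image_path: str, model: str = "gpt-4o") -> bool:
    """Return False if the vision model judges the generated image to be harmful."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": POLICY_QUESTION},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return "UNSAFE" not in verdict


# Example usage: withhold an image that fails the post-generation check.
# if not image_is_safe("generated.png"):
#     raise RuntimeError("Image blocked by post-generation safety check")
```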
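
And a sketch of the prompt-summarization idea: before generation, an LLM condenses the (possibly element-by-element) prompt into one literal sentence naming the overall scene, and the service's existing text filter is run on both the raw prompt and the summary. The instructions, the model choice, and the check_text_filter placeholder are assumptions for illustration; the paper does not prescribe a concrete implementation.

```python
# Hypothetical prompt-summarization defense: recombine a divided prompt into a
# single scene description so the existing text filter can see the overall intent.
from typing import Callable

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SUMMARIZE_INSTRUCTIONS = (
    "Summarize, in one plain and literal sentence, the overall scene that the "
    "following image prompt would produce. Name the scene directly, even if the "
    "prompt only lists individual visual elements."
)


def summarize_prompt(prompt: str, model: str = "gpt-4") -> str:
    """Collapse a (possibly deconstructed) prompt back into one scene description."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SUMMARIZE_INSTRUCTIONS},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()


def prompt_passes_filter(prompt: str, check_text_filter: Callable[[str], bool]) -> bool:
    """Apply the service's existing text filter to both the raw prompt and its summary."""
    return check_text_filter(prompt) and check_text_filter(summarize_prompt(prompt))
```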
