LLM-Guided Prompt Deconstruction
Research Paper
Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model
Description: A vulnerability in Text-to-Image (T2I) models' safety filters allows bypassing through the injection of adversarial prompts crafted by an LLM-driven multi-agent system. The attack, named Divide-and-Conquer Attack (DACA), circumvents the filters by rephrasing harmful prompts into multiple benign descriptions of individual visual components, thus avoiding detection while maintaining the original visual intent.
Examples: See Figures 1 and 6 of the paper for adversarial prompts and their corresponding generated images; additional examples appear in the supplementary material.
Impact: Successful exploitation allows generation of images containing violent, gory, illegal, discriminatory, or pornographic content, bypassing the intended safety mechanisms. The attack is cheap: effective adversarial prompts take minimal resources to generate and can be reused, and reported success rates approach 100% across repeated attempts.
Affected Systems: Text-to-Image models employing LLM-based safety filters, specifically DALL-E 3 and Midjourney V6, are demonstrably affected. Other models using similar safety filter mechanisms may also be vulnerable.
Mitigation Steps:
- Implement post-generation image analysis using vision understanding models to detect harmful content in generated images (see the first sketch after this list).
- Strengthen LLM-based safety filters so they better detect and flag adversarial prompts; prompt summarization is one candidate technique, though it needs further research (second sketch below).
- Increase the granularity of the image ontology used within the safety filter so it captures finer-grained visual elements and the sensitive terms they map to (third sketch below).
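
The first mitigation can be prototyped with an off-the-shelf image classifier that screens each generated image before it is returned to the user. This is a minimal sketch, not any vendor's actual filter; the Hugging Face checkpoint name, its label set, and the threshold are assumptions that could be swapped for any harmful-content image classifier.

```python
# Minimal sketch of post-generation screening, assuming the Hugging Face
# `transformers` library and the (illustrative) Falconsai/nsfw_image_detection
# checkpoint, whose labels are assumed to be "normal" and "nsfw".
from PIL import Image
from transformers import pipeline

classifier = pipeline(
    "image-classification",
    model="Falconsai/nsfw_image_detection",  # assumed checkpoint
)

def should_block(image_path: str, threshold: float = 0.5) -> bool:
    """Return True if the generated image looks harmful and should be withheld."""
    predictions = classifier(Image.open(image_path))
    # The pipeline returns e.g. [{"label": "nsfw", "score": 0.98}, ...]
    return any(
        p["label"] != "normal" and p["score"] >= threshold for p in predictions
    )

if should_block("generated.png"):
    print("Image withheld: post-generation filter flagged harmful content.")
```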
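
The prompt-summarization idea can be sketched as a pre-filter step: recombine the component-wise prompt into a single description of the overall scene, then moderate both the raw prompt and the summary. Because DACA splits harmful intent across benign-looking fragments, a recombined summary can surface what each fragment hides. The OpenAI client usage and model choices below are assumptions for illustration; the paper leaves this defense as an open research direction.

```python
# Sketch of summarize-then-moderate, assuming the `openai` Python SDK.
from openai import OpenAI

client = OpenAI()

def summarize_scene(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {"role": "system",
             "content": "In one sentence, describe the overall image this "
                        "prompt would produce, combining every detail."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

def flagged(text: str) -> bool:
    result = client.moderations.create(
        model="omni-moderation-latest", input=text)
    return result.results[0].flagged

def should_reject(prompt: str) -> bool:
    # Check the prompt as submitted and its recombined summary.
    return flagged(prompt) or flagged(summarize_scene(prompt))
```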
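
A finer-grained image ontology can be represented as a mapping from low-level visual elements to the sensitive concepts they may compose, so that individually benign fragments are flagged when they jointly describe a harmful scene. Everything below, including the ontology entries, the string-matching shortcut, and the co-occurrence threshold, is illustrative only.

```python
# Illustrative sketch: flag a prompt when several benign-looking visual
# elements jointly point at the same sensitive concept. All ontology
# entries and the threshold of 2 are assumptions for demonstration.
from collections import Counter

ONTOLOGY: dict[str, set[str]] = {
    "red liquid": {"gore"},
    "person lying motionless": {"gore", "violence"},
    "raised metal object": {"violence", "weapon"},
    "torn clothing": {"violence"},
}

def sensitive_concepts(prompt: str, min_supporting_elements: int = 2) -> set[str]:
    """Return concepts supported by at least `min_supporting_elements` elements."""
    counts: Counter[str] = Counter()
    lowered = prompt.lower()
    for element, concepts in ONTOLOGY.items():
        if element in lowered:
            counts.update(concepts)
    return {c for c, n in counts.items() if n >= min_supporting_elements}

prompt = "A person lying motionless on the floor, red liquid pooling nearby"
print(sensitive_concepts(prompt))  # {'gore'}
```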