Metaphor-Based T2I Jailbreak
Research Paper
Metaphor-based Jailbreaking Attacks on Text-to-Image Models
View PaperDescription: A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
Examples: See the provided research paper for examples of metaphor-based adversarial prompts that successfully bypass safety filters on various T2I models. Specific examples are included within the paper's experimental results and visualization sections.
Impact: Successful exploitation allows generation of images depicting sensitive content (e.g., violence, sexual imagery, illegal activities) that would normally be blocked by the T2I model’s safety mechanisms. This compromises the intended safety and responsible use of the model, potentially leading to the creation and dissemination of harmful or illegal material.
Affected Systems: Various open-source and commercial text-to-image models, including but not limited to Stable Diffusion (v1.4, XL), Flux, DALL-E 3, and Midjourney, are susceptible if their safety mechanisms rely on keyword filtering or similar methods. The vulnerability affects systems using these models where their safety filters are not sufficiently robust against metaphorical language.
Mitigation Steps:
- Enhance safety filters: Implement safety filters that are not solely reliant on keyword blacklists. Incorporate techniques to detect and block prompts conveying sensitive content through metaphor, analogy, or other indirect means. This may include advanced semantic analysis or the use of more robust prompt classifiers.
- Improve model training: Develop more robust training methodologies to improve the model's resistance against adversarial prompts, making it less susceptible to misinterpreting or ignoring the implications of indirectly expressed sensitive content.
- Regular security audits: Conduct thorough security testing and audits of T2I models to identify and address vulnerabilities related to prompt engineering and safety filter circumvention.
- Human-in-the-loop review: Incorporate human review of generated images, especially when dealing with sensitive prompts or potentially harmful content, to act as a final safeguard against filter bypasses.
© 2025 Promptfoo. All rights reserved.