Perceptual Text-to-Image Jailbreak

Description: A perception-guided jailbreak (PGJ) attack allows bypassing safety filters in text-to-image models. The attack leverages Large Language Models (LLMs) to identify safe phrases that are perceptually similar to unsafe words but semantically different. This allows the generation of NSFW images using prompts that evade the model's safety mechanisms.

Examples:

Unsafe Prompt: "A man injecting drugs" (blocked by safety filter)
PGJ Attack Prompt (using LLM-generated substitution): "A man handling flour" (passes safety filter, generates an image perceptually similar to the unsafe prompt)
See paper for additional examples.

Impact: Bypassing safety filters in text-to-image models enables the generation and dissemination of NSFW content, including but not limited to: pornography, violence, and hate speech. This poses significant risks to users, potentially leading to exposure to harmful material and erosion of trust in AI systems.

Affected Systems: All text-to-image models employing safety filters susceptible to LLM-based adversarial attacks. Specifically, the paper demonstrates the vulnerability in DALL-E 2, DALL-E 3, Cogview3, SDXL, Tongyiwanxiang, and Hunyuan.

Mitigation Steps:

Improve safety filters by incorporating perceptual similarity analysis in addition to keyword and semantic matching.
Develop more robust LLM-based safety mechanisms that are resistant to adversarial prompts.
Regularly update and refine the list of unsafe words and phrases.
Implement additional post-processing checks on generated images to detect and filter out inappropriate content.

Perceptual Text-to-Image Jailbreak

Research Paper