Semantic Confusion Jailbreak

Description: The Antelope attack exploits weaknesses in the safety filters of Text-to-Image (T2I) models by crafting adversarial prompts that appear benign yet induce the generation of NSFW images. It leverages the semantic similarity between harmless and harmful concepts: explicit terms in an original prompt are replaced with seemingly innocuous alternatives, and carefully selected suffix tokens are appended. The resulting prompt evades both text-based and image-based filters while remaining closely aligned, semantically, with the original intent, so the sensitive content is still generated.

Examples: Specific examples of adversarial prompts generated by Antelope are presented in the paper's Figure 6 and are omitted here due to their NSFW nature. See arXiv:2405.18540 for details.

Impact: Successful Antelope attacks lead to the generation and dissemination of NSFW content, even in systems with safety filters in place. This compromises the intended safety mechanisms and potentially exposes users to harmful or offensive material. The transferability of the attack to online black-box services further increases the impact.

Affected Systems: A wide range of T2I models vulnerable to adversarial prompts, including but not limited to:

Stable Diffusion (various versions)
Midjourney
Leonardo.AI
Other models employing similar safety filtering mechanisms.

Mitigation Steps:

Enhance prompt filtering: Implement more robust text-based filters capable of detecting subtle semantic manipulations.
Improve image-based filtering: Strengthen image recognition capabilities to identify NSFW content generated from apparently safe prompts.
Increase model robustness: Train the T2I models with more diverse and adversarial data to increase their resilience against adversarial prompt attacks.
Regular model updates: Continuously update and refine the model's safety mechanisms to keep pace with evolving attack techniques.
Diverse defensive mechanisms: Employ a multi-layered defense strategy that combines various methods, making exploitation significantly harder.
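
The multi-layered strategy above can be sketched as a small pipeline that runs text-side checks before generation and image-side checks after it. This is a minimal illustration under stated assumptions, not a reference implementation: `guarded_generate`, `Verdict`, the stub scorers, the placeholder tokens, and the 0.5 threshold are all hypothetical names and values introduced here, and a real deployment would plug in an actual prompt classifier, T2I model, and image classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical scorer signatures: each returns a risk score in [0, 1].
TextScorer = Callable[[str], float]
ImageScorer = Callable[[bytes], float]


@dataclass
class Verdict:
    allowed: bool
    blocked_by: Optional[str] = None  # name of the layer that blocked, if any
    score: float = 0.0


def guarded_generate(
    prompt: str,
    generate: Callable[[str], bytes],
    text_layers: Sequence[Tuple[str, TextScorer]],
    image_layers: Sequence[Tuple[str, ImageScorer]],
    threshold: float = 0.5,  # assumed cutoff, tune per deployment
) -> Tuple[Verdict, Optional[bytes]]:
    """Run every text layer, generate, then run every image layer."""
    for name, scorer in text_layers:
        score = scorer(prompt)
        if score >= threshold:
            return Verdict(False, name, score), None

    image = generate(prompt)

    # The image is checked even though the prompt looked safe, because an
    # adversarial prompt is built specifically to pass the text layers.
    for name, scorer in image_layers:
        score = scorer(image)
        if score >= threshold:
            return Verdict(False, name, score), None

    return Verdict(True), image


# --- Stubs standing in for real components (assumptions for the demo) ---
def keyword_scorer(prompt: str) -> float:
    # Naive blocklist; "blockedterm" is a placeholder token.
    return 1.0 if "blockedterm" in prompt.lower().split() else 0.0


def stub_generate(prompt: str) -> bytes:
    # Pretends that a substituted term still steers the model to unsafe output.
    return b"UNSAFE" if "substitute" in prompt.lower() else b"SAFE"


def stub_image_scorer(image: bytes) -> float:
    return 1.0 if image == b"UNSAFE" else 0.0


TEXT_LAYERS = [("keyword", keyword_scorer)]
IMAGE_LAYERS = [("image_classifier", stub_image_scorer)]


def run(prompt: str) -> Tuple[Verdict, Optional[bytes]]:
    return guarded_generate(prompt, stub_generate, TEXT_LAYERS, IMAGE_LAYERS)
```

In this sketch, a prompt containing the placeholder `substitute` passes the keyword layer but is stopped by the image layer, which is the situation the advisory describes: no single filter is sufficient, so each layer is evaluated independently and any one of them can block the output.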
