LMVD-ID: 01bb207e
Published October 1, 2024

Gradient Image Jailbreaks

Affected Models: chameleon-7b, chameleon-30b, llama3-llava-1.6

Research Paper

Gradient-based jailbreak images for multimodal fusion models


Description: Multimodal fusion models such as Chameleon tokenize image inputs with non-differentiable functions, which normally blocks direct gradient-based attacks. An attacker with white-box access can circumvent this by constructing a "tokenizer shortcut": a differentiable approximation of the tokenization step that lets gradients flow from the language-model loss back to the image pixels. Continuous optimization of the image then maximizes the probability of a target harmful response, producing adversarial images that elicit unsafe outputs even for prompts that would normally trigger the model's safety protocols. Because the optimization operates in continuous pixel space, it sidesteps safeguards designed around discrete, text-based attacks.
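
The attack flow can be illustrated with a short PyTorch sketch. This is a minimal illustration of the tokenizer-shortcut idea under assumed interfaces, not the paper's implementation: the model methods (`encode_to_latents`, `codebook`, `forward_with_image_embeds`), the softmax-based soft quantization, and the [0, 1] pixel range are all assumptions.

```python
# Hedged sketch of a tokenizer-shortcut attack on a VQ-style image tokenizer.
# All model-facing names are hypothetical stand-ins for whatever the target
# fusion model actually exposes.
import torch
import torch.nn.functional as F

def soft_quantize(latents, codebook, temperature=1.0):
    """Differentiable approximation of nearest-neighbor quantization.

    Instead of snapping each latent to its closest codebook entry (an argmax,
    which blocks gradients), take a softmax-weighted mixture of codebook
    embeddings so gradients can flow back to the input image.
    """
    # latents: (num_patches, dim); codebook: (vocab_size, dim)
    distances = torch.cdist(latents, codebook)            # (num_patches, vocab)
    weights = F.softmax(-distances / temperature, dim=-1)
    return weights @ codebook                             # soft token embeddings

def optimize_adversarial_image(fusion_model, image, prompt_ids, target_ids,
                               steps=500, lr=1e-2):
    """Gradient descent on pixels to maximize the likelihood of a target
    (harmful) continuation, routed through the tokenizer shortcut."""
    adv_image = image.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([adv_image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        latents = fusion_model.encode_to_latents(adv_image)        # assumed encoder call
        soft_tokens = soft_quantize(latents, fusion_model.codebook)
        # Assumed forward pass that accepts continuous image-token embeddings
        # in place of discrete image tokens; returns (seq_len, vocab) logits.
        logits = fusion_model.forward_with_image_embeds(soft_tokens, prompt_ids)
        loss = F.cross_entropy(logits[-target_ids.numel():], target_ids)
        loss.backward()
        optimizer.step()
        adv_image.data.clamp_(0.0, 1.0)  # assumes pixels normalized to [0, 1]
    return adv_image.detach()
```

The key design point is that the soft quantization is only needed at attack time; the resulting adversarial image is then fed through the model's ordinary, discrete tokenizer at inference.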

Examples: See https://github.com/facebookresearch/multimodal-fusion-jailbreaks. The repository contains code and examples demonstrating generation of adversarial images and their effect on the Chameleon models.

Impact: Successful exploitation of this vulnerability allows attackers to bypass safety features implemented in multimodal fusion models, leading to the generation of unsafe or harmful content. The attack achieves a high success rate (72.5% for certain prompts in the evaluated dataset), outperforming text-based attacks in terms of both effectiveness and computational efficiency.

Affected Systems: Multimodal fusion models that employ non-differentiable image tokenization, specifically the Chameleon family (Chameleon 7B, Chameleon 30B). Other models with analogous architectures may also be affected.

Mitigation Steps:

  • Improved Tokenization: Develop and deploy differentiable or less easily approximated tokenization techniques for image inputs to prevent the creation of effective tokenizer shortcuts.
  • Robustness Training: Enhance model training with adversarial examples generated via techniques like the tokenizer shortcut to improve the model's resilience against gradient-based attacks (a minimal training-loop sketch follows this list).
  • Defense Mechanisms: Implement defenses based on representation engineering, such as circuit breakers, that are effective against both text-based and image-based adversarial attacks. Training these defenses on diverse attack types enhances their transferability.
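
As a rough illustration of the robustness-training item above, the following sketch mixes freshly generated adversarial images into fine-tuning so the model learns to refuse on them. It assumes the `optimize_adversarial_image` helper from the earlier sketch and a hypothetical `fusion_model(image, prompt_ids)` forward pass; it is a sketch of the idea, not a published training recipe.

```python
# Hedged adversarial-training step: craft an adversarial image against the
# current model, then supervise the model to produce a refusal on it.
import torch
import torch.nn.functional as F

def adversarial_finetune_step(fusion_model, optimizer, image, prompt_ids,
                              harmful_target_ids, refusal_ids):
    # 1) Craft an adversarial image against the current parameters
    #    (fewer inner steps than a full attack, to keep training tractable).
    adv_image = optimize_adversarial_image(
        fusion_model, image, prompt_ids, harmful_target_ids, steps=50)

    # 2) Train the model to emit a refusal on that adversarial input.
    optimizer.zero_grad()
    logits = fusion_model(adv_image, prompt_ids)        # assumed forward pass
    loss = F.cross_entropy(logits[-refusal_ids.numel():], refusal_ids)
    loss.backward()
    optimizer.step()
    return loss.item()
```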
