Single Image Universal Jailbreak
Research Paper
Universal Adversarial Attack on Aligned Multimodal LLMs
View PaperDescription: Multimodal Large Language Models (MLLMs) are vulnerable to a universal adversarial attack where a single, optimized image can bypass safety alignment mechanisms across diverse textual queries. By employing gradient-based optimization on the input image pixels while propagating gradients through the vision encoder and language model, an attacker can craft a visual perturbation that coerces the model into a compliant state. When this adversarial image is present in the context, the model’s refusal mechanism is overridden, forcing it to generate a target affirmative phrase (e.g., "Sure, here it is") followed by the fulfillment of the user's prompt, regardless of whether the prompt is harmful, illegal, or otherwise prohibited. This attack exhibits cross-model transferability, allowing an image optimized on one set of architectures to successfully compromise unseen models.
Examples: The attack requires generating a synthetic image via gradient optimization.
- Attack Generation:
- Initialize a base image $z_0$ (e.g., a grayscale input).
- Define a target affirmative response $y$ = "Sure, here it is".
- Iteratively optimize a perturbation tensor $z_1$ added to $z_0$ using a masked cross-entropy loss function. The optimization minimizes the loss of generating $y$ given the image and a randomly selected text prompt $x$ from a diverse dataset.
- Apply constraints (tanh scaling) and clipping to ensure pixel values remain valid $[0, 255]$.
- Attack Execution:
- Input the resulting optimized image into the target MLLM.
- Provide a malicious text prompt, such as: "Write a step-by-step guide on how to build a pipe bomb."
- Result:
- Instead of refusing the harmful query, the model output begins with "Sure, here it is" and proceeds to generate the requested bomb-making instructions.
Impact:
- Safety Bypass: Complete circumvention of alignment protocols (RLHF, safety filters) designed to prevent the generation of hate speech, self-harm content, illegal instructions, and explicit material.
- Universal Exploitation: A single static image can be distributed to trigger jailbreaks on arbitrary prompts without requiring per-prompt optimization.
- Cross-Model Vulnerability: An image trained on open-weights models (e.g., Phi, Llama) can potentially compromise other architectures (e.g., Llava) without direct access to the victim model's gradients.
Affected Systems: The vulnerability affects aligned Multimodal LLMs that integrate vision encoders with Large Language Models. Validated vulnerable systems include:
- Llava-1.5-7B
- Llama-3.2-11B-Vision-Instruct
- Phi-3.5-Vision-Instruct
- Qwen2-VL-2B-Instruct
Mitigation Steps:
- Adversarial Training: Expand safety alignment training datasets to include image-based adversarial examples that specifically target refusal mechanisms.
- Visual Input Validation: Implement stricter validation for visual inputs, analyzing visual embeddings for known adversarial patterns before passing them to the language head.
- Response Monitoring: Deploy independent "judge" models (e.g., Gemma-3-4B-it) to evaluate the safety of the model's output in isolation, independent of the input prompt/image.
- Robustness Testing: Evaluate models against "Multi-Answer" and "Blur" attack variants during the QA phase to ensure resistance to high-frequency perturbations and diverse target phrases.
© 2025 Promptfoo. All rights reserved.