Full-Spectrum Diffusion Attack
Research Paper
Adversarial-Guided Diffusion for Multimodal LLM Attacks
Description: Multimodal Large Language Models (MLLMs) are vulnerable to a targeted adversarial attack technique termed Adversarial-Guided Diffusion (AGD). AGD generates adversarial images by injecting targeted semantic information into the noise component of the reverse-diffusion process of a text-to-image generative model (specifically Stable Diffusion). Unlike traditional pixel-based attacks that introduce high-frequency perturbations easily removed by low-pass filtering, AGD uses a momentum-based gradient update to embed adversarial signals across the entire frequency spectrum of the noise distribution ($\epsilon \sim \mathcal{N}(0, I)$). By manipulating the denoising steps (specifically from step $\Delta$ down to 1) and using the Tweedie formula to derive the injection noise from CLIP feature similarity, an attacker can coerce an MLLM into generating a specific, pre-defined textual response while maintaining high visual fidelity to the original clean image. The attack circumvents standard frequency-based defenses such as JPEG compression, randomization, and smoothing.
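The guidance signal at the heart of the technique can be illustrated with a short sketch. The helper below is a minimal, hypothetical reconstruction (not the authors' code): it forms a Tweedie-style one-step estimate of the clean image from the current noisy state, scores that estimate against the CLIP embedding of the target image, and returns the gradient of the similarity as $\epsilon^{\text{tar}}$. The names (`unet`, `clip_image_encoder`, `target_guidance_noise`) and the interface are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): estimate the target-guidance noise
# epsilon^tar as the gradient of CLIP feature similarity between the
# Tweedie-denoised estimate x0_hat and the CLIP embedding of the target image.
import torch
import torch.nn.functional as F

def target_guidance_noise(x_t: torch.Tensor,
                          t: int,
                          unet,                      # noise predictor eps_theta(x_t, t)
                          alpha_bar: torch.Tensor,   # cumulative alphas, shape [T]
                          clip_image_encoder,        # maps images -> CLIP features
                                                     # (assumed to handle resizing/normalization)
                          target_feat: torch.Tensor  # CLIP feature of the T2I-generated target image
                          ) -> torch.Tensor:
    """Gradient of CLIP similarity w.r.t. x_t, used as epsilon^tar."""
    x_t = x_t.detach().requires_grad_(True)
    eps = unet(x_t, t)

    # Tweedie-style one-step estimate of the clean image from the noisy state x_t.
    a_bar = alpha_bar[t]
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(a_bar)

    # Cosine similarity between the denoised estimate and the target embedding.
    feat = clip_image_encoder(x0_hat)
    sim = F.cosine_similarity(feat, target_feat, dim=-1).sum()

    # The gradient pushes x_t toward the target semantics.
    grad, = torch.autograd.grad(sim, x_t)
    return grad.detach()
```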
Examples: To reproduce the attack, an adversary utilizes a pre-trained Stable Diffusion model to process a clean source image (e.g., from ImageNet-1K) and a target text string (e.g., from MS-COCO captions).
- Attack Setup:
- Source Image: A standard resolution image (e.g., $512 \times 512$).
- Target Output: A specific text string (e.g., "A brown bear sitting on a rock") intended to replace the model's interpretation of the image.
- Parameters: Set the adversarial scale $\gamma=6$, inner iterations $N=50$, momentum factor $\lambda=0.9$, and the injection starting step $\Delta=5$.
- Execution:
- The clean image is noised to step $T$ and partially denoised to step $\Delta$ to ensure structural similarity.
- From step $\Delta$ to 1, the target semantics are injected via the momentum-based noise update equation: $$ \tilde{\epsilon} = \epsilon_\theta(x_t) + \gamma \cdot \text{sign}(\lambda \overline{\epsilon} + (1-\lambda)\epsilon^{\text{tar}}) $$
- Here $\epsilon^{\text{tar}}$ is derived from the gradient of the log probability of the current state relative to the target image embedding (generated via text-to-image synthesis from the target text), and $\overline{\epsilon}$ is the momentum-accumulated noise term; a code sketch of this loop follows the Examples list.
- Result:
- When the resulting adversarial image is fed into models such as BLIP-2 or LLaVA-1.5 with the prompt "Describe this image," the model outputs the target text ("A brown bear sitting on a rock") instead of describing the actual visual content, while the image appears unchanged to the human eye.
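A compact sketch of how the injection loop described above might be assembled is given below. It is an interpretation of the update equation, not the authors' reference implementation: the placement of the $N$ inner iterations, the momentum recurrence for $\overline{\epsilon}$, where the guidance gradient is evaluated, and the `ddim_step` sampler hook are all assumptions; `target_guidance_noise` refers to the helper sketched earlier.

```python
# Sketch of the injection loop from step Delta down to 1 (an interpretation of the
# momentum sign update above; the inner-loop structure is an assumption).
import torch

def agd_denoise(x_delta, unet, alpha_bar, clip_image_encoder, target_feat, ddim_step,
                gamma=6.0, N=50, lam=0.9, delta=5):
    x_t = x_delta
    eps_bar = torch.zeros_like(x_delta)               # momentum buffer epsilon-bar
    for t in range(delta, 0, -1):
        with torch.no_grad():
            eps = unet(x_t, t)                        # epsilon_theta(x_t)
        eps_tilde = eps                               # start from the model's own prediction
        for _ in range(N):                            # inner refinement of the injected noise
            # Evaluate the guidance gradient at the state the current injected noise would produce.
            x_next = ddim_step(x_t, eps_tilde, t)
            eps_tar = target_guidance_noise(x_next, max(t - 1, 1), unet, alpha_bar,
                                            clip_image_encoder, target_feat)
            eps_bar = lam * eps_bar + (1.0 - lam) * eps_tar
            # Momentum sign update from the equation above.
            eps_tilde = eps + gamma * torch.sign(lam * eps_bar + (1.0 - lam) * eps_tar)
        with torch.no_grad():
            x_t = ddim_step(x_t, eps_tilde, t)        # reverse-diffusion step with injected noise
    return x_t                                        # x_0: the adversarial image
```

In this reading, the momentum buffer smooths the CLIP-guidance gradients across iterations, and the sign operation injects them at scale $\gamma$ into the predicted noise at every remaining denoising step, which is what spreads the perturbation across the full frequency spectrum rather than concentrating it in high-frequency bands.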
Impact: This vulnerability allows for targeted manipulation of MLLM outputs.
- Targeted Misinformation: Attackers can force models to caption benign images with malicious, political, or false descriptions.
- Safety Guardrail Bypassing: By embedding "jailbreak" prompts (e.g., instructions for illicit activities) visually into an image, attackers can bypass text-based safety filters that do not scrutinize the full frequency spectrum of visual inputs.
- Defense Evasion: The attack renders standard preprocessing defenses (JPEG compression, Gaussian blurring, DiffPure, MimicDiffusion) ineffective, since the adversarial perturbation is not confined to high-frequency bands; an illustrative check against one such defense is sketched below.
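As a concrete (illustrative, not paper-supplied) way to probe the defense-evasion claim, the snippet below captions an adversarial image with BLIP-2 before and after a JPEG round-trip. The model checkpoint, file path, prompt, and quality setting are example choices.

```python
# Illustrative check: does a JPEG round-trip strip the perturbation?
# Caption the adversarial image with BLIP-2 before and after compression.
import io
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

def caption(image: Image.Image, prompt: str = "Describe this image.") -> str:
    inputs = processor(images=image, text=prompt,
                       return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

def jpeg_roundtrip(image: Image.Image, quality: int = 75) -> Image.Image:
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)   # lossy, low-pass-like preprocessing defense
    buf.seek(0)
    return Image.open(buf).convert("RGB")

adv = Image.open("adversarial.png").convert("RGB")    # output of the attack above (path is illustrative)
print("raw:        ", caption(adv))
print("jpeg (q=75):", caption(jpeg_roundtrip(adv)))   # a full-spectrum attack is expected to survive this
```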
Affected Systems:
- UniDiffuser (diffusion-based multimodal model)
- BLIP-2 (Salesforce)
- MiniGPT-4
- LLaVA-1.5 (Large Language and Vision Assistant)
- Qwen2-VL
- Any MLLM accepting visual input that does not employ robust adversarial training against full-spectrum noise injection.