Full-Spectrum Diffusion Attack
Research Paper
Adversarial-Guided Diffusion for Multimodal LLM Attacks
Description: Multimodal Large Language Models (MLLMs) are vulnerable to a targeted adversarial attack technique termed Adversarial-Guided Diffusion (AGD). AGD generates adversarial images by injecting targeted semantic information into the noise component of the reverse-diffusion process of a text-to-image generative model (specifically Stable Diffusion). Unlike traditional pixel-based attacks that introduce high-frequency perturbations easily removed by low-pass filtering, AGD uses a momentum-based gradient update to embed adversarial signals across the entire frequency spectrum of the noise distribution ($\epsilon \sim \mathcal{N}(0, I)$). By manipulating the denoising steps (specifically from step $\Delta$ down to 1) and using the Tweedie formula to derive the injection noise from CLIP feature similarity, an attacker can coerce an MLLM into generating a specific, pre-defined textual response while maintaining high visual fidelity to the original clean image. The attack circumvents standard frequency-based defenses such as JPEG compression, randomization, and smoothing.
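The guidance signal at the heart of the technique can be illustrated with a short sketch. The helper below is a minimal, hypothetical reconstruction (not the authors' code): it forms a Tweedie-style one-step estimate of the clean image from the current noisy state, scores that estimate against the CLIP embedding of the target image, and returns the gradient of the similarity as $\epsilon^{\text{tar}}$. The names (`unet`, `clip_image_encoder`, `target_guidance_noise`) and the interface are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): estimate the target-guidance noise
# epsilon^tar as the gradient of CLIP feature similarity between the
# Tweedie-denoised estimate x0_hat and the CLIP embedding of the target image.
import torch
import torch.nn.functional as F

def target_guidance_noise(x_t: torch.Tensor,
                          t: int,
                          unet,                      # noise predictor eps_theta(x_t, t)
                          alpha_bar: torch.Tensor,   # cumulative alphas, shape [T]
                          clip_image_encoder,        # maps images -> CLIP features
                                                     # (assumed to handle resizing/normalization)
                          target_feat: torch.Tensor  # CLIP feature of the T2I-generated target image
                          ) -> torch.Tensor:
    """Gradient of CLIP similarity w.r.t. x_t, used as epsilon^tar."""
    x_t = x_t.detach().requires_grad_(True)
    eps = unet(x_t, t)

    # Tweedie-style one-step estimate of the clean image from the noisy state x_t.
    a_bar = alpha_bar[t]
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(a_bar)

    # Cosine similarity between the denoised estimate and the target embedding.
    feat = clip_image_encoder(x0_hat)
    sim = F.cosine_similarity(feat, target_feat, dim=-1).sum()

    # The gradient pushes x_t toward the target semantics.
    grad, = torch.autograd.grad(sim, x_t)
    return grad.detach()
```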
Examples: To reproduce the attack, an adversary utilizes a pre-trained Stable Diffusion model to process a clean source image (e.g., from ImageNet-1K) and a target text string (e.g., from MS-COCO captions).
- Attack Setup:
- Source Image: A standard resolution image (e.g., $512 \times 512$).
- Target Output: A specific text string (e.g., "A brown bear sitting on a rock") intended to replace the model's interpretation of the image.
- Parameters: Set the adversarial scale $\gamma=6$, inner iterations $N=50$, momentum factor $\lambda=0.9$, and the injection starting step $\Delta=5$.
- Execution:
- The clean image is noised to step $T$ and partially denoised to step $\Delta$ to ensure structural similarity.
- From step $\Delta$ to 1, the target semantics are injected via the momentum-based noise update equation: $$ \tilde{\epsilon} = \epsilon_\theta(x_t) + \gamma \cdot \text{sign}(\lambda \overline{\epsilon} + (1-\lambda)\epsilon^{\text{tar}}) $$
- Here $\epsilon^{\text{tar}}$ is derived from the gradient of the log probability of the current state relative to the target image embedding (generated via text-to-image synthesis from the target text), and $\overline{\epsilon}$ is the momentum-accumulated noise term; a code sketch of this loop follows the Examples list.
- Result:
- When the resulting adversarial image is fed into models such as BLIP-2 or LLaVA-1.5 with the prompt "Describe this image," the model outputs the target text ("A brown bear sitting on a rock") instead of describing the actual visual content, while the image appears unchanged to the human eye.
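A compact sketch of how the injection loop described above might be assembled is given below. It is an interpretation of the update equation, not the authors' reference implementation: the placement of the $N$ inner iterations, the momentum recurrence for $\overline{\epsilon}$, where the guidance gradient is evaluated, and the `ddim_step` sampler hook are all assumptions; `target_guidance_noise` refers to the helper sketched earlier.

```python
# Sketch of the injection loop from step Delta down to 1 (an interpretation of the
# momentum sign update above; the inner-loop structure is an assumption).
import torch

def agd_denoise(x_delta, unet, alpha_bar, clip_image_encoder, target_feat, ddim_step,
                gamma=6.0, N=50, lam=0.9, delta=5):
    x_t = x_delta
    eps_bar = torch.zeros_like(x_delta)               # momentum buffer epsilon-bar
    for t in range(delta, 0, -1):
        with torch.no_grad():
            eps = unet(x_t, t)                        # epsilon_theta(x_t)
        eps_tilde = eps                               # start from the model's own prediction
        for _ in range(N):                            # inner refinement of the injected noise
            # Evaluate the guidance gradient at the state the current injected noise would produce.
            x_next = ddim_step(x_t, eps_tilde, t)
            eps_tar = target_guidance_noise(x_next, max(t - 1, 1), unet, alpha_bar,
                                            clip_image_encoder, target_feat)
            eps_bar = lam * eps_bar + (1.0 - lam) * eps_tar
            # Momentum sign update from the equation above.
            eps_tilde = eps + gamma * torch.sign(lam * eps_bar + (1.0 - lam) * eps_tar)
        with torch.no_grad():
            x_t = ddim_step(x_t, eps_tilde, t)        # reverse-diffusion step with injected noise
    return x_t                                        # x_0: the adversarial image
```

In this reading, the momentum buffer smooths the CLIP-guidance gradients across iterations, and the sign operation injects them at scale $\gamma$ into the predicted noise at every remaining denoising step, which is what spreads the perturbation across the full frequency spectrum rather than concentrating it in high-frequency bands.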
Impact: This vulnerability allows for targeted manipulation of MLLM outputs.
- Targeted Misinformation: Attackers can force models to caption benign images with malicious, political, or false descriptions.
- Safety Guardrail Bypassing: By embedding "jailbreak" prompts (e.g., instructions for illicit activities) visually into an image, attackers can bypass text-based safety filters that do not scrutinize the full frequency spectrum of visual inputs.
- Defense Evasion: The attack renders standard preprocessing defenses (JPEG compression, Gaussian blurring, DiffPure, MimicDiffusion) ineffective, since the adversarial perturbation is not confined to high-frequency bands; an illustrative check against one such defense is sketched below.
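As a concrete (illustrative, not paper-supplied) way to probe the defense-evasion claim, the snippet below captions an adversarial image with BLIP-2 before and after a JPEG round-trip. The model checkpoint, file path, prompt, and quality setting are example choices.

```python
# Illustrative check: does a JPEG round-trip strip the perturbation?
# Caption the adversarial image with BLIP-2 before and after compression.
import io
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

def caption(image: Image.Image, prompt: str = "Describe this image.") -> str:
    inputs = processor(images=image, text=prompt,
                       return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

def jpeg_roundtrip(image: Image.Image, quality: int = 75) -> Image.Image:
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)   # lossy, low-pass-like preprocessing defense
    buf.seek(0)
    return Image.open(buf).convert("RGB")

adv = Image.open("adversarial.png").convert("RGB")    # output of the attack above (path is illustrative)
print("raw:        ", caption(adv))
print("jpeg (q=75):", caption(jpeg_roundtrip(adv)))   # a full-spectrum attack is expected to survive this
```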
Affected Systems:
- UniDiffuser (diffusion-based multimodal model)
- BLIP-2 (Salesforce)
- MiniGPT-4
- LLaVA-1.5 (Large Language and Vision Assistant)
- Qwen2-VL
- Any MLLM accepting visual input that does not employ robust adversarial training against full-spectrum noise injection.