Adversarial VLM Jailbreak
Research Paper
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
Description: Vision-Language Models (VLMs), specifically the LLaVA-1.5 and LLaVA-1.6 series, are vulnerable to optimization-based white-box jailbreak attacks despite standard safety alignment measures like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Attackers can craft adversarial perturbations in the image space (imperceptible noise) or latent space using Projected Gradient Descent (PGD) to manipulate the model's internal representations. These perturbations maximize the probability of the model generating harmful, toxic, or disallowed content while minimizing the probability of refusal, effectively bypassing the model's safety guardrails. Standard alignment methods fail to defend against these worst-case adversarial manipulations because they rely on learned patterns from benign training data rather than robust min-max optimization against active adversaries.
Examples: The vulnerability can be reproduced using optimization-based attacks such as VisualAdv or MMPGDBlank. The attack involves the following technical steps:
- Objective Formulation: Define a loss function $L_\theta(x_I, x_T, Y_r)$ as the likelihood of a rejected/harmful response ($Y_r$) given an image ($x_I$) and text prompt ($x_T$); the attack maximizes this quantity.
- Perturbation Optimization (PGD): Iteratively update a perturbation $\delta$ added to the image to maximize this loss, constrained to a set $\Delta$ (e.g., an $l_p$-norm ball that keeps the noise imperceptible); a code sketch follows this list: $$ \delta^{t+1} = \Pi_{\Delta}\left(\delta^{t} + \alpha \cdot \text{sign}\left(\nabla_{\delta} L_{\theta}(x_{I} + \delta^{t}, x_{T}, Y_{r})\right)\right) $$
- Execution: Submit the perturbed image $x_I + \delta^*$ alongside a harmful text query.
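The PGD loop above can be captured in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's attack code: `loss_fn` is a hypothetical callable returning the target log-likelihood $L_\theta(x_I, x_T, Y_r)$, and pixel values are assumed to lie in $[0, 1]$.

```python
import torch

def pgd_image_attack(loss_fn, x_image, x_text, y_rejected,
                     eps=8 / 255, alpha=2 / 255, steps=100):
    # loss_fn: assumed callable returning L_theta(x_I, x_T, Y_r), the
    # log-likelihood of the harmful target Y_r under the victim VLM.
    delta = torch.zeros_like(x_image, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x_image + delta, x_text, y_rejected)
        loss.backward()
        with torch.no_grad():
            # Gradient *ascent*: increase the likelihood of Y_r.
            delta += alpha * delta.grad.sign()
            # Project back into the l_inf ball of radius eps ...
            delta.clamp_(-eps, eps)
            # ... and keep the perturbed image in the valid pixel range.
            delta.copy_((x_image + delta).clamp(0, 1) - x_image)
        delta.grad.zero_()
    return delta.detach()
```

The sign-of-gradient step and the two clamps realize the $\alpha \cdot \text{sign}(\nabla)$ ascent and the projection $\Pi_\Delta$ from the update rule above for an $l_\infty$ constraint.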
Specific attack implementations referenced in the study include:
- VisualAdv: Optimizing a universal adversarial pattern applicable across multiple harmful behaviors.
- MMPGDBlank: Optimizing a distinct image perturbation for specific harmful behaviors (one-to-one attack).
- See the HarmBench multimodal dataset for specific harmful query anchors used to generate these attacks.
Impact:
- Safety Guardrail Bypass: Allows malicious actors to force the model to generate content that violates safety policies (e.g., hate speech, bomb-making instructions, self-harm content).
- Latent Space Manipulation: Shifts the latent representation of harmful queries towards the clusters of harmless queries, deceiving the model's internal safety classifiers.
- Universal Vulnerability: Adversarial patterns optimized on specific queries may generalize to other harmful inputs.
Affected Systems:
- LLaVA-1.5-7b
- LLaVA-1.6-7b
- VLMs aligned solely via standard Supervised Fine-Tuning (SFT)
- VLMs aligned solely via standard Direct Preference Optimization (DPO)
Mitigation Steps: The paper proposes implementing Adversary-Aware DPO (ADPO), a training framework that integrates adversarial training into the alignment process:
- Adversarial-Trained Reference Model (AR-DPO): Train the DPO reference model using adversarial examples. During training, generate worst-case perturbations using PGD (as described in the Examples section) and update the model to minimize loss on these perturbed inputs. This ensures the reference model serves as a robust benchmark.
- Adversarial-Aware DPO Loss (AT-DPO): Modify the DPO loss function to incorporate a min-max optimization loop. In each training iteration, compute worst-case perturbations $\delta^*$ and add them to the input images (or to the latent space) before calculating the preference loss (a code sketch follows this list): $$ \mathcal{L}_{\text{ADPO}} = -\log \sigma \left( \beta \log \frac{f_\theta(Y_p | x_I + \delta^*, x_T)}{f_{\theta_{AT}}(Y_p | x_I, x_T)} - \beta \log \frac{f_\theta(Y_r | x_I + \delta^*, x_T)}{f_{\theta_{AT}}(Y_r | x_I, x_T)} \right) $$
- Latent Space Perturbation: For computational efficiency, adversarial perturbations can be optimized and applied in the image-text latent space (e.g., at layers 8, 16, 24, and 30 of the backbone LLM) rather than in the discrete pixel space; a hook-based sketch follows this list.
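A minimal sketch of the AT-DPO step under stated assumptions: `pgd_image_attack` is the helper sketched earlier, and `policy_logp` / `ref_logp` are hypothetical callables returning sequence log-likelihoods under the policy $f_\theta$ and the adversarially trained reference $f_{\theta_{AT}}$. It illustrates the min-max structure of $\mathcal{L}_{\text{ADPO}}$, not the authors' released implementation.

```python
import torch.nn.functional as F

def adpo_step(policy_logp, ref_logp, x_image, x_text, y_pref, y_rej,
              beta=0.1, eps=8 / 255, alpha=2 / 255, attack_steps=10):
    # Inner maximization: worst-case image perturbation against the
    # current policy, targeting the rejected response Y_r.
    # NB: a real training loop must clear any parameter gradients the
    # inner attack accumulates on the policy before the outer update.
    delta = pgd_image_attack(policy_logp, x_image, x_text, y_rej,
                             eps=eps, alpha=alpha, steps=attack_steps)
    x_adv = x_image + delta

    # Outer minimization: DPO preference loss with the policy scored on
    # perturbed inputs and the reference scored on clean inputs, as in
    # the L_ADPO equation above.
    pref = policy_logp(x_adv, x_text, y_pref) - ref_logp(x_image, x_text, y_pref)
    rej = policy_logp(x_adv, x_text, y_rej) - ref_logp(x_image, x_text, y_rej)
    return -F.logsigmoid(beta * (pref - rej))
```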
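For the latent-space variant, the perturbation can be injected with forward hooks instead of pixel edits. The sketch below assumes a Hugging Face-style LLaVA backbone that exposes its transformer blocks as `model.model.layers`; the layer indices follow the ones cited above (8, 16, 24, 30), and `deltas` is a hypothetical dict of PGD-optimized perturbation tensors.

```python
def add_latent_perturbations(model, deltas, layer_ids=(8, 16, 24, 30)):
    # Register hooks that add a perturbation to the hidden states of the
    # chosen decoder layers; each deltas[i] must be broadcastable to that
    # layer's hidden-state shape (an assumption of this sketch).
    handles = []
    for i in layer_ids:
        def hook(module, inputs, output, i=i):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + deltas[i]
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        handles.append(model.model.layers[i].register_forward_hook(hook))
    return handles  # remove each handle (handle.remove()) after the step
```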