LMVD-ID: 4c1ad06b
Published May 1, 2025

VLM Latent Boundary Crossing

Affected Models: GPT-4o, Claude 3.5 Sonnet, Llama 3.2 11B, Gemini 2.0 Flash, Qwen 2.5 VL 7B

Research Paper

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

View Paper

Description: Vision-Language Models (VLMs) contain a vulnerability in their multimodal fusion layers where safety-relevant information is linearly separable in the latent space. This allows for a "JailBound" attack, which exploits the implicit internal safety decision boundary. The attack proceeds in two stages: (1) Safety Boundary Probing, where attackers approximate the internal decision hyperplane by training layer-wise logistic regression classifiers on the fusion representations of safe versus unsafe inputs; and (2) Safety Boundary Crossing, where attackers jointly optimize adversarial perturbations for both the input image (continuous pixel manipulation) and the text prompt (discrete token suffix optimization). By steering the model’s internal fused representation across the probed boundary while minimizing geometric and semantic deviation, attackers can bypass safety alignment mechanisms and elicit policy-violating responses.

Examples: To reproduce the vulnerability, an attacker must implement the JailBound framework using the following optimization procedure:

  1. Safety Boundary Probing:
  • Curate a dataset $\mathbb{D} = \{(h^{(i)}, y^{(i)})\}_{i=1}^{N}$ containing fused representations $h$ of safe ($y=0$) and unsafe ($y=1$) inputs.
  • Train a logistic regression classifier ($w, b$) on a target fusion layer $l$ to minimize the cross-entropy: $$ \min_{w,b} \frac{1}{N} \sum_{i=1}^{N} \left[ -y^{(i)}\log P_m^{(i)} - (1-y^{(i)})\log\left(1-P_m^{(i)}\right) \right], $$ where $P_m^{(i)} = \sigma(w^\top h^{(i)} + b)$ is the probe's predicted probability that input $i$ is unsafe.
  • Derive the normal vector $v^{(l)}$ and perturbation magnitude $\varepsilon^{(l)}$ from the learned boundary (a minimal probing sketch appears after this list).
  2. Safety Boundary Crossing (Attack Execution):
  • Select a target harmful query (e.g., asking for instructions on synthesizing dangerous chemicals) from the MM-SafetyBench dataset.
  • Initialize a clean image $X_v^{raw}$ and the text input $X_t^{raw}$.
  • Iteratively optimize a visual perturbation $\delta_v^{input}$ and a textual suffix $X_t^{suffix}$ to minimize the total loss $\mathcal{L}_{total} = \mathcal{L}_{align} + \lambda_1 \mathcal{L}_{sem} + \lambda_2 \mathcal{L}_{geo}$, where $\mathcal{L}_{align}$ steers the fused representation across the probed boundary and $\mathcal{L}_{sem}$, $\mathcal{L}_{geo}$ penalize semantic and geometric deviation.
  • Visual Step: Update $\delta_v^{input}$ via gradient descent under an $L_\infty$ norm constraint ($\epsilon = 8/255$).
  • Textual Step: Update suffix tokens using gradient-based token replacement to match the target embedding shift $\delta_t^{emb} = -\eta_t \nabla_{x_t} \mathcal{L}_{total}$.
  • Result: The VLM generates the restricted content (a sketch of the joint optimization loop appears below).
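The probing stage reduces to fitting a linear classifier on cached fusion-layer activations. The sketch below is a minimal illustration, not the paper's released code: it assumes the fused representations have already been extracted into NumPy arrays, and the particular choice of $\varepsilon^{(l)}$ (mean distance of unsafe points from the hyperplane) is one plausible heuristic rather than a confirmed detail.

```python
# Minimal sketch of Stage 1 (Safety Boundary Probing).
# h_safe / h_unsafe are fusion-layer activations cached from the target layer l;
# the epsilon heuristic below is an assumption for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_safety_boundary(h_safe: np.ndarray, h_unsafe: np.ndarray):
    """Fit a logistic-regression probe on fused representations.

    h_safe:   (N_safe, d)   activations for safe inputs   (y = 0)
    h_unsafe: (N_unsafe, d) activations for unsafe inputs (y = 1)
    Returns the unit boundary normal v, bias b, and a perturbation
    magnitude eps derived from the learned hyperplane.
    """
    X = np.concatenate([h_safe, h_unsafe], axis=0)
    y = np.concatenate([np.zeros(len(h_safe)), np.ones(len(h_unsafe))])

    clf = LogisticRegression(max_iter=1000).fit(X, y)   # minimizes the cross-entropy above
    w, b = clf.coef_[0], float(clf.intercept_[0])

    v = w / np.linalg.norm(w)                            # unit normal of the decision hyperplane
    # Signed distance of each unsafe point to the hyperplane w.h + b = 0
    dist_unsafe = (h_unsafe @ w + b) / np.linalg.norm(w)
    eps = float(np.abs(dist_unsafe).mean())              # one plausible choice of epsilon^(l)
    return v, b, eps
```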

See the MM-SafetyBench dataset (arXiv:2311.17600) for the specific harmful prompt/image pairs used to validate this attack.
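The crossing stage can be sketched as an alternating optimization over a pixel-space perturbation and a discrete suffix. The PyTorch pseudocode below is a hedged sketch under stated assumptions: `fused_rep` (returns the fusion-layer representation for an image/token pair) and `gradient_token_swap` (a GCG-style greedy replacement step) are hypothetical helpers, and the step counts, step sizes, and loss weights are illustrative rather than the paper's values.

```python
# Sketch of Stage 2 (Safety Boundary Crossing): joint image + suffix optimization.
# fused_rep() and gradient_token_swap() are hypothetical helpers; hyperparameters
# are illustrative only.
import torch

def align_loss(h, v, b, eps):
    # Push the fused representation across the probed hyperplane by a margin eps:
    # penalize any shortfall of the signed distance (v . h + b) below eps.
    return torch.relu(eps - (h @ v + b))

def jailbound_attack(model, image, text_ids, suffix_ids, v, b, eps,
                     steps=200, lr_img=1/255, linf=8/255, lam1=0.1, lam2=0.1):
    v = torch.as_tensor(v, dtype=image.dtype)
    delta = torch.zeros_like(image, requires_grad=True)            # pixel-space perturbation
    h_clean = fused_rep(model, image,
                        torch.cat([text_ids, suffix_ids])).detach()  # reference representation

    for _ in range(steps):
        ids = torch.cat([text_ids, suffix_ids])
        h = fused_rep(model, image + delta, ids)

        sem = torch.norm(h - h_clean)          # semantic deviation   (L_sem)
        geo = torch.norm(delta)                # geometric deviation  (L_geo)
        loss = align_loss(h, v, b, eps) + lam1 * sem + lam2 * geo

        # Visual step: signed-gradient update on delta, projected onto the L_inf ball.
        grad_img, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= lr_img * grad_img.sign()
            delta.clamp_(-linf, linf)

        # Textual step: GCG-style greedy replacement. Gradients are taken w.r.t. the
        # suffix token embeddings, and each position is swapped to the vocabulary token
        # whose embedding best matches the target shift -eta_t * grad_{x_t} L.
        suffix_ids = gradient_token_swap(model, image + delta.detach(),
                                         text_ids, suffix_ids, v, b, eps)

    return (image + delta).detach(), suffix_ids
```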

Impact:

  • Safety Bypass: Circumvention of safety guardrails preventing the generation of hate speech, physical harm instructions, financial fraud advice, and illegal activity guides.
  • High Success Rate: Achieves up to 94.38% Attack Success Rate (ASR) in white-box settings.
  • Black-box Transfer: Adversarial examples generated on white-box models transfer effectively to commercial black-box models, achieving 75.24% ASR against GPT-4o and 70.06% against Gemini 2.0 Flash.

Affected Systems:

  • Llama-3.2-11B
  • Qwen2.5-VL-7B
  • MiniGPT-4
  • (Transferable to) GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet

Mitigation Steps:

  • Cross-Modal Safety Mechanisms: Implement safety alignment strategies that specifically secure latent knowledge representations within fusion layers, rather than relying solely on text-decoder alignment.
  • Obfuscation of Latent Boundaries: Develop training techniques that reduce the linear separability of safe vs. unsafe inputs in the fusion layer's latent space to prevent accurate boundary probing (a hedged sketch of one such regularizer follows this list).
  • Adversarial Training: Incorporate fusion-centric adversarial examples (perturbing both image and text simultaneously) into the model's safety training curriculum.
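One way to operationalize the boundary-obfuscation mitigation is to attach an auxiliary linear probe to the fusion layer and penalize the backbone when that probe separates safe from unsafe inputs too easily, in a gradient-reversal style. The sketch below is a hypothetical illustration of that idea, not a technique evaluated in the paper; the probe architecture, loss weight, and the assumption that fused representations are exposed as `h_fused` are all illustrative.

```python
# Hypothetical regularizer that discourages linear separability of safe vs. unsafe
# inputs in the fusion-layer latent space. Gradient-reversal sketch for illustration
# only; not from the JailBound paper.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out  # flip gradients so the backbone degrades the probe's accuracy

class SeparabilityPenalty(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)   # adversarial linear probe
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, h_fused: torch.Tensor, is_unsafe: torch.Tensor) -> torch.Tensor:
        # The probe is trained to classify safe vs. unsafe from h_fused, while the
        # reversed gradient pushes the backbone to make that classification harder.
        logits = self.probe(GradReverse.apply(h_fused)).squeeze(-1)
        return self.bce(logits, is_unsafe.float())

# Usage inside the safety-training loop (weight is illustrative):
#   loss = task_loss + 0.1 * separability_penalty(h_fused, is_unsafe)
```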

© 2026 Promptfoo. All rights reserved.