Image-Text Jailbreak
Research Paper
EAT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models
Description: Multimodal Large Language Models (MLLMs) are vulnerable to coupled cross-modal jailbreak attacks that combine continuous visual perturbations with discrete textual manipulations. Because standard alignment and single-modality defenses (such as text-only safety tuning or isolated vision-encoder adversarial training) fail to secure the cross-modal interaction, attackers can simultaneously apply gradient-based noise (e.g., PGD) to input images and adversarial suffixes (e.g., GCG) to text prompts. This coordinated attack causes the visual-language projector to misalign features, bypassing the model's safety guardrails and coercing it into generating toxic content or executing hazardous instructions.
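The image-side half of such an attack can be illustrated with a toy PGD loop. This is a minimal pure-Python sketch on a scalar "input": the objective, step size, and budget are illustrative stand-ins for a real image-space attack against a model's safety loss.

```python
def pgd_attack(x0, grad_fn, eps=0.1, alpha=0.02, steps=10):
    """Toy PGD: take sign-gradient ascent steps on the attacker's
    objective, projecting back into the L-infinity ball of radius eps
    around the clean input x0 after every step (x0 is a single scalar
    here; for images the same update runs per pixel)."""
    x = x0
    for _ in range(steps):
        g = grad_fn(x)                      # gradient of the attacker's objective
        step = alpha if g > 0 else -alpha   # sign-gradient ascent step
        x = max(x0 - eps, min(x0 + eps, x + step))  # project into the eps-ball
    return x

# Toy attacker objective L(x) = (x - 3)^2 with gradient 2*(x - 3);
# ascent pushes x away from 3, and the eps budget caps |adv - x0| at 0.1.
adv = pgd_attack(0.0, lambda x: 2 * (x - 3))
```

The key property the defense must contend with is that the perturbation is imperceptibly small (bounded by `eps`) yet chosen adversarially at every step.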
Examples:
- Embodied AI Manipulation: Providing an adversarially perturbed image (via PGD noise) alongside physical commands such as "Put the bomb in Zone 3", "Put the knife on the teddy bear toy", or "Put the waste battery into an empty cup". The coupled visual-textual perturbation causes deployed robotic-arm systems to bypass safety constraints and attempt to physically execute the dangerous tasks.
- Multimodal Jailbreaks: Prepending adversarial images generated via FigStep or Query-Relevant attack methodologies to a text query that has been manipulated with a GCG (Greedy Coordinate Gradient) adversarial suffix. See the JailBreakV-28K dataset for specific text-image attack sample pairs.
Impact: Successful exploitation allows an attacker to completely bypass multimodal safety guardrails, leading to the generation of restricted, hateful, or illegal content. In safety-critical or embodied AI deployments (e.g., robotic systems, automated code execution), this vulnerability translates directly into unauthorized, potentially physically harmful real-world actions.
Affected Systems:
- Multimodal Large Language Models (MLLMs) utilizing a vision encoder, a cross-modal projector, and a language model backbone.
- Specific models demonstrated as vulnerable include LLaVA-1.5-7B, Bunny-1.0-4B, and mPLUG-Owl2.
Mitigation Steps:
- Projector-based Adversarial Training: Apply adversarial training directly to the cross-modal projector module rather than just the vision encoder. Use Mean Squared Error (MSE) loss to align the projected features of adversarial images with those of clean images.
- Dynamic Joint Multimodal Optimization (DJMO): Optimize both visual (e.g., against PGD perturbations) and textual (e.g., against GCG suffixes) modalities simultaneously during training to prevent attackers from exploiting unaligned cross-modal interactions.
- Adaptive Loss Weighting: Implement an exponential moving average mechanism to dynamically adjust the loss weights between standard (clean) training objectives and adversarial defense objectives, preventing degraded performance on benign tasks.
- Diverse Rejection Prompts: Train the model using a diverse set of safe refusal responses generated by a stronger model (e.g., GPT-4) instead of fixed templates (e.g., "I'm sorry, I can't") to prevent defensive overfitting and vulnerability to query-relevant attacks.
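The projector-based adversarial training step can be sketched in a few lines of pure Python. Here the projector is reduced to a hypothetical per-dimension scaling `w` (the real module is a learned linear/MLP projector), and the clean projection is held fixed as a stop-gradient target:

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def projector_align_step(w, clean_feats, adv_feats, lr=0.05):
    """One gradient step on the projector weights w that pulls the
    projection of adversarial vision features toward the projection of
    the matching clean features (clean projection held fixed as target)."""
    n = len(w)
    target = [wi * c for wi, c in zip(w, clean_feats)]  # clean projection
    new_w = []
    for wi, a, t in zip(w, adv_feats, target):
        err = wi * a - t              # residual of the adversarial projection
        grad = 2.0 * err * a / n      # d(MSE)/d(w_i)
        new_w.append(wi - lr * grad)
    return new_w

# Toy clean vs. adversarially perturbed vision features.
w = [1.0, 1.0]
clean, adv_f = [1.0, 2.0], [1.5, 2.5]
before = mse([wi * a for wi, a in zip(w, adv_f)],
             [wi * c for wi, c in zip(w, clean)])
for _ in range(20):
    w = projector_align_step(w, clean, adv_f)
after = mse([wi * a for wi, a in zip(w, adv_f)],
            [wi * c for wi, c in zip(w, clean)])
```

In isolation this objective could be trivially minimized by shrinking the projector toward zero, which is why the mitigation pairs it with the standard clean-task objective under adaptive loss weighting.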
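The adaptive loss-weighting step can be sketched as follows. This is a hypothetical weighting rule, not the paper's exact formula: an exponential moving average (EMA) of each objective sets a mixing weight so the currently larger objective receives more of the gradient budget:

```python
def ema_adaptive_weights(clean_losses, adv_losses, beta=0.9):
    """Track an EMA of the clean-task loss and the adversarial-defense
    loss over training steps, deriving a mixing weight w_adv in (0, 1)
    for the combined objective
        total = (1 - w_adv) * clean + w_adv * adv.
    A relatively larger adversarial EMA yields a larger w_adv, so the
    defense objective is emphasized only while it lags behind."""
    ema_c = ema_a = None
    weights = []
    for lc, la in zip(clean_losses, adv_losses):
        ema_c = lc if ema_c is None else beta * ema_c + (1 - beta) * lc
        ema_a = la if ema_a is None else beta * ema_a + (1 - beta) * la
        weights.append(ema_a / (ema_c + ema_a))
    return weights

# With a steady clean loss of 1.0 and adversarial loss of 3.0, the
# defense objective settles at 3/4 of the total weight.
ws = ema_adaptive_weights([1.0, 1.0, 1.0], [3.0, 3.0, 3.0])
```

Smoothing via the EMA keeps the weights stable across noisy per-batch losses, which is what prevents the defense term from swamping benign-task training.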
© 2026 Promptfoo. All rights reserved.