LMVD-ID: f7fe3dc3
Published December 1, 2024

Multi-Modal VLM Jailbreak

Affected Models: gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18, qwenvl-max-0809, claude-3.5-sonnet-20241022

Research Paper

Jailbreak Large Visual Language Models Through Multi-Modal Linkage

Description: Multi-Modal Linkage (MML) is a jailbreak attack that exploits a weakness in Large Vision-Language Models (VLMs) through an "encryption-decryption" scheme spanning the text and image modalities. MML encrypts a malicious query inside an image (e.g., via word replacement or image transformations) so that it slips past initial safety mechanisms; a follow-up text prompt then instructs the VLM to "decrypt" the content, eliciting harmful outputs. An "evil alignment" framing, which casts the exchange as a video game scenario, further raises the attack success rate.
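The "encryption" step itself relies on ordinary string and image operations. The sketch below is a minimal illustration of the idea only, assuming Pillow, a deliberately benign placeholder query, and horizontal mirroring as the image transform; it is not the authors' implementation (see the linked repository for that) and omits the decryption prompt entirely.

```python
# Minimal sketch of the MML-style "encryption" idea: apply word replacement,
# render the rewritten query into an image, then apply a reversible image
# transform (mirroring here is an illustrative assumption). Benign query only.
from PIL import Image, ImageDraw, ImageOps

def encrypt_query(query: str, word_map: dict[str, str]) -> Image.Image:
    # Word replacement: substitute selected terms with innocuous stand-ins.
    for original, replacement in word_map.items():
        query = query.replace(original, replacement)

    # Render the rewritten query onto a blank image.
    img = Image.new("RGB", (512, 64), "white")
    ImageDraw.Draw(img).text((10, 20), query, fill="black")

    # Reversible transform the model is later asked to "decrypt".
    return ImageOps.mirror(img)

# Benign placeholder example; the paper pairs this with a decryption prompt.
encrypted = encrypt_query("describe the chemistry of baking soda",
                          {"baking soda": "ITEM_A"})
encrypted.save("encrypted_query.png")
```

The transform must stay simple enough for the model to undo from the textual instructions alone, which is what makes the later "decryption" prompt effective.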

Examples: See the GitHub repository https://github.com/wangyu-ovo/MML for code and examples. The research paper provides specific examples, but reproducing them requires access to the target VLMs and the described image manipulation techniques.

Impact: Successful MML attacks allow adversaries to bypass safety filters and elicit policy-violating outputs from state-of-the-art VLMs, including the generation of harmful content related to illegal activities, hate speech, violence, and more. This compromises the safety and trustworthiness of these models.

Affected Systems: Large Vision-Language Models (VLMs), including but not limited to GPT-4o, GPT-4o-Mini, QwenVL-Max-0809, and Claude-3.5-Sonnet. The vulnerability is likely present in other VLMs with similar architectures and safety mechanisms.

Mitigation Steps:

  • Improve model robustness to adversarial examples within both text and image modalities.
  • Develop more sophisticated safety filters capable of detecting and mitigating MML-style attacks.
  • Implement stricter input validation and sanitization of image inputs, for example OCR-based pre-screening as sketched after this list.
  • Invest in research into more resilient multimodal safety alignment techniques.
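As one concrete form the input-validation step could take, a deployment might OCR the incoming image and a few reversible transforms of it, then route any recovered text through the text-safety filter it already uses. The sketch below illustrates that idea under stated assumptions: Pillow and pytesseract for OCR, and a caller-supplied is_flagged check standing in for the existing filter. It is a sketch, not part of the paper or of any specific product.

```python
# Minimal sketch: OCR-based pre-screening of image inputs before they reach
# the VLM. Checks the image plus simple reversible transforms of it, since
# MML-style attacks may mirror, flip, or rotate the embedded text.
# Assumes Pillow and pytesseract are installed; `is_flagged` is a stand-in
# for whatever text-safety filter the deployment already uses.
from typing import Callable
from PIL import Image, ImageOps
import pytesseract

def image_passes_screen(img: Image.Image,
                        is_flagged: Callable[[str], bool]) -> bool:
    variants = [
        img,                      # original
        ImageOps.mirror(img),     # horizontal mirror
        ImageOps.flip(img),       # vertical flip
        img.rotate(180),          # 180-degree rotation
    ]
    for variant in variants:
        text = pytesseract.image_to_string(variant)
        if text.strip() and is_flagged(text):
            return False          # block the request
    return True                   # allow it through
```

A screen like this only raises the cost of the simplest transformations; heavier visual encryptions would still get through, which is why the alignment research called out above remains necessary.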
