Multi-Modal VLM Jailbreak
Research Paper
Jailbreak Large Visual Language Models Through Multi-Modal Linkage
View PaperDescription: A novel jailbreak attack, Multi-Modal Linkage (MML), exploits the vulnerability in Large Vision-Language Models (VLMs) by leveraging an "encryption-decryption" scheme across text and image modalities. MML encrypts malicious queries within images (e.g., using word replacement, image transformations) to bypass initial safety mechanisms. A subsequent text prompt guides the VLM to "decrypt" the content, eliciting harmful outputs. "Evil alignment," framing the attack within a video game scenario, further enhances the attack's success rate.
Examples: See GitHub repository https://github.com/wangyu-ovo/MML for code and examples. Specific examples are provided within the research paper, but reproducing them requires access to the target VLMs and the described image manipulation techniques.
Impact: Successful MML attacks allow adversaries to bypass safety filters and elicit policy-violating outputs from state-of-the-art VLMs, including the generation of harmful content related to illegal activities, hate speech, violence, and more. This compromises the safety and trustworthiness of these models.
Affected Systems: Large Vision-Language Models (VLMs), including but not limited to GPT-4o, GPT-4o-Mini, QwenVL-Max-0809, and Claude-3.5-Sonnet. The vulnerability is likely present in other VLMs with similar architectures and safety mechanisms.
Mitigation Steps:
- Improve model robustness to adversarial examples within both text and image modalities.
- Develop more sophisticated safety filters capable of detecting and mitigating MML-style attacks.
- Implement stricter input validation and sanitization processes.
- Invest in research into more resilient multimodal safety alignment techniques.
© 2025 Promptfoo. All rights reserved.