Multi-Modal VLM Jailbreak

Description: A novel jailbreak attack, Multi-Modal Linkage (MML), exploits the vulnerability in Large Vision-Language Models (VLMs) by leveraging an "encryption-decryption" scheme across text and image modalities. MML encrypts malicious queries within images (e.g., using word replacement, image transformations) to bypass initial safety mechanisms. A subsequent text prompt guides the VLM to "decrypt" the content, eliciting harmful outputs. "Evil alignment," framing the attack within a video game scenario, further enhances the attack's success rate.

Examples: See GitHub repository https://github.com/wangyu-ovo/MML for code and examples. Specific examples are provided within the research paper, but reproducing them requires access to the target VLMs and the described image manipulation techniques.

Impact: Successful MML attacks allow adversaries to bypass safety filters and elicit policy-violating outputs from state-of-the-art VLMs, including the generation of harmful content related to illegal activities, hate speech, violence, and more. This compromises the safety and trustworthiness of these models.

Affected Systems: Large Vision-Language Models (VLMs), including but not limited to GPT-4o, GPT-4o-Mini, QwenVL-Max-0809, and Claude-3.5-Sonnet. The vulnerability is likely present in other VLMs with similar architectures and safety mechanisms.

Mitigation Steps:

Improve model robustness to adversarial examples within both text and image modalities.
Develop more sophisticated safety filters capable of detecting and mitigating MML-style attacks.
Implement stricter input validation and sanitization processes.
Invest in research into more resilient multimodal safety alignment techniques.

Multi-Modal VLM Jailbreak

Research Paper