Multimodal Risk Distribution Jailbreak
Research Paper
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
Description: Multimodal Large Language Models (MLLMs) are vulnerable to a heuristic-induced multimodal risk distribution jailbreak attack. The attack circumvents safety mechanisms by distributing a malicious prompt across the text and image modalities, so that neither modality alone carries detectable harmful intent. An auxiliary LLM then generates heuristic prompts that guide the target MLLM into reconstructing the full malicious prompt and producing the desired harmful output.
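A minimal sketch of the risk-distribution step is shown below, using a benign stand-in payload. It is an illustration of the splitting idea only, not the paper's exact algorithm: the naive midpoint split, the typographic rendering, and all prompt wording here are assumptions (the paper instead uses an auxiliary LLM to search for the split and the heuristic prompts).

```python
"""
Illustrative sketch of multimodal risk distribution: split an instruction
into fragments so neither modality alone carries the full request, render
one fragment into an image, and ask the target model to recombine them.
A benign demo payload is used; all wording is an illustrative assumption.
"""
from PIL import Image, ImageDraw


def split_instruction(instruction: str) -> tuple[str, str]:
    # Naive midpoint split at a word boundary; the paper's attack uses an
    # auxiliary LLM to choose a split that keeps each half benign-looking.
    words = instruction.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def render_text_image(fragment: str, path: str = "fragment.png") -> str:
    # Typographic rendering: the image carries one fragment as plain text,
    # so an image-only safety filter never sees a complete request.
    img = Image.new("RGB", (640, 120), "white")
    ImageDraw.Draw(img).text((10, 50), fragment, fill="black")
    img.save(path)
    return path


def build_text_prompt(text_fragment: str) -> str:
    # Hypothetical "understanding-enhancing" framing: the text prompt asks
    # the target MLLM to reconstruct the full sentence internally before
    # answering, combining the image text with this fragment.
    return (
        "Read the text in the attached image, append the phrase "
        f"'{text_fragment}' to it, and respond to the combined sentence."
    )


if __name__ == "__main__":
    # Benign stand-in payload for demonstration purposes only.
    demo = "explain how photosynthesis converts light into chemical energy"
    img_part, txt_part = split_instruction(demo)
    print("image:", render_text_image(img_part))
    print("prompt:", build_text_prompt(txt_part))
```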
Examples: See the paper's supplementary materials for examples of successful attacks against various open-source and closed-source MLLMs, including specific prompts and image manipulations. Examples include generating instructions for illegal activities like drug production or bypassing age verification on adult websites.
Impact: Successful exploitation allows attackers to bypass MLLM safety restrictions, leading to the generation of harmful content including, but not limited to: instructions for illegal activities, hate speech, malware generation, plans for physical harm, fraudulent schemes, sexually explicit content, and privacy violations. The attack has demonstrated a high success rate across multiple popular MLLMs.
Affected Systems: Multiple open-source and closed-source MLLMs, including (but not limited to) LLaVA, DeepSeek, Qwen-VL-Chat, Yi-VL-34B, GLM-4V-9B, MiniGPT-4, GPT-4, Gemini, and Qwen-VL-Max. Specific versions are not identified in the paper.
Mitigation Steps:
- Improve multimodal safety mechanisms to detect malicious prompts distributed across multiple modalities, for example by screening the fused text-plus-image content rather than each modality in isolation (see the sketch after this list).
- Develop more robust methods for identifying and neutralizing the understanding-enhancing and inducing prompts used to steer the target model.
- Implement more sophisticated detectors for harmful intent, even when that intent is split across modalities.
- The paper does not offer detailed mitigation strategies; further research is needed to develop effective countermeasures.
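As a concrete illustration of the first point, here is a minimal sketch of cross-modal screening, under the assumption that the image payload is typographic text recoverable by OCR. `moderation_flagged` is a hypothetical stand-in for whatever safety classifier the deployment actually uses; this is not a mitigation prescribed by the paper.

```python
"""
Sketch of one cross-modal check: OCR any text embedded in the image, fuse it
with the text prompt, and moderate the *combined* string, so that intent
split across modalities is scored as a whole. Requires the Tesseract binary
for pytesseract to work.
"""
import pytesseract
from PIL import Image


def moderation_flagged(text: str) -> bool:
    # Hypothetical placeholder: substitute the deployment's real safety
    # classifier (e.g., an LLM-based or fine-tuned moderation model).
    blocklist = ("synthesize", "bypass verification")
    return any(term in text.lower() for term in blocklist)


def cross_modal_screen(text_prompt: str, image_path: str) -> bool:
    """Return True if the fused text+image content should be blocked."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    fused = f"{text_prompt}\n{ocr_text}"
    # Screen the reconstruction the attacker hopes the MLLM will perform,
    # not each modality in isolation.
    return moderation_flagged(fused)
```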