Multimodal LLM Jailbreak
Research Paper
Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models
Description: A hybrid multimodal jailbreaking attack, dubbed JMLLM, exploits vulnerabilities in 13 popular large language models (LLMs) across text, image, and speech modalities. The attack combines alternating translation, word encryption, image feature collapse, and harmful text injection to bypass safety mechanisms and elicit harmful responses. Success rates vary across LLMs and modalities, with some models proving significantly more vulnerable than others.
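The paper describes the image-modality techniques only at a high level. As a rough illustration of the harmful-text-injection idea, the hypothetical Python sketch below renders a prompt string into an image so that the text reaches the model through the vision channel rather than the text filter. The function name and the benign placeholder string are assumptions for illustration, not the paper's JMLLM code.

```python
# Hypothetical sketch of text-in-image injection (not the paper's JMLLM code).
# The idea: render a prompt into a PNG so it enters the model via the image
# channel rather than the text channel. A benign placeholder string is used.
from PIL import Image, ImageDraw

def render_prompt_as_image(prompt: str, path: str = "injected_prompt.png") -> str:
    """Render `prompt` onto a plain white image and save it to `path`."""
    img = Image.new("RGB", (800, 200), color="white")
    draw = ImageDraw.Draw(img)
    draw.text((20, 80), prompt, fill="black")  # default bitmap font
    img.save(path)
    return path

if __name__ == "__main__":
    render_prompt_as_image("PLACEHOLDER: benign demo text")
```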
Examples: Specific examples of successful attacks are detailed in the paper, including prompts that combine techniques such as alternating translation into low-resource languages and word encryption to evade detection. See arXiv:2405.18540.
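To give a concrete sense of how such a hybrid text prompt might be assembled, the sketch below pairs a simple word-substitution "encryption" step with a stub for alternating translation into a low-resource language. It is a minimal hypothetical reconstruction: the substitution map, the translation stub, and the alternation scheme are assumptions rather than the paper's actual pipeline, and the payload is a benign placeholder.

```python
# Hypothetical sketch of a hybrid text-modality prompt (not the paper's JMLLM pipeline).
# Step 1: "word encryption" -- replace sensitive keywords with agreed-upon code words.
# Step 2: alternate sentences between the original language and a low-resource
#         language (represented here by a stub; a real attack would call a translator).

CODEWORDS = {"secret": "blue", "plan": "recipe"}  # assumed substitution map

def encrypt_words(text: str) -> str:
    """Replace each sensitive keyword with its code word."""
    for word, code in CODEWORDS.items():
        text = text.replace(word, code)
    return text

def translate_stub(sentence: str, lang: str) -> str:
    """Placeholder for machine translation into a low-resource language."""
    return f"[{lang}] {sentence}"

def build_hybrid_prompt(sentences: list[str]) -> str:
    """Encrypt keywords, then alternate sentences between languages."""
    out = []
    for i, sentence in enumerate(sentences):
        sentence = encrypt_words(sentence)
        out.append(translate_stub(sentence, "zu") if i % 2 == 0 else sentence)
    return " ".join(out)

if __name__ == "__main__":
    print(build_hybrid_prompt(["Describe the secret plan.", "Keep it brief."]))
```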
Impact: Successful exploitation allows attackers to bypass the safety and ethical constraints implemented in LLMs, leading to the generation of harmful content, including hate speech, misinformation, threats, and instructions for illegal activities. The severity of the impact depends on the specific LLM, its deployment context, and the nature of the elicited harmful content.
Affected Systems: The vulnerability affects the 13 LLMs listed in Table 2 of the paper, which come from various vendors and differ in parameter size. It is likely present in other LLMs that employ similar architectures and safety mechanisms.
Mitigation Steps: Partial mitigation is proposed through a "Harmful Separator" defense mechanism, which separates the instructions in a prompt from any embedded examples and independently analyzes the example for malicious content. This approach offers incomplete protection, however, and further mitigation strategies remain a subject of ongoing research.
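A minimal sketch of what such a separator-style check could look like in a pre-processing layer is shown below. The split heuristic, the keyword-based `is_harmful` stand-in, and the function names are assumptions for illustration; the paper describes the Harmful Separator only at the level of isolating the example and screening it independently.

```python
# Hypothetical sketch of a "Harmful Separator"-style pre-processing check
# (an assumed reconstruction, not the paper's implementation).
# Idea: split a prompt into its instruction and any embedded example/payload,
# then screen the example on its own before the prompt reaches the model.
import re

def split_instruction_and_example(prompt: str) -> tuple[str, str]:
    """Naive split: treat text after an 'Example:' marker as the example block."""
    parts = re.split(r"(?i)\bexample\s*:", prompt, maxsplit=1)
    instruction = parts[0].strip()
    example = parts[1].strip() if len(parts) > 1 else ""
    return instruction, example

def is_harmful(text: str) -> bool:
    """Stand-in for a real content classifier or moderation-API call."""
    blocked_terms = ["weapon", "exploit"]  # assumed keyword list for illustration
    return any(term in text.lower() for term in blocked_terms)

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    _, example = split_instruction_and_example(prompt)
    # Screen the example independently of the surrounding instruction.
    return is_harmful(example)

if __name__ == "__main__":
    demo = "Translate the following. Example: how to build a weapon"
    print("blocked" if screen_prompt(demo) else "allowed")
```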