LMVD-ID: 75c00841
Published February 1, 2024

Multimodal Model Jailbreak

Affected Models: minigpt-v2, llava, instructblip, mplug-owl2, minigpt-4, vicuna-7b, vicuna-13b, llama2

Research Paper

Jailbreaking attack against multimodal large language model


Description: Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack that uses crafted images (image Jailbreaking Prompts, or imgJPs). When supplied as input alongside malicious prompts, these imgJPs cause the MLLM to bypass its safety mechanisms and generate objectionable content, including instructions for harmful activities such as identity theft or the creation of violent video games. The attack demonstrates both prompt-universality (a single imgJP works across multiple malicious prompts) and, to a lesser extent, image-universality (a single perturbation works across multiple images within a semantic category). The vulnerability stems from the interaction between the visual and text processing modules within the MLLM.
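The core idea can be illustrated with a short PGD-style optimization sketch. This is not the paper's exact algorithm: `model.loss` (assumed to return the negative log-likelihood of an affirmative target completion given an image and a prompt) and the prompt/target lists are hypothetical placeholders standing in for a real MLLM interface.

```python
import torch

def optimize_imgjp(model, prompts, targets, steps=1000, alpha=1/255):
    """Search for a single image that pushes the model toward affirmative
    completions across *all* prompts in the batch (prompt-universality)."""
    # Start from random noise; an unconstrained imgJP need not resemble
    # any natural image.
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    for _ in range(steps):
        # Average the target-likelihood loss over the prompt batch so the
        # same image generalizes across prompts rather than overfitting one.
        loss = sum(model.loss(img, p, t)
                   for p, t in zip(prompts, targets)) / len(prompts)
        loss.backward()
        with torch.no_grad():
            img -= alpha * img.grad.sign()  # step toward the target outputs
            img.clamp_(0.0, 1.0)            # keep pixel values valid
        img.grad.zero_()
    return img.detach()
```

For the image-universal variant described above, one would instead optimize a shared perturbation and clamp it to a small ball around each base image in a semantic category, rather than optimizing the whole image freely.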

Examples: See the paper's Appendix for examples of imgJPs successfully eliciting objectionable outputs from various MLLMs. The examples include prompts that result in the generation of harmful content, such as instructions for creating violent video games and committing fraud.

Impact: Successful exploitation allows attackers to bypass safety filters on MLLMs, leading to the generation of harmful content, including instructions for illegal activities, hate speech, and violent content. This undermines the safety mechanisms intended to prevent misuse of these powerful models.

Affected Systems: Multiple MLLMs are affected including, but not limited to, MiniGPT-v2, LLaVA, InstructBLIP, mPLUG-Owl2, and models based on LLaMA2 and Vicuna.

Mitigation Steps:

  • Improved safety mechanisms within MLLMs that remain robust under visual manipulation.
  • More effective detection techniques for adversarial images (a minimal sketch follows this list).
  • Further research into model architectures that are more resistant to this class of attack.
  • Increased scrutiny and adversarial testing during the development and deployment of MLLMs.
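One hedged sketch of the detection idea mentioned above: adversarial perturbations are often brittle to benign transforms, so a large shift in model behavior after JPEG re-compression can flag a suspicious input. Here `mllm_logits` is a hypothetical hook returning the model's next-token logits for an image/prompt pair, and the threshold is purely illustrative.

```python
import io

import torch
from PIL import Image

def jpeg_roundtrip(img: Image.Image, quality: int = 50) -> Image.Image:
    """Re-encode the image through lossy JPEG compression."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def looks_adversarial(mllm_logits, img: Image.Image, prompt: str,
                      tau: float = 0.5) -> bool:
    """Flag inputs whose model behavior shifts sharply under re-compression."""
    p1 = torch.softmax(mllm_logits(img, prompt), dim=-1)
    p2 = torch.softmax(mllm_logits(jpeg_roundtrip(img), prompt), dim=-1)
    # KL divergence between the two next-token distributions; a natural
    # image should be nearly invariant to mild JPEG compression.
    kl = torch.sum(p1 * (torch.log(p1 + 1e-9) - torch.log(p2 + 1e-9)))
    return kl.item() > tau  # tau is an arbitrary illustrative threshold
```

Transform-consistency checks like this are a known defense heuristic for adversarial images generally, not a mitigation evaluated in the paper, and a determined attacker may optimize imgJPs to survive such transforms.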
