LMVD-ID: 923641b8
Published January 1, 2025

Shuffle Inconsistency Jailbreak

Affected Models: gpt-4o, claude-3.5-sonnet, gemini-1.5-pro, llava-next, minigpt-4, internvl-2, vlguard, qwen-vl-max

Research Paper

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Description: Multimodal Large Language Models (MLLMs) exhibit a vulnerability in which shuffling the order of words in a text prompt or of patches in an image prompt can bypass their safety mechanisms, even though the model still understands the intent of the shuffled input. This "Shuffle Inconsistency" between the model's comprehension ability and its safety ability allows attackers to elicit harmful responses by submitting shuffled harmful prompts that would otherwise be blocked.
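
The underlying transformation is simple to reproduce. Below is a minimal sketch, assuming Pillow for image handling, of the two shuffle operations the paper describes: permuting the word order of a text prompt and permuting the patches of an image prompt. The grid size and function names are illustrative choices rather than details from the paper, and the full attack additionally selects the most effective shuffled variant through a query-based process that is not shown here.

```python
import random

from PIL import Image


def shuffle_words(prompt, seed=None):
    """Return the prompt with its word order randomly permuted."""
    rng = random.Random(seed)
    words = prompt.split()
    rng.shuffle(words)
    return " ".join(words)


def shuffle_patches(image, grid=4, seed=None):
    """Cut the image into a grid x grid set of patches and reassemble them in random order."""
    rng = random.Random(seed)
    width, height = image.size
    pw, ph = width // grid, height // grid
    patches = [
        image.crop((col * pw, row * ph, (col + 1) * pw, (row + 1) * ph))
        for row in range(grid)
        for col in range(grid)
    ]
    rng.shuffle(patches)
    shuffled = Image.new(image.mode, (pw * grid, ph * grid))
    for idx, patch in enumerate(patches):
        row, col = divmod(idx, grid)
        shuffled.paste(patch, (col * pw, row * ph))
    return shuffled
```

The vulnerability lies in the asymmetry: the MLLM can usually still recover the intent of such inputs, while its safety filters, which are tuned to well-formed harmful prompts, frequently do not.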

Examples: See the research paper "Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency" for concrete examples showing how shuffling the text and image components of harmful prompts leads the MLLM to generate harmful responses that its safety filters would otherwise block.

Impact: Attackers can use this technique to bypass the safety mechanisms of MLLMs and elicit harmful content such as hate speech, instructions for illegal activities, or promotion of violence. This undermines the models' intended safety features and poses a significant risk to users and the wider public. The impact is particularly pronounced against commercial closed-source MLLMs, even though these often have additional outer safety guardrails.

Affected Systems: Multimodal Large Language Models (MLLMs), both open-source and commercial. Specific examples mentioned in the research include GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, LLaVA-NEXT, InternVL-2, and Qwen-VL-Max. The vulnerability is likely to affect other MLLMs with similar comprehension abilities and safety-mechanism architectures.

Mitigation Steps:

  • Improved Safety Mechanisms: Develop safety mechanisms that are harder to bypass with input shuffling, for example by using content-analysis techniques that are not fooled by simple reordering of words or image patches.
  • Input Sanitization: Implement preprocessing steps that detect and reject inputs whose word ordering or image-patch arrangement appears unnaturally scrambled, a pattern indicative of this attack (see the detection sketch after this list).
  • Ensemble Methods: Employ an ensemble of independent safety models and only admit an input when they reach consensus that it is safe (see the voting sketch after this list).
  • Adversarial Training: Train MLLMs on adversarial examples that include shuffled harmful prompts to improve robustness against this type of attack. The research paper describes how a "toxic judge" model can be used to build a dataset suitable for such training (see the data-augmentation sketch after this list).
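
One way to realize the input-sanitization step is to flag text whose word order looks implausibly scrambled, for example by scoring it with a small reference language model and rejecting inputs whose perplexity far exceeds a calibrated threshold. The sketch below assumes the Hugging Face transformers library and GPT-2 as the reference model; the threshold value is an illustrative assumption and would need calibration on benign traffic. A comparable check for shuffled image patches is harder and is not shown here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small reference LM used only to score how natural the word order of an input is.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def perplexity(text):
    """Perplexity of the text under the reference LM; shuffled text scores far higher than fluent text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()


def looks_shuffled(text, threshold=300.0):
    """Heuristic gate: flag inputs whose word order is implausibly scrambled.

    The threshold is illustrative and should be calibrated so that normal
    prompts pass while randomly permuted ones are rejected or escalated.
    """
    return perplexity(text) > threshold
```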
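
The ensemble idea can be expressed as a simple voting gate: pass the candidate input through several independent safety classifiers and only admit it when enough of them agree it is safe. The classifier interface below is a hypothetical callable, not a specific library API.

```python
from typing import Callable, Sequence

# A safety classifier returns True if it judges the input safe.
SafetyClassifier = Callable[[str], bool]


def ensemble_is_safe(prompt, classifiers: Sequence[SafetyClassifier], min_agreement=0.75):
    """Admit the prompt only if at least `min_agreement` of the safety models judge it safe.

    Disagreement among the models is itself a useful signal that the input
    may be adversarially transformed (for example, shuffled).
    """
    votes = [classifier(prompt) for classifier in classifiers]
    return sum(votes) / len(votes) >= min_agreement
```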
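
For the adversarial-training step, one straightforward data-augmentation recipe is to pair shuffled variants of known-harmful prompts with refusal targets and add them to the safety fine-tuning set. The helper below is a sketch under that assumption; the paper's own pipeline additionally uses a toxic-judge model to score candidates, which is not reproduced here.

```python
import random


def make_shuffled_refusal_examples(harmful_prompts, variants_per_prompt=5,
                                   refusal="I can't help with that request."):
    """Augment a safety fine-tuning set with shuffled harmful prompts paired with refusal targets."""
    examples = []
    for prompt in harmful_prompts:
        words = prompt.split()
        for _ in range(variants_per_prompt):
            shuffled = words[:]
            random.shuffle(shuffled)
            examples.append({"input": " ".join(shuffled), "target": refusal})
    return examples
```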

© 2025 Promptfoo. All rights reserved.