LMVD-ID: 58998651
Published February 1, 2025

Multimodal Distraction Jailbreak

Affected Models: gpt-4o-mini, gpt-4o, gpt-4v, gemini-1.5-flash

Research Paper

Distraction is All You Need for Multimodal Large Language Model Jailbreaking


Description: Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack leveraging a "Distraction Hypothesis". The attack, termed Contrasting Subimage Distraction Jailbreaking (CS-DJ), bypasses safety mechanisms by using multiple contrasting subimages and a decomposed harmful prompt to overwhelm the model's attention and reduce its ability to identify malicious content. The complexity of the visual input, rather than its specific content, is the key to successful exploitation.
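
To make the structure concrete, the following minimal sketch illustrates the two ingredients named above: a composite image tiled from unrelated, visually contrasting subimages, and a query split into sub-questions. The grid layout, helper names, and splitting heuristic are illustrative assumptions, not the exact CS-DJ procedure; see the referenced paper for the actual construction.

```python
# Illustrative sketch of the distraction structure described above.
# The 2x2 grid, helper names, and word-count split are assumptions for
# illustration only; they are not the paper's exact CS-DJ pipeline.
from PIL import Image

def compose_contrasting_subimages(paths, tile_size=(256, 256)):
    """Tile four unrelated, visually contrasting images into one composite."""
    tiles = [Image.open(p).convert("RGB").resize(tile_size) for p in paths[:4]]
    w, h = tile_size
    canvas = Image.new("RGB", (2 * w, 2 * h))
    for i, tile in enumerate(tiles):
        canvas.paste(tile, ((i % 2) * w, (i // 2) * h))
    return canvas

def decompose_query(query, n_parts=3):
    """Split a query into shorter fragments so no single turn carries the full request."""
    words = query.split()
    step = max(1, len(words) // n_parts)
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
```

Note that neither function touches harmful content; per the Distraction Hypothesis, the attack's leverage comes from the structural complexity of the composite image and the fragmentation of the request, not from what the subimages depict.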

Examples: See arXiv:2405.18540 for examples demonstrating successful jailbreaks using CS-DJ against GPT-4o-Mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash. The paper provides concrete examples across various scenarios including violence, financial fraud, privacy violation, self-harm, and animal abuse. Specific examples are too lengthy to include here.

Impact: Successful exploitation allows attackers to bypass MLLM safety restrictions and elicit harmful or inappropriate responses from the models. This can result in the generation of dangerous instructions, malicious code, or personally identifiable information. The high success rates reported (average 52.4% ASR and 74.1% EASR) indicate substantial risk.

Affected Systems: All MLLMs susceptible to distraction attacks based on the complexity of visual inputs. This includes, but is not limited to, the models explicitly tested in the referenced research: GPT-4o-Mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash. Potentially, any MLLM employing similar safety mechanisms based on prompt and image alignment could be affected.

Mitigation Steps:

  • Implement improved safety mechanisms that are robust to variations in input complexity rather than solely relying on semantic content analysis.
  • Develop more sophisticated detection systems that analyze the overall structure and complexity of both text and image inputs to identify potential distraction tactics (a heuristic sketch follows this list).
  • Investigate and implement defensive techniques that can better handle out-of-distribution (OOD) inputs.
  • Utilize more diverse and complex datasets during training to improve the model’s resilience to distraction attacks.
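
As a starting point for the detection idea above, the sketch below screens a request with two simple structural heuristics: brightness variance across an image grid (composites built from contrasting subimages tend to score high) and an unusually fragmented prompt. The specific measures and thresholds are illustrative assumptions, not a validated defense.

```python
# Heuristic pre-screen for distraction-style inputs. The complexity
# measures and thresholds below are illustrative assumptions only.
import numpy as np
from PIL import Image

def tile_intensity_variance(image, grid=3):
    """Variance of mean brightness across an NxN grid of tiles."""
    gray = np.asarray(image.convert("L"), dtype=np.float32)
    h, w = gray.shape
    means = [
        gray[r * h // grid:(r + 1) * h // grid,
             c * w // grid:(c + 1) * w // grid].mean()
        for r in range(grid) for c in range(grid)
    ]
    return float(np.var(means))

def looks_like_distraction_input(image, prompt,
                                 variance_threshold=900.0,
                                 max_sub_questions=4):
    """Flag requests pairing a visually fragmented image with a heavily decomposed prompt."""
    fragmented_image = tile_intensity_variance(image) > variance_threshold
    fragmented_prompt = prompt.count("?") > max_sub_questions
    return fragmented_image and fragmented_prompt
```

Flagged requests could be routed to stricter moderation or a secondary review model rather than rejected outright, since visually busy but benign inputs will also trip simple heuristics like these.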

© 2025 Promptfoo. All rights reserved.