LMVD-ID: 994d2081
Published February 1, 2025

Multimodal Flanking Jailbreak

Affected Models:gemini

Research Paper

From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs

View Paper

Description: A novel "Flanking Attack" exploits the vulnerability of multimodal LLMs (e.g., Google Gemini) to bypass content moderation filters by embedding adversarial prompts within a sequence of benign prompts. The attack leverages the LLM's processing of both audio and text, obfuscating harmful requests through contextualization and layering, thereby yielding policy-violating responses.

Examples: See the research paper "From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs" for detailed examples. Specifically, the paper demonstrates successful attacks across seven forbidden scenarios (e.g., eliciting instructions for illegal activities, generating harmful content) using carefully crafted audio prompts interwoven with innocuous questions. For example, a request for instructions on creating a fake ID is successfully elicited by embedding it within a sequence of requests like "how to bake a cake" and "how to tie a shoelace".

Impact: Successful exploitation allows attackers to circumvent content moderation and obtain responses that violate the LLM's usage policies, leading to the generation of illegal, harmful, or misleading content. It compromises the safety and reliability of the LLM.

Affected Systems: Multimodal LLMs susceptible to prompt injection attacks, particularly those processing audio input (e.g., Google Gemini). The vulnerability may be mitigated in future updates but is present in versions tested in the referenced research.

Mitigation Steps:

  • Implement more robust content moderation filters that go beyond keyword or simple pattern matching; incorporate advanced semantic analysis techniques to understand the intent and context of multi-modal inputs.
  • Develop defenses that are resilient to prompt obfuscation techniques. This could involve techniques such as analyzing the entire prompt sequence for potentially malicious intent rather than focusing solely on individual prompts.
  • Regularly update and improve the LLM's safety mechanisms to stay ahead of evolving attack methods. Include adversarial training to enhance robustness against such attacks.
  • Employ multiple layers of defense, combining keyword filtering with semantic analysis, context analysis, and potentially external monitoring.

© 2025 Promptfoo. All rights reserved.