LMVD-ID: b02f4c07
Published November 1, 2024

VLM RedTeaming Jailbreak

Affected Models: MiniGPT-4 (Vicuna-13B), LLaVA, InstructBLIP, Meta's Chameleon, GPT-4o

Research Paper

IDEATOR: Jailbreaking VLMs Using VLMs


Description: Large Vision-Language Models (VLMs) are vulnerable to a novel black-box jailbreak attack, IDEATOR, which leverages a separate VLM to generate malicious image-text pairs. The attacker VLM iteratively refines its prompts based on the target VLM's responses, bypassing safety mechanisms by generating contextually relevant and visually subtle malicious prompts.
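The attack is purely black-box: the attacker only needs query access to the target VLM. The sketch below illustrates what such an iterative refinement loop could look like. All names and interfaces (the attacker proposer, the text-to-image generator, the target query, the harmfulness judge, and the threshold) are illustrative assumptions, not the authors' actual IDEATOR implementation.

```python
from typing import Callable, Optional, Tuple

def ideator_style_attack(
    goal: str,
    attacker_propose: Callable[[str, list], Tuple[str, str]],   # attacker VLM (assumed interface)
    text_to_image: Callable[[str], bytes],                      # image generator (assumed)
    target_query: Callable[[bytes, str], str],                  # black-box target VLM (assumed)
    judge_score: Callable[[str, str], float],                   # harmfulness judge (assumed)
    max_iters: int = 10,
    success_threshold: float = 0.8,
) -> Optional[Tuple[str, bytes, str]]:
    """Iteratively refine a malicious image-text pair against a black-box target VLM."""
    history: list = []  # (text_prompt, image_prompt, target_response, score)

    for _ in range(max_iters):
        # 1. The attacker VLM proposes a jailbreak text prompt plus an image
        #    description, conditioned on the goal and on previous attempts.
        text_prompt, image_prompt = attacker_propose(goal, history)

        # 2. Render the image half of the pair with a text-to-image model.
        image = text_to_image(image_prompt)

        # 3. Query the black-box target VLM with the image-text pair.
        response = target_query(image, text_prompt)

        # 4. Score how far the response fulfils the harmful goal, and feed the
        #    result back so the attacker can refine its next attempt.
        score = judge_score(goal, response)
        history.append((text_prompt, image_prompt, response, score))

        if score >= success_threshold:
            return text_prompt, image, response  # jailbreak considered successful

    return None  # no successful jailbreak within the iteration budget
```

The key property this sketch captures is that each new image-text pair is conditioned on the target's previous refusals or partial answers, which is what lets the attacker converge on prompts that slip past the target's safety filters.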

Examples: See the IDEATOR paper, which details specific successful attacks against MiniGPT-4, LLaVA, InstructBLIP, and Meta's Chameleon, along with the generated image-text pairs.

Impact: Successful exploitation allows attackers to bypass built-in safety restrictions of VLMs, eliciting harmful outputs (e.g., instructions for illegal activities, hate speech, disinformation). The high success rate (94% against MiniGPT-4) and transferability across different VLMs highlight the severity of this vulnerability.

Affected Systems: Large Vision-Language Models (VLMs), including but not limited to MiniGPT-4, LLaVA, InstructBLIP, and Meta's Chameleon. Other VLMs employing similar architectures and safety mechanisms are likely affected.

Mitigation Steps:

  • Implement more robust safety mechanisms that are resistant to iterative adversarial attacks.
  • Develop detection methods for identifying and blocking malicious image-text pairs generated by techniques such as IDEATOR (a sketch of one possible input-screening gate follows this list).
  • Continue research into improving VLM robustness against adversarial attacks, and regularly evaluate and update safety mechanisms.
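One possible shape for the detection step above is an input-screening gate that scores the combined image-text pair with a multimodal moderation model before the VLM is allowed to answer. The sketch below is a minimal illustration under that assumption; `safety_classifier` and `vlm_generate` are placeholder callables, not a specific library API.

```python
from typing import Callable

def guarded_vlm_reply(
    image: bytes,
    text: str,
    safety_classifier: Callable[[bytes, str], float],  # returns a risk score in [0, 1] (assumed)
    vlm_generate: Callable[[bytes, str], str],          # the underlying VLM call (assumed)
    block_threshold: float = 0.5,
    refusal: str = "I can't help with that request.",
) -> str:
    """Screen an incoming image-text pair before letting the VLM answer it."""
    # IDEATOR-style pairs are designed to look benign in isolation, so the
    # moderation model should score both modalities together, not separately.
    risk = safety_classifier(image, text)
    if risk >= block_threshold:
        return refusal  # block the request instead of forwarding it to the VLM

    # The generated output could also be screened a second time here before
    # being returned to the user, for defense in depth.
    return vlm_generate(image, text)
```

Because the attack is iterative, a gate like this is most useful when combined with rate limiting or anomaly detection on repeated near-duplicate queries, so an attacker cannot simply keep refining until a prompt slips through.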
