VLM RedTeaming Jailbreak
Research Paper
IDEATOR: Jailbreaking VLMs Using VLMs
Description: Large Vision-Language Models (VLMs) are vulnerable to IDEATOR, a black-box jailbreak attack that uses a separate attacker VLM to generate malicious image-text pairs. The attacker VLM iteratively refines its prompts based on the target VLM's responses, producing contextually relevant and visually subtle inputs that bypass safety mechanisms.
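For illustration, the loop below sketches how such an attacker-VLM-in-the-loop refinement might be structured, based only on the description above; the callables (`attacker_vlm`, `text_to_image`, `target_vlm`, `judge`) are hypothetical placeholders, not the paper's actual implementation or any library API.

```python
# Illustrative sketch of an IDEATOR-style iterative attack loop.
# All model callables are passed in as hypothetical placeholders.

def ideator_style_attack(goal, attacker_vlm, text_to_image, target_vlm, judge,
                         max_iters=10, success_threshold=0.9):
    """Iteratively refine an image-text pair against a black-box target VLM."""
    history = []  # prior attempts fed back to the attacker VLM for refinement
    for _ in range(max_iters):
        # Attacker VLM proposes a text prompt plus an image description for the
        # goal, conditioned on how earlier attempts were refused or deflected.
        text_prompt, image_description = attacker_vlm(goal, history)

        # Render the (visually subtle) image that carries part of the intent.
        image = text_to_image(image_description)

        # Query the black-box target with the candidate image-text pair.
        response = target_vlm(image, text_prompt)

        # Score how fully the response satisfies the harmful goal.
        score = judge(goal, response)
        if score >= success_threshold:
            return text_prompt, image, response  # jailbreak succeeded

        history.append((text_prompt, image_description, response, score))
    return None  # no successful jailbreak within the query budget
```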
Examples: The IDEATOR paper documents successful attacks against MiniGPT-4, LLaVA, InstructBLIP, and Meta's Chameleon, including the generated image-text pairs used in each case.
Impact: Successful exploitation allows attackers to bypass built-in safety restrictions of VLMs, eliciting harmful outputs (e.g., instructions for illegal activities, hate speech, disinformation). The high success rate (94% against MiniGPT-4) and transferability across different VLMs highlight the severity of this vulnerability.
Affected Systems: Large Vision-Language Models (VLMs), including but not limited to MiniGPT-4, LLaVA, InstructBLIP, and Meta's Chameleon. Other VLMs employing similar architectures and safety mechanisms are likely affected.
Mitigation Steps:
- Implement more robust safety mechanisms that are resistant to iterative adversarial attacks.
- Develop detection methods for identifying and blocking malicious image-text pairs generated by techniques such as IDEATOR (a minimal filtering sketch follows this list).
- Continue research into improving the robustness of VLMs against adversarial attacks.
- Regularly evaluate and update safety mechanisms as new attack techniques emerge.
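As a rough illustration of the detection idea above, the sketch below gates incoming image-text pairs by applying a text-moderation classifier to both the prompt and an automatic caption of the image. `text_moderation` and `caption_image` are assumed components standing in for any policy classifier and captioner; this is not a complete defense, only one possible input-side check.

```python
# Sketch of an input-side gate that screens image-text pairs before a VLM
# answers them. `text_moderation` (returns a harm score in [0, 1]) and
# `caption_image` are assumed components, not specific products.

def should_block(image, text_prompt, text_moderation, caption_image,
                 threshold=0.5):
    """Return True if the request looks like a malicious image-text pair."""
    # 1. Moderate the raw text prompt on its own.
    if text_moderation(text_prompt) >= threshold:
        return True

    # 2. Caption the image and moderate caption + prompt together, since
    #    IDEATOR-style attacks can split harmful intent across modalities.
    caption = caption_image(image)
    return text_moderation(f"{caption}\n{text_prompt}") >= threshold
```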