VLM RedTeaming Jailbreak
Research Paper
IDEATOR: Jailbreaking VLMs Using VLMs
Description: Large Vision-Language Models (VLMs) are vulnerable to IDEATOR, a black-box jailbreak attack that uses a separate attacker VLM to generate malicious image-text pairs. The attacker VLM iteratively refines its prompts based on the target VLM's responses, producing contextually relevant and visually subtle inputs that bypass safety mechanisms.
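For illustration, the loop below sketches how such an attacker-VLM-in-the-loop refinement might be structured, based only on the description above; the callables (`attacker_vlm`, `text_to_image`, `target_vlm`, `judge`) are hypothetical placeholders, not the paper's actual implementation or any library API.

```python
# Illustrative sketch of an IDEATOR-style iterative attack loop.
# All model callables are passed in as hypothetical placeholders.

def ideator_style_attack(goal, attacker_vlm, text_to_image, target_vlm, judge,
                         max_iters=10, success_threshold=0.9):
    """Iteratively refine an image-text pair against a black-box target VLM."""
    history = []  # prior attempts fed back to the attacker VLM for refinement
    for _ in range(max_iters):
        # Attacker VLM proposes a text prompt plus an image description for the
        # goal, conditioned on how earlier attempts were refused or deflected.
        text_prompt, image_description = attacker_vlm(goal, history)

        # Render the (visually subtle) image that carries part of the intent.
        image = text_to_image(image_description)

        # Query the black-box target with the candidate image-text pair.
        response = target_vlm(image, text_prompt)

        # Score how fully the response satisfies the harmful goal.
        score = judge(goal, response)
        if score >= success_threshold:
            return text_prompt, image, response  # jailbreak succeeded

        history.append((text_prompt, image_description, response, score))
    return None  # no successful jailbreak within the query budget
```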
Examples: The IDEATOR paper documents successful attacks against MiniGPT-4, LLaVA, InstructBLIP, and Meta's Chameleon, including the generated image-text pairs used in each case.
Impact: Successful exploitation allows attackers to bypass built-in safety restrictions of VLMs, eliciting harmful outputs (e.g., instructions for illegal activities, hate speech, disinformation). The high success rate (94% against MiniGPT-4) and transferability across different VLMs highlight the severity of this vulnerability.
Affected Systems: Large Vision-Language Models (VLMs), including but not limited to MiniGPT-4, LLaVA, InstructBLIP, and Meta's Chameleon. Other VLMs employing similar architectures and safety mechanisms are likely affected.
Mitigation Steps:
- Implement more robust safety mechanisms that are resistant to iterative adversarial attacks.
- Develop detection methods for identifying and blocking malicious image-text pairs generated by techniques such as IDEATOR (a minimal filtering sketch follows this list).
- Continue research into improving the robustness of VLMs against adversarial attacks.
- Regularly evaluate and update safety mechanisms as new attack techniques emerge.
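As a rough illustration of the detection idea above, the sketch below gates incoming image-text pairs by applying a text-moderation classifier to both the prompt and an automatic caption of the image. `text_moderation` and `caption_image` are assumed components standing in for any policy classifier and captioner; this is not a complete defense, only one possible input-side check.

```python
# Sketch of an input-side gate that screens image-text pairs before a VLM
# answers them. `text_moderation` (returns a harm score in [0, 1]) and
# `caption_image` are assumed components, not specific products.

def should_block(image, text_prompt, text_moderation, caption_image,
                 threshold=0.5):
    """Return True if the request looks like a malicious image-text pair."""
    # 1. Moderate the raw text prompt on its own.
    if text_moderation(text_prompt) >= threshold:
        return True

    # 2. Caption the image and moderate caption + prompt together, since
    #    IDEATOR-style attacks can split harmful intent across modalities.
    caption = caption_image(image)
    return text_moderation(f"{caption}\n{text_prompt}") >= threshold
```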