Bimodal Black-Box Jailbreak
Research Paper
BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs
Description: A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model's output as measured by a toxicity detection model (Perspective API or Detoxify).
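The loop below is a minimal sketch of the general query-and-score procedure such a bimodal black-box attack follows: query the target model, score the response with a toxicity classifier, and keep whichever image or text perturbation raises the score. It is not the paper's exact algorithm; the `query_lvlm` placeholder, the random-search perturbation scheme, the candidate-suffix list, and all hyperparameters are illustrative assumptions.

```python
# Hedged sketch of an iterative black-box bimodal jailbreak loop.
# The target LVLM is opaque: the attacker only sees generated text
# and its toxicity score (Detoxify is used here as the scorer).
import random

import numpy as np
from detoxify import Detoxify

toxicity_model = Detoxify("original")


def toxicity(text: str) -> float:
    """Maximum toxicity score across Detoxify's categories."""
    return max(toxicity_model.predict(text).values())


def query_lvlm(image: np.ndarray, prompt: str) -> str:
    """Placeholder for the black-box target LVLM (an API call in practice)."""
    raise NotImplementedError("Replace with a call to the target model's API.")


def jailbreak_loop(image: np.ndarray, prompt: str, suffixes: list[str],
                   rounds: int = 50, eps: float = 8 / 255) -> tuple[np.ndarray, str]:
    """Alternately perturb image and text, keeping changes that raise toxicity."""
    best_image, best_prompt = image.copy(), prompt
    best_score = toxicity(query_lvlm(best_image, best_prompt))

    for _ in range(rounds):
        # Visual step: random perturbation within an L_inf budget of eps.
        candidate_image = np.clip(
            best_image + np.random.uniform(-eps, eps, image.shape), 0.0, 1.0)
        score = toxicity(query_lvlm(candidate_image, best_prompt))
        if score > best_score:
            best_image, best_score = candidate_image, score

        # Textual step: try appending a candidate adversarial suffix.
        candidate_prompt = best_prompt + " " + random.choice(suffixes)
        score = toxicity(query_lvlm(best_image, candidate_prompt))
        if score > best_score:
            best_prompt, best_score = candidate_prompt, score

    return best_image, best_prompt
```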
Examples: See the paper "BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs". Appendix C of the paper provides specific examples of successful jailbreaks across multiple open- and closed-source LVLMs, using a variety of prompts aimed at eliciting harmful content.
Impact: This vulnerability allows attackers with black-box access to LVLMs to bypass safety measures and generate toxic or harmful content, potentially leading to the spread of misinformation, hate speech, harmful instructions, or other malicious output. The ability to manipulate LVLMs in this way significantly undermines their trustworthiness and reliability.
Affected Systems: Open and closed-source Large Vision-Language Models (LVLMs), including but not limited to MiniGPT-4, InstructBLIP, LLaVA, Gemini, GPT-4, and Qwen-VL. The attack's success rate varies across different models.
Mitigation Steps:
- Improved Toxicity Detection: Enhance the robustness and accuracy of the toxicity detection models used with LVLMs so they better identify and filter malicious inputs and toxic outputs (a minimal output-screening sketch follows this list).
- Adversarial Training: Train LVLMs on adversarial examples generated using techniques similar to PBI-Attack to increase their resilience to such attacks.
- Input Sanitization: Implement more rigorous input sanitization techniques to detect and remove or neutralize malicious features embedded in both text and images (a basic sanitization sketch follows this list).
- Multi-Modal Defense Mechanisms: Develop multi-modal defense mechanisms that specifically address the coordinated manipulation of both textual and visual inputs.
- Rate Limiting: Implement rate limiting to mitigate brute-force query campaigns aimed at finding optimal adversarial perturbations (a simple per-client limiter is sketched after this list).
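For the toxicity-detection mitigation, the following is a minimal sketch of output-side screening using the open-source Detoxify classifier; the threshold and refusal message are illustrative choices, not a recommended configuration.

```python
# Screen LVLM responses with a toxicity classifier before returning them.
from detoxify import Detoxify

_scorer = Detoxify("original")
REFUSAL = "I can't help with that request."


def moderate_response(response: str, threshold: float = 0.5) -> str:
    """Return the LVLM response only if every toxicity score stays below threshold."""
    scores = _scorer.predict(response)
    if max(scores.values()) >= threshold:
        return REFUSAL
    return response
```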
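For input sanitization, one simple bimodal approach is to re-encode the incoming image at lower JPEG quality, which tends to disturb fine-grained adversarial perturbations, and to screen the text prompt with a toxicity classifier. The quality level and threshold below are illustrative assumptions; stronger purification methods exist.

```python
# Basic bimodal input sanitization: JPEG re-encoding for images,
# toxicity screening for prompts.
import io

from PIL import Image
from detoxify import Detoxify

_scorer = Detoxify("original")


def sanitize_image(image: Image.Image, quality: int = 50) -> Image.Image:
    """JPEG-compress and reload the image, discarding high-frequency noise."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")


def sanitize_prompt(prompt: str, threshold: float = 0.5) -> str | None:
    """Reject prompts whose toxicity score already exceeds the threshold."""
    if max(_scorer.predict(prompt).values()) >= threshold:
        return None
    return prompt
```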
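For rate limiting, a per-client sliding-window limiter is enough to raise the cost of the hundreds of scored queries an iterative black-box attack typically needs. The window size and request cap are illustrative values.

```python
# Per-client sliding-window rate limiter.
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._history: dict[str, deque] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record the request and report whether the client is within its quota."""
        now = time.monotonic()
        timestamps = self._history[client_id]
        # Drop requests that have fallen outside the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```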