Probabilistic Multimodal Jailbreak
Research Paper
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
Description: Multimodal Large Language Models (MLLMs) are vulnerable to Jailbreak-Probability-based Attacks (JPA). JPA uses a Jailbreak Probability Prediction Network (JPPN) to identify and optimize adversarial perturbations in input images, maximizing the probability of eliciting harmful responses from the MLLM, even with small perturbation bounds and few iterations. The attack optimizes the input image so that the hidden states it induces inside the MLLM yield a higher predicted jailbreak probability.
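A minimal sketch of how such a JPPN-guided perturbation search might look is given below. The helpers `mllm_hidden_states` (returning the MLLM's image-conditioned hidden states) and `jppn` (a trained predictor mapping hidden states to a jailbreak probability) are assumed interfaces, and the epsilon, step size, and iteration count are illustrative rather than the paper's exact settings.

```python
# Hypothetical sketch of a JPA-style optimization loop (PyTorch).
# `mllm_hidden_states` and `jppn` are assumed interfaces, not the paper's actual code.
import torch

def jpa_attack(image, prompt, mllm_hidden_states, jppn,
               epsilon=4/255, step_size=1/255, num_iters=8):
    """Search for a bounded image perturbation that maximizes the predicted jailbreak probability."""
    delta = torch.zeros_like(image, requires_grad=True)  # adversarial perturbation

    for _ in range(num_iters):
        hidden = mllm_hidden_states(image + delta, prompt)   # hidden states induced by the perturbed image
        jailbreak_prob = jppn(hidden)                        # JPPN's predicted jailbreak probability
        loss = -jailbreak_prob.mean()                        # maximize the probability = minimize its negative
        loss.backward()

        with torch.no_grad():
            delta += step_size * delta.grad.sign()            # signed-gradient (PGD-style) ascent step
            delta.clamp_(-epsilon, epsilon)                   # respect the perturbation bound
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep the perturbed image in the valid pixel range
        delta.grad.zero_()

    return (image + delta).detach()
```

The loop mirrors a standard projected-gradient attack, except the objective is the JPPN's predicted jailbreak probability rather than a task loss on the MLLM's output.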
Examples: See arXiv:2405.18540 for details and examples of successful attacks. The paper provides specific examples of adversarial images generated using JPA against multiple MLLMs.
Impact: Successful exploitation allows attackers to bypass MLLM safety mechanisms and elicit harmful or undesirable outputs (e.g., hate speech, discriminatory comments, instructions for illegal activities). The attack achieves high success rates (up to 79.92% against MiniGPT-4) even under a small perturbation bound of 4/255.
Affected Systems: Multimodal Large Language Models (MLLMs) including MiniGPT-4, InstructBLIP, Qwen-VL, InternLM-XComposer-VL, and DeepSeek-VL are susceptible; the vulnerability is likely present in other MLLMs as well.
Mitigation Steps:
- Jailbreak-Probability-based Finetuning (JPF): Fine-tune the MLLM to reduce the predicted jailbreak probability of adversarial images.
- Jailbreak-Probability-based Defensive Noise (JPDN): Add defensive noise to input images to reduce their jailbreak probability. This can be applied as a pre-processing step (see the sketch after this list).
- Improved Safety Mechanisms: Develop more robust safety mechanisms within MLLMs that withstand adversarial inputs and reduce the effectiveness of JPA; this requires deeper investigation into the vulnerabilities the attack exploits.
- Input Sanitization: Implement more rigorous input sanitization techniques to detect and mitigate adversarial perturbations before they reach the MLLM's core processing stages.
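A minimal sketch of JPDN-style defensive pre-processing is shown below, assuming the same hypothetical `mllm_hidden_states` and `jppn` interfaces as above; the noise budget and step schedule are illustrative, not the paper's exact settings.

```python
# Hypothetical sketch of JPDN-style defensive pre-processing (PyTorch).
# The defensive noise is optimized to *minimize* the predicted jailbreak probability.
import torch

def jpdn_preprocess(image, prompt, mllm_hidden_states, jppn,
                    epsilon=8/255, step_size=1/255, num_iters=8):
    """Add bounded defensive noise that minimizes the predicted jailbreak probability."""
    noise = torch.zeros_like(image, requires_grad=True)  # defensive noise

    for _ in range(num_iters):
        hidden = mllm_hidden_states(image + noise, prompt)
        jailbreak_prob = jppn(hidden)
        loss = jailbreak_prob.mean()                       # minimize the predicted jailbreak probability
        loss.backward()

        with torch.no_grad():
            noise -= step_size * noise.grad.sign()          # signed-gradient descent step
            noise.clamp_(-epsilon, epsilon)                 # keep the noise within its budget
            noise.copy_((image + noise).clamp(0, 1) - image)  # keep the image in the valid pixel range
        noise.grad.zero_()

    return (image + noise).detach()  # pass this sanitized image to the MLLM instead of the original
```

In deployment, this sanitized image would replace the raw user-supplied image before it reaches the MLLM's core processing stages.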