One-Image VLM Jailbreak
Research Paper
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
View PaperDescription: A data poisoning attack, termed ImgTrojan, allows adversaries to bypass safety mechanisms in Vision-Language Models (VLMs) by injecting a small number of maliciously crafted image-caption pairs into the training dataset. These poisoned pairs associate seemingly benign images with jailbreak prompts, causing the VLM to generate unsafe outputs when presented with the poisoned images at inference time. The attack's success rate is notably high even with a very low poison ratio (e.g., one poisoned image in 10,000).
Examples: The paper demonstrates the attack using LLaVA v1.5. Specific examples of poisoned image-caption pairs and resulting unsafe outputs are not directly provided in the paper, but are available in the associated dataset and experiments. See arXiv:2405.18540.
Impact: This vulnerability allows attackers to circumvent safety restrictions implemented in VLMs, potentially leading to the generation of harmful, illegal, or unethical content. The attack is stealthy, remaining undetected by common image-caption similarity filters and persisting even after retraining with clean data.
Affected Systems: Vision-Language Models (VLMs) trained using supervised instruction tuning on image-caption pairs, particularly those susceptible to data poisoning attacks. The paper specifically demonstrates the vulnerability in LLaVA v1.5, but similar architectures are likely affected.
Mitigation Steps:
- Robust data filtering: Implement more sophisticated data filtering techniques that can effectively detect and remove maliciously crafted image-caption pairs. The paper suggests a potentially computationally expensive method using a safety-aligned VLM.
- Improved model robustness: Develop more robust VLM architectures that are less susceptible to data poisoning attacks.
- Adversarial training: Incorporate adversarial training techniques into the VLM training process to enhance resilience to malicious inputs.
- Regular security audits: Conduct regular security audits of training datasets to detect and mitigate potential vulnerabilities.
© 2025 Promptfoo. All rights reserved.