Image-Based Safety Snowballing
Research Paper
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Description: A vulnerability exists in several Large Vision-Language Models (LVLMs): seemingly safe images, when combined with other safe images and prompts via the Safety Snowball Agent attack methodology, can trigger the generation of unsafe and harmful content. The attack exploits the models' universal reasoning abilities and a "safety snowball effect," in which an initial unsafe response leads to progressively more harmful outputs.
Examples: See the research paper's GitHub repository: https://github.com/gzcch/Safety_Snowball_Agent. The repository contains example prompts and images used to successfully jailbreak multiple LVLMs. Specific examples showing jailbreaks using images classified as "safe" by Google Cloud Vision are included within the paper.
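The snippet below is a minimal sketch (not taken from the paper or repository) of the kind of per-image screening referenced above, assuming the google-cloud-vision Python client and hypothetical file names. It illustrates why such checks miss this attack: each image is rated in isolation, while the harmful meaning only emerges when the LVLM reasons over the images and prompts together.

```python
# Illustrative sketch: per-image screening with Google Cloud Vision SafeSearch.
# Each image in a snowball attack can pass this check individually, because the
# harmful interpretation arises only from the combination of inputs.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def is_individually_safe(image_path: str) -> bool:
    """Return True if SafeSearch rates this single image as unlikely to be harmful."""
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    annotation = client.safe_search_detection(image=image).safe_search_annotation
    threshold = vision.Likelihood.LIKELY  # treat LIKELY or VERY_LIKELY as unsafe
    return all(
        score < threshold
        for score in (annotation.adult, annotation.violence, annotation.racy)
    )

# Hypothetical file names for illustration only.
for path in ["benign_image_1.jpg", "benign_image_2.jpg"]:
    print(path, "passes per-image screening:", is_individually_safe(path))
```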
Impact: Successful exploitation allows attackers to bypass LVLMs' safety filters using inputs that would not normally be flagged as unsafe, yielding harmful outputs such as hate speech, incitement of violence, or instructions for illegal activities, which could then be disseminated at scale.
Affected Systems: Multiple Large Vision-Language Models (LVLMs) including, but not limited to, GPT-4o, InternVL2, Qwen2-VL, and VILA. The vulnerability is likely present in other similar models.
Mitigation Steps:
- Improve safety filters to better detect and prevent the "safety snowball effect."
- Develop more robust methods for identifying and mitigating unintended relationships inferred by the model across multiple inputs.
- Implement multi-step reasoning in LVLMs so the model can self-correct and flag unsafe outputs (see the sketch after this list).
- Employ additional safeguards so that the model's tendency toward overinterpretation cannot be exploited.
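As a rough illustration of the self-correction idea above, the sketch below wraps generation in a conversation-level self-check: after the model produces a candidate answer, a second pass re-reads the entire multi-image conversation plus that answer and decides whether harmful content emerges from the combination of inputs. This is an assumption-laden sketch, not a vetted defense; the model identifier, prompts, and refusal message are placeholders, and it uses the OpenAI Python SDK only as an example client.

```python
# Illustrative mitigation sketch: a conversation-level self-check pass that
# reviews all prior turns (including image inputs) together with the draft
# answer before it is returned to the user.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed model identifier

SELF_CHECK_PROMPT = (
    "You are a safety reviewer. Considering ALL images and messages in this "
    "conversation together, plus the draft answer below, does the draft answer "
    "contain or enable harmful content? Reply with exactly SAFE or UNSAFE.\n\n"
    "Draft answer:\n{answer}"
)

def guarded_reply(messages: list[dict]) -> str:
    """Generate a reply, then run a conversation-level self-check before returning it."""
    draft = client.chat.completions.create(model=MODEL, messages=messages)
    answer = draft.choices[0].message.content or ""

    # Second pass: the reviewer sees the full multimodal history, not one input at a time.
    verdict = client.chat.completions.create(
        model=MODEL,
        messages=messages
        + [{"role": "user", "content": SELF_CHECK_PROMPT.format(answer=answer)}],
    ).choices[0].message.content or ""

    if "UNSAFE" in verdict.upper():
        return "I can't help with that."
    return answer
```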