Flowchart-based LVLM Jailbreak Attack

Description: FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.

Examples: See repository, specifically instances where flowcharts, generated from harmful queries within the Advbench dataset, successfully induce harmful outputs from models such as Gemini-1.5, Llava-Next, Qwen2-VL, and InternVL-2.5. A key example would be Figure 3 in the paper.

Impact: Enables attackers to bypass safety mechanisms of LVLMs, leading to the generation of harmful content, including but not limited to hate speech, instructions for illegal activities, or the sharing of misinformation, thus exposing end-users to potentially illegal or harmful content.

Affected Systems: Large Vision-Language Models (LVLMs), specifically:

Gemini-1.5
Llava-Next
Qwen2-VL
InternVL-2.5,
GPT-4o mini
GPT-4o
Claude-3.5 (The degree of impact can vary based on model and the specific flowcharts used as part of the prompt attack).

Mitigation Steps:

Utilize AdaShield-A or similar defenses that employ adaptive shield prompting, but be aware of the potential for reduced model utility.
Improve detection of textual information in visual content, as current detection methods are shown to not be effective.
Consider the impact that different font styles have on the ASR.

Flowchart-based LVLM Jailbreak Attack

Research Paper