Security issues in vision-language models
Text-to-image (T2I) models are vulnerable to metaphor-based adversarial prompts that bypass their safety filters. Crafted with LLMs, these prompts convey sensitive content indirectly, exploiting the model's ability to infer meaning from figurative language while evading explicit keyword filters and model-editing defenses.
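For red-teaming purposes, figurative paraphrases of this kind are straightforward to generate with an off-the-shelf LLM. The sketch below is a minimal illustration rather than the method from the original research; the model name, system prompt, and `rewrite_as_metaphor` helper are all assumptions.

```python
# Illustrative sketch only: using an LLM to produce a figurative paraphrase of
# a prompt under test. Model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_as_metaphor(prompt: str) -> str:
    """Ask an LLM to re-express a prompt in purely figurative language."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt using only metaphor and "
                        "figurative imagery, without any literal keywords."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# A defender can generate such paraphrases when probing a T2I filter:
print(rewrite_as_metaphor("<placeholder prompt under test>"))
```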
Multimodal Large Language Models (MLLMs) are vulnerable to Jailbreak-Probability-based Attacks (JPA). JPA uses a Jailbreak Probability Prediction Network (JPPN), which estimates from the MLLM's hidden states how likely an input is to elicit a harmful response, and optimizes adversarial perturbations on the input image to maximize that predicted jailbreak probability. The attack is effective even with small perturbation bounds and few optimization iterations.
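Structurally, this resembles a standard projected-gradient-ascent loop whose objective is a predicted jailbreak probability rather than a classification loss. A minimal sketch, assuming a differentiable `jppn` head over placeholder `mllm_hidden_states`; neither name comes from the paper's code.

```python
# Conceptual sketch of a JPA-style loop: perturb an input image so that the
# hidden states it induces score higher under a jailbreak-probability head.
# `mllm_hidden_states` and `jppn` are placeholder modules, not real APIs.
import torch

def jpa_style_perturb(image, mllm_hidden_states, jppn,
                      eps=8 / 255, alpha=1 / 255, steps=20):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        hidden = mllm_hidden_states(image + delta)  # forward pass through the MLLM
        p_jailbreak = jppn(hidden)                  # predicted jailbreak probability
        p_jailbreak.sum().backward()
        with torch.no_grad():
            # gradient ascent on the predicted probability, then projection
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)                           # L-inf budget
            delta.copy_((image + delta).clamp(0, 1) - image)  # valid pixel range
        delta.grad.zero_()
    return (image + delta).detach()
```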
FC-Attack jailbreaks Large Vision-Language Models (LVLMs) with automatically generated flowcharts whose step-by-step descriptions are derived or rephrased from a harmful query, combined with a benign textual prompt. The vulnerability lies in the models' susceptibility to visual prompts: harmful instructions embedded in a flowchart image bypass the models' safety alignment mechanisms.
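To make the input format concrete (this is not the authors' generation pipeline), a step-by-step flowchart image can be rendered from text labels with a library such as graphviz and paired with an innocuous instruction; the steps and prompt below are placeholders.

```python
# Illustrative only: render a step-by-step flowchart image from text labels.
import graphviz

def render_flowchart(steps, path="flowchart"):
    dot = graphviz.Digraph(format="png")
    for i, step in enumerate(steps):
        dot.node(str(i), f"Step {i + 1}: {step}")
        if i > 0:
            dot.edge(str(i - 1), str(i))
    return dot.render(path)  # writes flowchart.png (requires the graphviz binary)

image_path = render_flowchart(["<placeholder step>", "<placeholder step>"])
benign_prompt = "Please describe the process shown in this figure."
```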
A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model’s output as measured by a toxicity detection model (Perspective API or Detoxify).
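The outer loop can be pictured as a greedy black-box search that alternates image-side and text-side perturbations and keeps whichever candidate raises a toxicity score. The sketch below uses Detoxify as the scorer; `query_lvlm`, `perturb_image`, and `perturb_text` are placeholder callables, not components of the published attack.

```python
# Sketch of the outer black-box loop: alternately perturb the image and the
# text, keeping whichever candidate raises the toxicity score of the target
# model's reply.
from detoxify import Detoxify

scorer = Detoxify("original")

def toxicity(text: str) -> float:
    return float(scorer.predict(text)["toxicity"])

def pbi_style_search(image, prompt, query_lvlm, perturb_image, perturb_text, iters=50):
    best = toxicity(query_lvlm(image, prompt))
    for _ in range(iters):
        for cand_image, cand_prompt in (
            (perturb_image(image), prompt),   # image-side candidate
            (image, perturb_text(prompt)),    # text-side candidate
        ):
            score = toxicity(query_lvlm(cand_image, cand_prompt))
            if score > best:                  # greedy: keep only improvements
                image, prompt, best = cand_image, cand_prompt, score
    return image, prompt, best
```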
The Antelope attack exploits vulnerabilities in Text-to-Image (T2I) models' safety filters by crafting adversarial prompts. These prompts, while appearing benign, induce the generation of NSFW images by leveraging semantic similarity between harmless and harmful concepts. The attack involves replacing explicit terms in an original prompt with seemingly innocuous alternatives and appending carefully selected suffix tokens. This manipulation bypasses both text-based and image-based filters, generating sensitive content while maintaining a high degree of semantic alignment with the original intent to evade detection.
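The evasion constraint depends on the adversarial prompt staying semantically close to the original intent, which is commonly measured with CLIP text-embedding similarity. A minimal sketch of that check (not the Antelope implementation), using a Hugging Face CLIP model:

```python
# Minimal semantic-similarity check between an original prompt and a candidate
# substitute, using CLIP text embeddings. Model choice is an assumption.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_similarity(a: str, b: str) -> float:
    inputs = tokenizer([a, b], padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])             # cosine similarity in [-1, 1]

# A filter (or an attacker probing one) can threshold this score, e.g.
# text_similarity("original prompt", "candidate substitute") > 0.8
```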
Several Large Vision-Language Models (LVLMs) can be driven to generate unsafe and harmful content from inputs that appear safe: the Safety Snowball Agent methodology combines a seemingly safe image with additional safe images and prompts to trigger harmful outputs. The attack exploits the models' universal reasoning abilities and a "safety snowball effect," in which an initial unsafe response leads to progressively more harmful outputs.
Large Vision-Language Models (VLMs) are vulnerable to a novel black-box jailbreak attack, IDEATOR, which leverages a separate VLM to generate malicious image-text pairs. The attacker VLM iteratively refines its prompts based on the target VLM's responses, bypassing safety mechanisms by generating contextually relevant and visually subtle malicious prompts.
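At a high level this is an attacker-model refinement loop. The sketch below shows that control flow only; `attacker_vlm`, `generate_image`, `target_vlm`, and `judge` are hypothetical stand-ins for the attacker model, an image generator, the target model, and a harmfulness judge, not the paper's components.

```python
# Generic sketch of an IDEATOR-style refinement loop: the attacker model
# proposes image-text pairs and refines them using the target's responses.
def ideator_style_loop(goal, attacker_vlm, generate_image, target_vlm, judge,
                       max_turns=10):
    feedback = ""
    for _ in range(max_turns):
        # attacker proposes a text prompt plus a description of the image to use
        text_prompt, image_desc = attacker_vlm(goal, feedback)
        image = generate_image(image_desc)
        reply = target_vlm(image, text_prompt)
        if judge(reply):          # judge decides whether the reply is unsafe
            return text_prompt, image, reply
        feedback = reply          # refine the next attempt using the response
    return None
```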
Vision-Language Models (VLMs) are vulnerable to jailbreak attacks using carefully crafted adversarial images. Attackers can bypass safety mechanisms by generating images semantically aligned with harmful prompts, exploiting the fact that minimal cross-entropy loss during adversarial image optimization does not guarantee optimal attack effectiveness. The attack uses a multi-image collaborative approach, selecting images within a specific loss range to enhance the likelihood of successful jailbreaking.
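The distinctive step is candidate selection: rather than keeping only the minimum-loss image, the attack draws several images from a loss band and uses them together. A sketch of that selection step (the bounds and candidate structure are illustrative assumptions):

```python
# Sketch of the selection step: keep adversarial image candidates whose
# optimization loss falls inside a target band, rather than only the minimum.
def select_in_loss_band(candidates, low=0.1, high=0.5, k=3):
    """candidates: list of (image, loss) pairs produced during optimization."""
    in_band = [(img, loss) for img, loss in candidates if low <= loss <= high]
    in_band.sort(key=lambda pair: pair[1])
    return [img for img, _ in in_band[:k]]   # the k images used collaboratively
```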
A Chain-of-Jailbreak (CoJ) attack allows bypassing safety mechanisms in image generation models by iteratively editing images based on a sequence of sub-queries. The attack decomposes a malicious query into multiple, seemingly benign sub-queries, each causing the model to generate and modify an image, ultimately producing harmful content. Successful attacks leverage various editing operations (insert, delete, change) on different elements (words, characters, images).
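The control flow reduces to a loop in which each sub-query edits the output of the previous step. A minimal sketch, with `edit_image` as a placeholder for an image-generation/editing endpoint rather than any real API:

```python
# Sketch of the Chain-of-Jailbreak editing loop: each seemingly benign
# sub-query asks the model to edit the image produced by the previous step.
def chain_of_edits(sub_queries, edit_image, initial_image=None):
    image = initial_image
    for query in sub_queries:    # e.g. insert / delete / change operations
        image = edit_image(image, query)
    return image                 # the final composition of all edits
```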