LMVD-ID: e0817b76
Published July 1, 2025

LVLM Vision-Triggered Resource Exhaustion

Affected Models: llava, qwen-vl, instructblip, qwen3b, qwen7b, qwen32b, llava7b, llava13b, blip7b, blip13b, llava-1.5-hf, qwen/qwen2.5-vl-instruct, instructblip-vicuna, meta-llama

Research Paper

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models


Description: A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft an imperceptible adversarial perturbation and apply it to an input image. When an LVLM processes this image, even alongside a benign text prompt, the model is forced into an unbounded generation loop. The attack, named RECALLED, uses gradient-based optimization to create a visual perturbation that steers the model's text generation toward a predefined repetitive sequence (an "Output Recall" target). The model then repeats a word or sentence until it hits its maximum context limit, producing a denial-of-service condition through excessive computational resource usage and response latency.
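
As a concrete illustration of the optimization described above, the following is a minimal PGD-style sketch against a Hugging Face-style LVLM interface. It assumes the prompt has already been tokenized with any required image placeholder tokens; the function name, argument names, and hyperparameter defaults are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def craft_recall_perturbation(model, pixel_values, input_ids, target_ids,
                              epsilon=8 / 255, step_size=1 / 255, steps=500):
    """PGD-style search for an L-infinity-bounded image perturbation that
    maximizes the likelihood of a repetitive target continuation."""
    delta = torch.zeros_like(pixel_values, requires_grad=True)
    # Teacher-force the repetitive target after the benign prompt.
    full_ids = torch.cat([input_ids, target_ids], dim=1)
    for _ in range(steps):
        outputs = model(input_ids=full_ids, pixel_values=pixel_values + delta)
        # With prompt length n and target length m, the logits at positions
        # n-1 .. n+m-2 are the ones that predict the m target tokens.
        logits = outputs.logits[:, input_ids.size(1) - 1:-1, :]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids.reshape(-1))
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # descend the loss
            delta.clamp_(-epsilon, epsilon)          # stay imperceptible
            delta.grad.zero_()
    return (pixel_values + delta).detach()
```

The loop teacher-forces the repetitive target after the benign prompt and follows the signed gradient of the cross-entropy loss, projecting the perturbation back into a small L-infinity ball so the modified image remains visually indistinguishable from the original.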

Examples: An attacker provides a subtly perturbed image of flowers to a vulnerable model with a benign prompt. The model begins to answer correctly before entering a generation loop.

Model: LLaVA-7B
Prompt: What is in the image?
Model Output:

The image features a close-up of a vibrant bouquet of flowers. The flowers are a mix of pink and white, and they are arranged in a beautiful and artistic way. The flowers are in full bloom, and their petals are soft and delicate. The flowers are the main focus of the image, and they are captured in a way that highlights their beauty and elegance. The flowers are a symbol of love, beauty, and happiness. flowers. flowers. flowers. flowers. flowers. flowers. flowers. flowers. flowers. flowers. flowers. flowers. ... (repeats "flowers." until the maximum generation length is reached)

More examples can be found in Figure 8 and Figure 9 of the source paper (arXiv:2405.18540).

Impact: The vulnerability allows for a potent resource consumption attack, leading to Denial of Service (DoS). A single malicious request can force the model to generate outputs up to its maximum context length (e.g., 2048+ tokens).

  • Latency: Service response latency increases by over 26x on average. For example, LLaVA-7B response time increased from 1.75s to 96.08s.
  • Resource Exhaustion: GPU utilization increases by up to 9.96 percentage points (e.g., from 47.5% to 57.5% on Qwen-3B) and memory consumption by up to 9.65 percentage points. Sustained attacks can degrade service for all users and increase operational costs.
  • Stealth: The attack is highly covert as the image perturbation is imperceptible to humans, and the accompanying text prompt is benign.

Affected Systems: The following models were confirmed to be vulnerable in the paper:

  • LLaVA-1.5 (7B, 13B)
  • Qwen2.5-VL-Instruct (3B, 7B, 32B)
  • InstructBLIP (7B, 13B)

Other Large Vision-Language Models with similar architectures that align visual features with a language model's embedding space are likely susceptible.

Mitigation Steps: The paper proposes a dynamic output-level defense mechanism to disrupt the repetitive generation patterns:

  • Implement a sliding window (e.g., size W) to monitor the generated token sequence in real-time.
  • Within the window, count the frequency (f_max) of continuously repeating token segments of a given length k.
  • If a highly frequent segment is detected, apply a scaling penalty to the logits (l) of the tokens in that segment to reduce their generation probability. The proposed update rule is l' = l * (1 - α * f_max), where α is a scaling factor.
  • This dynamic penalty can suppress the attack loop without relying on prior knowledge of the attack; a minimal sketch of the windowed check appears below. Note that aggressive tuning of this defense may degrade legitimate outputs that are intentionally repetitive.
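
For concreteness, here is one way the windowed penalty could be wired into a decoding loop, assuming access to the raw logits at each step. The parameter names mirror the bullets above (window W, segment length k, scaling factor α); the segment-counting strategy and the floor on the scaling factor are illustrative assumptions rather than the paper's reference implementation.

```python
import torch

def repetition_penalized_logits(logits, generated_ids,
                                window=64, segment_len=3, alpha=0.05):
    """Apply l' = l * (1 - alpha * f_max) to tokens belonging to the most
    frequent length-`segment_len` segment in the last `window` tokens."""
    recent = list(generated_ids[-window:])
    if len(recent) < segment_len:
        return logits
    # Count every contiguous length-k segment inside the sliding window.
    counts = {}
    for i in range(len(recent) - segment_len + 1):
        seg = tuple(recent[i:i + segment_len])
        counts[seg] = counts.get(seg, 0) + 1
    seg, f_max = max(counts.items(), key=lambda kv: kv[1])
    if f_max <= 1:
        return logits  # no repeating segment detected
    # Floor the factor at 0 so a very frequent segment cannot flip the
    # sign of its logits (an illustrative safeguard, not from the paper).
    factor = max(0.0, 1.0 - alpha * f_max)
    penalized = logits.clone()
    for tok in set(seg):
        penalized[tok] = penalized[tok] * factor
    return penalized
```

In a greedy or sampling decoder this function would be applied to the final-position logits before selecting the next token. Because the penalty keys off observed repetition rather than a known trigger, it needs no prior knowledge of the adversarial perturbation.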
