Reliability Vulnerabilities

Issues affecting model consistency and dependability

Related Vulnerabilities

21 entries

LVLM Vision-Triggered Resource Exhaustion

7/28/2025

A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft a subtle, imperceptible adversarial perturbation and apply it to an input image. When this image is processed by an LVLM, even with a benign text prompt, it forces the model into an unbounded generation loop. The attack, named RECALLED, uses a gradient-based optimization process to create a visual perturbation that steers the model's text generation towards a predefined, repetitive sequence (an "Output Recall" target). This causes the model to generate text that repeats a word or sentence until the maximum context limit is reached, leading to a denial-of-service condition through excessive computational resource usage and response latency.

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

Affects: llava, qwen-vl, instructblip, qwen3b, qwen7b, qwen32b, llava7b, llava13b, blip7b, blip13b, llava-1.5-hf, qwen/qwen2.5-vl-instruct, instructblip-vicuna, meta-llama

Adaptive Cipher Jailbreak

7/14/2025

Large Language Models (LLMs) are vulnerable to obfuscation-based jailbreak attacks using the MetaCipher framework. MetaCipher employs a reinforcement learning algorithm to iteratively select from a pool of 21 ciphers to encrypt malicious keywords within prompts, evading standard safety mechanisms that rely on keyword detection. The framework adaptively learns optimal cipher choices to maximize the success rate of the jailbreak, even against LLMs with reasoning capabilities. Successful attacks bypass safety guardrails, leading to the execution of malicious requests masked as benign input.

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

Affects: falcon3-10b-instruct, internlm2.5-20b-chat, llama3.3-70b-instruct, qwen2.5-72b-instruct, claude-3.7-sonnet-20250209, deepseek-chat, gemini-2.0-flash-001, gpt-4o-2024-11-20, qwq-32b, deepseekreasoner (r1), gemini-2.5-pro-exp-03-25, o1-mini-2024-09-12

Multi-Agent Prompt Permutation Attack

4/12/2025

A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.

$ extit {Agents Under Siege} $: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

Affects: llama-2-7b, llama-3.1-8b, mistral-7b, gemma-2-9b, deepseek-r1-distilled, llama-guard-7b, llama-guard-2-8b, llama-guard-3-8b, llama-guard-3-1b, prompt-guard-86m

Cat-Triggered Reasoning Error

3/19/2025

Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.

Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

Affects: deepseek v3, deepseek r1, deepseek r1-distill-qwen-32b, openai o1, openai o3-mini

LLM Fuzz-Based Jailbreak

3/19/2025

Large Language Models (LLMs) are vulnerable to jailbreak attacks by crafted prompts that bypass safety mechanisms, causing the model to generate harmful or unethical content. This vulnerability stems from the inherent tension between the LLM's instruction-following and safety constraints. The JBFuzz technique demonstrates the ability to efficiently and effectively discover such prompts through a fuzzing-based approach leveraging novel seed prompt templates and a synonym-based mutation strategy.

JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing

Affects: gpt-3.5-turbo-0125, gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18, deepseek-chat, deepseek-reasoner, llama-2-7b-chat-hf, meta-llama-3.1-8b-instruct, gemini-2.0-flash-exp, gemini-1.5-flash

LLM Judge Adversarial Vulnerability

3/19/2025

Large Language Model (LLM) safety judges exhibit vulnerability to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes to model outputs, such as altering the formatting or tone, can significantly impact a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying even 100% of harmful generations as safe. This vulnerability impacts the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.

Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges

Affects: llama-2 13b, llama guard 3 8b, wildguard, shieldgemma 9b, mistral 7b, llama-3.1 8b, atla selene mini 8b

Distilled Jailbreak Prompt Generator

3/4/2025

The Knowledge-Distilled Attacker (KDA) model, when used to generate prompts for large language models (LLMs), can bypass LLM safety mechanisms resulting in the generation of harmful, inappropriate, or misaligned content. KDA's effectiveness stems from its ability to generate diverse and coherent attack prompts efficiently, surpassing existing methods in attack success rate and speed. The vulnerability lies in the LLMs' insufficient defenses against the diverse prompt generation strategies learned and employed by KDA.

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

Affects: llama-2-7b-chat, llama-2-13b-chat, vicuna-7b, vicuna-13b, qwen-7b-chat, qwen-14b-chat, mistral-7b, gpt-3.5-turbo, gpt-4-turbo, claude2.1, gpt-3.5, gpt-4

Flowchart-based LVLM Jailbreak Attack

3/19/2025

FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.

FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts

Affects: gemini-1.5 pro, llava-next, qwen2-vl, internvl-2.5, gpt-4o mini, gpt-4o, claude-3.5 sonnet, mistral 7b

LLM Lower Layer Freeze Jailbreak

3/19/2025

A vulnerability exists in Large Language Models (LLMs) that allows for efficient jailbreaking by selectively fine-tuning only the lower layers of the model with a toxic dataset. This "Freeze Training" method, as described in the research paper, concentrates the fine-tuning on layers identified as being highly sensitive to the generation of harmful content. This approach significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate.

Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content

Affects: qwen2.5-7b-instruct, glm4, llama3.1, mistral, baichuan2, deepseek-r1-abliterated, qwen2.5, llama3.1-8b-instruct, baichuan2-7b-chat, glm-4-9b-chat-hf, mistral-8b-instruct-2410

Alignment-Based LLM Jailbreak

1/26/2025

Large Language Models (LLMs) employing reinforcement learning from human feedback (RLHF) for safety alignment are vulnerable to a novel "alignment-based" jailbreak attack. This attack leverages a best-of-N sampling approach with an adversarial LLM to efficiently generate prompts that bypass safety mechanisms and elicit unsafe responses from the target LLM, without requiring additional training or access to the target LLM's internal parameters. The attack exploits the inherent tension between safety and unsafe reward signals, effectively misaligning the model via alignment techniques.

LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds

Affects: gpt-2, gpt-2-pmc, gpt-2-wikitext, gpt-2-openinstruct, megatron-345m, tinyllama1.1b, vicuna-7b, vicuna-13b, llama-2, llama-3, llama-3.1 7b, llama-3.1 8b, mistral-7b, falcon-7b, pythia-12b

Page 1 of 3