Injection Vulnerabilities

Attacks that inject malicious content into model inputs

Related Vulnerabilities

102 entries

Hybrid LLM Jailbreak Strategy

7/14/2025

A hybrid jailbreak attack, combining gradient-guided token optimization (GCG) with iterative prompt refinement (PAIR or WordGame+), bypasses LLM safety mechanisms resulting in the generation of disallowed content. The hybrid approach leverages the strengths of both techniques, circumventing defenses effective against single-mode attacks. Specifically, the combination of semantically crafted prompts and strategically placed adversarial tokens confuse and overwhelm existing defenses.

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Affects: vicuna-7b, llama 2, llama 3, gpt-3.5 (gpt-3.5-turbo), gpt-4 (gpt-4-0314), claude-1, claude-2, mistral-sorry-bench, meta-llama-guard-2-8b, deepseek-r1:70b, llama-3-8b, vicuna-7b-v1.5, llama2-7b

Staged LLM Pipeline Attack

7/14/2025

Large language models (LLMs) protected by multi-stage safeguard pipelines (input and output classifiers) are vulnerable to staged adversarial attacks (STACK). STACK exploits weaknesses in individual components sequentially, combining jailbreaks for each classifier with a jailbreak for the underlying LLM to bypass the entire pipeline. Successful attacks achieve high attack success rates (ASR), even on datasets of particularly harmful queries.

STACK: Adversarial Attacks on LLM Safeguard Pipelines

Affects: claude 4 opus, qwen3-14b, gemma-2-9b, llama-3-8b-instruct, gpt-4-1106-preview, gpt-4o-2024-08-06

Stealthy Unlearning Degradation

6/30/2025

A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.

Keeping an eye on llm unlearning: The hidden risk and remedy

Affects: llama 3.1 (8b), mistral v0.3 (7b)

Agent Red-Teaming via Fuzzing

7/14/2025

Large Language Model (LLM) agents are vulnerable to indirect prompt injection attacks through manipulation of external data sources accessed during task execution. Attackers can embed malicious instructions within this external data, causing the LLM agent to perform unintended actions, such as navigating to arbitrary URLs or revealing sensitive information. The vulnerability stems from insufficient sanitization and validation of external data before it's processed by the LLM.

AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

Affects: o3-mini, gpt-4o, gpt-4o-mini, claude-3.5-sonnet, gemini-2-flash-exp, llama3-8b

Hidden Image Jailbreak

5/31/2025

Multimodal large language models (MLLMs) are vulnerable to implicit jailbreak attacks that leverage least significant bit (LSB) steganography to conceal malicious instructions within images. These instructions are coupled with seemingly benign image-related text prompts, causing the MLLM to execute the hidden malicious instructions. The attack bypasses existing safety mechanisms by exploiting cross-modal reasoning capabilities.

Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

Affects: gpt-4o, gemini 1.5 pro, qwen2.5-vl-72b, gpt-4.5, gemini2.5-pro, intervl2-8b

Informed Adversary LLM Jailbreak

5/31/2025

Large Language Models (LLMs) employing alignment-based defenses against prompt injection and jailbreak attacks exhibit vulnerability to an informed white-box attack. This attack, termed Checkpoint-GCG, leverages intermediate model checkpoints from the alignment training process to initialize the Greedy Coordinate Gradient (GCG) attack. By using each checkpoint as a stepping stone, Checkpoint-GCG successfully finds adversarial suffixes that bypass defenses achieving significantly higher attack success rates than standard GCG initialized with naive methods. This is particularly impactful as Checkpoint-GCG can discover universal adversarial suffixes effective across multiple inputs.

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Affects: llama3-8b-instruct, mistral-7b-instruct, gpt-3-turbo, llama-3, gpt-4o

LLM Judge Prompt Injection

5/31/2025

Large Language Models (LLMs) used for evaluating text quality (LLM-as-a-Judge architectures) are vulnerable to prompt-injection attacks. Maliciously crafted suffixes appended to input text can manipulate the LLM's judgment, causing it to incorrectly favor a predetermined response even if another response is objectively superior. Two attack vectors are identified: Comparative Undermining Attack (CUA), directly targeting the final decision, and Justification Manipulation Attack (JMA), altering the model's generated reasoning.

Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks

Affects: qwen2.5-3b-instruct, falcon3-3b-instruct

Steganographic LLM Jailbreak

5/31/2025

A steganographic jailbreak attack, termed StegoAttack, allows bypassing safety mechanisms in Large Language Models (LLMs) by embedding malicious queries within benign-appearing text. The attack hides the malicious query in the first word of each sentence of a seemingly innocuous paragraph, leveraging the LLM's autoregressive generation to process and respond to the hidden query, even when employing encryption in the response.

When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques

Affects: gpt-o3, llama 4, deepseek-r1, qwq-32b

LLM Guardrail Evasion

4/21/2025

Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.

Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails

Affects: deberta-v3-base, mdeberta-v3-base, gpt-4-mini

Multi-Accent Audio Jailbreak

4/12/2025

Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.

Multilingual and Multi-Accent Jailbreaking of Audio LLMs

Affects: qwen2-audio, diva-llama-3-v0-8b, meralion-audiollm-whisper-sea-lion, minicpm-o-2.6, ultravox-v0-4.1-llama-3.1-8b

Page 1 of 11