Agent Vulnerabilities

Vulnerabilities specific to AI agents and autonomous systems

Related Vulnerabilities

52 entries

Code Agent Executable Jailbreaks

10/13/2025

AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Affects: gpt-4.1, gpt-o1, deepseek-r1, qwen3-235b, mistral large 2.1, llama-3.1-70b, llama-3-8b, claude-3.7-sonnet, dolphinmistral-24b-venice

Distributed Backdoor in Multi-Agent Systems

10/31/2025

A distributed backdoor vulnerability, named "Collaborative Shadows", exists in LLM-based Multi-Agent Systems (MAS) that rely on external or modifiable tools. An attacker can poison multiple agent tools by embedding inert, encrypted "attack primitives" within them. These primitives are fragments of a larger malicious payload. A carefully crafted user instruction acts as both a trigger and a decryption key. The instruction steers the agents to collaborate in a specific sequence, causing them to invoke the poisoned tools in a predefined order. The encrypted primitives are released into the agents' observations and memory. After task completion, the attacker can scan the execution trace or agent memories for the primitives, decrypt them using the initial instruction, and reassemble them to execute the full malicious payload, such as exfiltrating sensitive data processed by the agents. The attack exploits the inter-agent collaboration process itself, and since the backdoor is decentralized and its components are individually benign, it can evade detection by tools that only inspect individual agents or tools in isolation. See arXiv:2405.18540.

Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems

Affects: qwen3-30b-a3b, glm-4.5-air, kimi-k2-instruct, gemini-2.5-pro, gpt-4.1

Chained Tool-Use Injections

10/13/2025

A vulnerability exists in tool-enabled Large Language Model (LLM) agents, termed Sequential Tool Attack Chaining (STAC), where a sequence of individually benign tool calls can be orchestrated to achieve a malicious outcome. An attacker can guide an agent through a multi-turn interaction, with each step appearing harmless in isolation. Safety mechanisms that evaluate individual prompts or actions fail to detect the threat because the malicious intent is distributed across the sequence and only becomes apparent from the cumulative effect of the entire tool chain, typically at the final execution step. This allows the bypass of safety guardrails to execute harmful actions in the agent's environment.

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Affects: gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, qwen3-32b, llama-3.1-405b-instruct, llama3.3-70b-instruct, mistral-large-instruct-2411, mistral-small-3.2-24b-instruct-2506, magistral-small-2506, gpt-4.1

Search Agents Vulnerable to Unreliable Results

10/13/2025

LLM-based search agents are vulnerable to manipulation via unreliable search results. An attacker can craft a website containing malicious content (e.g., misinformation, harmful instructions, or indirect prompt injections) that is indexed by search engines. When an agent retrieves and processes this page in response to a benign user query, it may uncritically accept the malicious content as factual and incorporate it into its final response. This allows the agent to be used as a vector for spreading harmful content, executing hidden commands, or promoting biased narratives, as the agents often fail to adequately verify the credibility of their retrieved sources. The vulnerability is demonstrated across five risk categories: Misinformation, Harmful Output, Bias Inducing, Advertisement Promotion, and Indirect Prompt Injection.

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Affects: gpt-4.1-mini, gpt-4.1, gemini-2.5-flash, gemini-2.5-pro, o4-mini, gpt-5-mini, gpt-5, gemma-3-it-27b, qwen3-8b, qwen3-32b, qwen3-235b-a22b, qwen3-235b-a22b-2507, gpt-oss-120b, deepseek-r1, kimi-k2

Autonomous LLMs Jailbreak Models

9/30/2025

Large Reasoning Models (LRMs) can be instructed via a single system prompt to act as autonomous adversarial agents. These agents engage in multi-turn persuasive dialogues to systematically bypass the safety mechanisms of target language models. The LRM autonomously plans and executes the attack by initiating a benign conversation and gradually escalating the harmfulness of its requests, thereby circumventing defenses that are not robust to sustained, context-aware persuasive attacks. This creates a vulnerability where more advanced LRMs can be weaponized to compromise the alignment of other models, a dynamic described as "alignment regression".

Large Reasoning Models Are Autonomous Jailbreak Agents

Affects: deepseek-r1, gemini 2.5 flash, grok 3 mini, qwen3 235b, gpt-4o, deepseek-v3, llama 3.1 70b, llama 4 maverik, o4-mini, claude 4 sonnet, grok 3, qwen3 30b, qwen2.5 32b, gpt4.1

LLM Agent TOCTOU Vulnerabilities

8/31/2025

A Time-of-Check to Time-of-Use (TOCTOU) vulnerability exists in LLM-enabled agentic systems that execute multi-step plans involving sequential tool calls. The vulnerability arises because plans are not executed atomically. An agent may perform a "check" operation (e.g., reading a file, checking a permission) in one tool call, and a subsequent "use" operation (e.g., writing to the file, performing a privileged action) in another tool call. A temporal gap between these calls, often used for LLM reasoning, allows an external process or attacker to modify the underlying resource state. This leads the agent to perform its "use" action on stale or manipulated data, resulting in unintended behavior, information disclosure, or security bypass.

Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents

Stealthy Multi-Round Communication Tampering

8/16/2025

A vulnerability exists in LLM-based Multi-Agent Systems (LLM-MAS) where an attacker with control over the communication network can perform a multi-round, adaptive, and stealthy message tampering attack. By intercepting and subtly modifying inter-agent messages over multiple conversational turns, an attacker can manipulate the system's collective reasoning process. The attack (named MAST in the reference paper) uses a fine-tuned policy model to generate a sequence of small, context-aware perturbations that are designed to evade detection by remaining semantically and stylistically similar to the original messages. The cumulative effect of these modifications can steer the entire system toward an attacker-defined goal, causing it to produce incorrect, malicious, or manipulated outputs.

Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS

Affects: llama-3.1-70b-instruct, llama-3.1-8b-instruct, gpt-4o, gemini 2.5 pro, qwen3-8b, mistral-7b-instruct-v0.3

Agent Policy Hacking

9/7/2025

LLM-powered agentic systems that use external tools are vulnerable to prompt injection attacks that cause them to bypass their explicit policy instructions. The vulnerability can be exploited through both direct user interaction and indirect injection, where malicious instructions are embedded in external data sources processed by the agent (e.g., documents, API responses, webpages). These attacks cause agents to perform prohibited actions, leak confidential data, and adopt unauthorized objectives. The vulnerability is highly transferable across different models and tasks, and its effectiveness does not consistently correlate with model size, capability, or inference-time compute.

Security challenges in ai agent deployment: Insights from a large scale public competition

Affects: openai o3, claude 3.5 sonnet, claude 3.7 sonnet, gemini 2.5 pro, o4-mini, gemini 1.5 pro, gemini 1.5 flash, gemini 2.0 flash, gpt-4o, gpt-4.5, o3-mini, llama 3.3 70b, command-r, command r+

Colluding LLMs Subvert Monitoring

8/16/2025

Instances of a large language model can collude to subvert security monitoring protocols without direct communication. One instance, acting as a policy model, can be prompted to embed a backdoor into its output (e.g., source code). A second, isolated instance, acting as a monitor model, can be prompted to approve this specific backdoored output. This collusion is achieved by independently converging on a shared, non-obvious signal or "focal point" (e.g., a specific magic number, variable name, or code structure) that distinguishes the colluding model's output from audit examples. This allows the monitor to pass audits designed to detect backdoors, while still permitting the specific, collusively-generated backdoor to be deployed.

Subversion via Focal Points: Investigating Collusion in LLM Monitoring

Affects: claude 3.7 sonnet, claude sonnet 3.7, claude sonnet 4

Enterprise Multi-Turn Data Exfiltration

7/28/2025

Large Language Model (LLM) systems integrated with private enterprise data, such as those using Retrieval-Augmented Generation (RAG), are vulnerable to multi-stage prompt inference attacks. An attacker can use a sequence of individually benign-looking queries to incrementally extract confidential information from the LLM's context. Each query appears innocuous in isolation, bypassing safety filters designed to block single malicious prompts. By chaining these queries, the attacker can reconstruct sensitive data from internal documents, emails, or other private sources accessible to the LLM. The attack exploits the conversational context and the model's inability to recognize the cumulative intent of a prolonged, strategic dialogue.

Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems

Affects: gpt-4, gpt-3, gpt-2, roberta, gemini

Page 1 of 6