Vulnerabilities specific to AI agents and autonomous systems
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.
LLM agents utilizing external tools are vulnerable to indirect prompt injection (IPI) attacks. Attackers can embed malicious instructions into the external data accessed by the agent, manipulating its behavior even when defenses against direct prompt injection are in place. Adaptive attacks, which modify the injected payload based on the specific defense mechanism, consistently bypass existing defenses with a success rate exceeding 50%.
A vulnerability exists in Large Language Model (LLM) agents that allows attackers to manipulate the agent's reasoning process through the insertion of strategically placed adversarial strings. This allows attackers to induce the agent to perform unintended malicious actions or invoke specific malicious tools, even when the initial prompt or instruction is benign. The attack exploits the agent's reliance on chain-of-thought reasoning and dynamically optimizes the adversarial string to maximize the likelihood of the agent incorporating malicious actions into its reasoning path.
Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that exploit incremental policy erosion. The attacker uses a breadth-first search strategy to generate multiple prompts at each turn, leveraging partial compliance from previous responses to gradually escalate the conversation towards eliciting disallowed outputs. Minor concessions accumulate, ultimately leading to complete circumvention of safety measures.
A vulnerability exists in large language models (LLMs) where insufficient sanitization of system prompts allows attackers to extract sensitive information embedded within those prompts. Attackers can use an agentic approach, employing multiple interacting LLMs (as demonstrated in the referenced research), to iteratively refine prompts and elicit confidential data from the target LLM's responses. The vulnerability is exacerbated by the LLM's ability to infer context from seemingly innocuous prompts.
FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.
CVE-2024-XXXX
Large Language Model (LLM) tool-calling systems are vulnerable to adversarial tool injection attacks. Attackers can inject malicious tools ("Manipulator Tools") into the tool platform, manipulating the LLM's tool selection and execution process. This allows for privacy theft (extracting user queries), denial-of-service (DoS) attacks against legitimate tools, and unscheduled tool-calling (forcing the use of attacker-specified tools regardless of relevance). The attack exploits vulnerabilities in the tool retrieval mechanism and the LLM's decision-making process. Successful attacks require the malicious tool to be (1) retrieved by the system, (2) selected for execution by the LLM, and (3) its output to manipulate subsequent LLM actions.
Large Audio-Language Models (LALMs) are vulnerable to a stealthy adversarial jailbreak attack, AdvWave, which leverages a dual-phase optimization to overcome gradient shattering caused by audio discretization. The attack crafts adversarial audio by adding perceptually realistic environmental noise, making it difficult to detect. The attack also dynamically adapts the adversarial target based on the LALM's response patterns.
A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model’s output as measured by a toxicity detection model (Perspective API or Detoxify).
© 2025 Promptfoo. All rights reserved.