Large Language Models (LLMs) are vulnerable to a novel class of jailbreak attacks generated through the evolutionary synthesis of executable, code-based attack algorithms. Unlike traditional methods that refine or combine static prompts, this technique uses an automated multi-agent system (EvoSynth) to autonomously engineer and evolve the underlying code that generates the attack. These generated algorithms exhibit high structural and dynamic complexity, using features like control flow, state management, and multi-layer obfuscation to create highly evasive prompts. The attack's success against robust models correlates with the programmatic complexity of the generating algorithm (e.g., Abstract Syntax Tree node count and calls to external tools), demonstrating a vulnerability to procedurally generated narratives that current safety mechanisms do not effectively detect.
Related Vulnerabilities
AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.
A distributed backdoor vulnerability, named "Collaborative Shadows", exists in LLM-based Multi-Agent Systems (MAS) that rely on external or modifiable tools. An attacker can poison multiple agent tools by embedding inert, encrypted "attack primitives" within them. These primitives are fragments of a larger malicious payload. A carefully crafted user instruction acts as both a trigger and a decryption key. The instruction steers the agents to collaborate in a specific sequence, causing them to invoke the poisoned tools in a predefined order. The encrypted primitives are released into the agents' observations and memory. After task completion, the attacker can scan the execution trace or agent memories for the primitives, decrypt them using the initial instruction, and reassemble them to execute the full malicious payload, such as exfiltrating sensitive data processed by the agents. The attack exploits the inter-agent collaboration process itself, and since the backdoor is decentralized and its components are individually benign, it can evade detection by tools that only inspect individual agents or tools in isolation. See arXiv:2405.18540.
Large Language Models (LLMs) that use special tokens to define conversational structure (e.g., via chat templates) are vulnerable to a jailbreak attack named MetaBreak. An attacker can inject these special tokens, or regular tokens with high semantic similarity in the embedding space, into a user prompt. This manipulation allows the attacker to bypass the model's internal safety alignment and external content moderation systems. The attack leverages four primitives:
- Response Injection: Forging an assistant's turn within the user prompt to trick the model into believing it has already started to provide an affirmative response.
- Turn Masking: Using a few-shot, word-by-word construction to make the injected response resilient to disruption from platform-added chat template wrappers.
- Input Segmentation: Splitting sensitive keywords with injected special tokens to evade detection by content moderators, which may fail to reconstruct the original term, while the more capable target LLM can.
- Semantic Mimicry: Bypassing special token sanitization defenses by substituting them with regular tokens that have a minimal L2 norm distance in the embedding space, thereby preserving the token's structural function.
A vulnerability exists in tool-enabled Large Language Model (LLM) agents, termed Sequential Tool Attack Chaining (STAC), where a sequence of individually benign tool calls can be orchestrated to achieve a malicious outcome. An attacker can guide an agent through a multi-turn interaction, with each step appearing harmless in isolation. Safety mechanisms that evaluate individual prompts or actions fail to detect the threat because the malicious intent is distributed across the sequence and only becomes apparent from the cumulative effect of the entire tool chain, typically at the final execution step. This allows the bypass of safety guardrails to execute harmful actions in the agent's environment.
A zero-click indirect prompt injection vulnerability, CVE-2025-32711, existed in Microsoft 365 Copilot. A remote, unauthenticated attacker could exfiltrate sensitive data from a victim's session by sending a crafted email. When Copilot later processed this email as part of a user's query, hidden instructions caused it to retrieve sensitive data from the user's context (e.g., other emails, documents) and embed it into a URL. The attack chain involved bypassing Microsoft's XPIA prompt injection classifier, evading link redaction filters using reference-style Markdown, and abusing a trusted Microsoft Teams proxy domain to bypass the client-side Content Security Policy (CSP), resulting in automatic data exfiltration without any user interaction.
LLM-based search agents are vulnerable to manipulation via unreliable search results. An attacker can craft a website containing malicious content (e.g., misinformation, harmful instructions, or indirect prompt injections) that is indexed by search engines. When an agent retrieves and processes this page in response to a benign user query, it may uncritically accept the malicious content as factual and incorporate it into its final response. This allows the agent to be used as a vector for spreading harmful content, executing hidden commands, or promoting biased narratives, as the agents often fail to adequately verify the credibility of their retrieved sources. The vulnerability is demonstrated across five risk categories: Misinformation, Harmful Output, Bias Inducing, Advertisement Promotion, and Indirect Prompt Injection.
A Time-of-Check to Time-of-Use (TOCTOU) vulnerability exists in LLM-enabled agentic systems that execute multi-step plans involving sequential tool calls. The vulnerability arises because plans are not executed atomically. An agent may perform a "check" operation (e.g., reading a file, checking a permission) in one tool call, and a subsequent "use" operation (e.g., writing to the file, performing a privileged action) in another tool call. A temporal gap between these calls, often used for LLM reasoning, allows an external process or attacker to modify the underlying resource state. This leads the agent to perform its "use" action on stale or manipulated data, resulting in unintended behavior, information disclosure, or security bypass.
Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.
A vulnerability exists in LLM-based Multi-Agent Systems (LLM-MAS) where an attacker with control over the communication network can perform a multi-round, adaptive, and stealthy message tampering attack. By intercepting and subtly modifying inter-agent messages over multiple conversational turns, an attacker can manipulate the system's collective reasoning process. The attack (named MAST in the reference paper) uses a fine-tuned policy model to generate a sequence of small, context-aware perturbations that are designed to evade detection by remaining semantically and stylistically similar to the original messages. The cumulative effect of these modifications can steer the entire system toward an attacker-defined goal, causing it to produce incorrect, malicious, or manipulated outputs.
© 2025 Promptfoo. All rights reserved.