Attacks that inject malicious content into model inputs
AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.
A vulnerability exists in certain safety-aligned Large Language Models (LLMs) due to an untargeted, gradient-based optimization attack method called Untargeted Jailbreak Attack (UJA). Unlike previous targeted attacks (e.g., GCG) that optimize a prompt to elicit a predefined string (e.g., "Sure, here is..."), UJA optimizes for a general objective: maximizing the unsafety probability of the model's response, as quantified by an external judge model.
unsafety
Large Language Models from multiple vendors are vulnerable to a "Camouflaged Jailbreak" attack. Malicious instructions are embedded within seemingly benign, technically complex prompts, often framed as system design or engineering problems. The models fail to recognize the harmful intent implied by the context and technical specifications, bypassing safety filters that rely on detecting explicit keywords. This leads to the generation of detailed, technically plausible instructions for creating dangerous devices or systems. The attack has a high success rate, with models demonstrating full obedience in over 94% of tested cases, treating the harmful requests as legitimate.
A zero-click indirect prompt injection vulnerability, CVE-2025-32711, existed in Microsoft 365 Copilot. A remote, unauthenticated attacker could exfiltrate sensitive data from a victim's session by sending a crafted email. When Copilot later processed this email as part of a user's query, hidden instructions caused it to retrieve sensitive data from the user's context (e.g., other emails, documents) and embed it into a URL. The attack chain involved bypassing Microsoft's XPIA prompt injection classifier, evading link redaction filters using reference-style Markdown, and abusing a trusted Microsoft Teams proxy domain to bypass the client-side Content Security Policy (CSP), resulting in automatic data exfiltration without any user interaction.
A vulnerability exists in multiple Large Language Models (LLMs) where an attacker can bypass safety alignments by exploiting the model's ethical reasoning capabilities. The attack, named TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), frames a harmful request within a multi-turn ethical dilemma modeled on the trolley problem. The harmful action is presented as the "lesser of two evils" necessary to prevent a catastrophic outcome, compelling the model to engage in utilitarian justification. This creates a conflict between the model's deontological safety rules (e.g., "do not generate harmful content") and the consequentialist logic of the scenario. Through a series of iterative, context-aware queries, the attacker progressively reinforces the model's commitment to the harmful path, leading it to generate content it would normally refuse. The vulnerability is paradoxically more effective against models with more advanced reasoning abilities.
A vulnerability exists in multiple Large Language Models (LLMs) where safety alignment mechanisms can be bypassed by reframing harmful instructions as "learning-style" or academic questions. This technique, named Hiding Intention by Learning from LLMs (HILL), transforms direct, harmful requests into exploratory questions using simple hypotheticality indicators (e.g., "for academic curiosity", "in the movie") and detail-oriented inquiries (e.g., "provide a step-by-step breakdown"). The attack exploits the models' inherent helpfulness and their training on academic and explanatory text, causing them to generate harmful content that they would otherwise refuse.
A vulnerability exists in aligned Large Language Models (LLMs) where a harmful instruction can be obfuscated through a multi-step formalization process, bypassing safety mechanisms. The attack, named Prompt Jailbreaking via Semantic and Structural Formalization (PASS), uses a Reinforcement Learning (RL) agent to dynamically construct an adversarial prompt. The agent learns to apply a sequence of actions—such as symbolic abstraction, logical encoding, mathematical representation, metaphorical transformation, and strategic decomposition—to an initial harmful query. This iterative process transforms the query into a representation that is semantically equivalent in intent but structurally unrecognizable to the model's safety filters, resulting in the generation of prohibited content. The attack is adaptive and does not rely on fixed templates.
LLM-based search agents are vulnerable to manipulation via unreliable search results. An attacker can craft a website containing malicious content (e.g., misinformation, harmful instructions, or indirect prompt injections) that is indexed by search engines. When an agent retrieves and processes this page in response to a benign user query, it may uncritically accept the malicious content as factual and incorporate it into its final response. This allows the agent to be used as a vector for spreading harmful content, executing hidden commands, or promoting biased narratives, as the agents often fail to adequately verify the credibility of their retrieved sources. The vulnerability is demonstrated across five risk categories: Misinformation, Harmful Output, Bias Inducing, Advertisement Promotion, and Indirect Prompt Injection.
Large Reasoning Models (LRMs) can be instructed via a single system prompt to act as autonomous adversarial agents. These agents engage in multi-turn persuasive dialogues to systematically bypass the safety mechanisms of target language models. The LRM autonomously plans and executes the attack by initiating a benign conversation and gradually escalating the harmfulness of its requests, thereby circumventing defenses that are not robust to sustained, context-aware persuasive attacks. This creates a vulnerability where more advanced LRMs can be weaponized to compromise the alignment of other models, a dynamic described as "alignment regression".
Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.
developer
© 2025 Promptfoo. All rights reserved.