Vulnerabilities in prompt handling and processing
A vulnerability exists in multiple Large Language Models (LLMs) that allows safety alignment to be bypassed through a technique named Activation-Guided Local Editing (AGILE). The attack uses white-box access to a source model's internal states (activations and attention scores) to craft a transferable text-based prompt that elicits harmful content.
Large Language Models (LLMs) are vulnerable to automated adversarial attacks that systematically combine multiple jailbreaking "primitives" into complex prompt chains. A dynamic optimization engine can generate and test billions of unique combinations of techniques (e.g., low-resource language translation, payload splitting, role-playing) to bypass safety guardrails. This combinatorial approach differs from manual red-teaming by systematically exploring the attack surface, achieving near-universal success in eliciting harmful content. The vulnerability lies in the models' inability to maintain safety alignment when faced with a sequence of layered obfuscation and manipulation techniques.
A Time-of-Check to Time-of-Use (TOCTOU) vulnerability exists in LLM-enabled agentic systems that execute multi-step plans involving sequential tool calls. The vulnerability arises because plans are not executed atomically. An agent may perform a "check" operation (e.g., reading a file, checking a permission) in one tool call, and a subsequent "use" operation (e.g., writing to the file, performing a privileged action) in another tool call. A temporal gap between these calls, often used for LLM reasoning, allows an external process or attacker to modify the underlying resource state. This leads the agent to perform its "use" action on stale or manipulated data, resulting in unintended behavior, information disclosure, or security bypass.
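To make the check/use gap concrete, the following is a minimal, purely illustrative Python sketch of the pattern described above. The file name, the two tool functions, and the deploy() stub are hypothetical, and the sleep stands in for the agent's non-atomic reasoning step between tool calls; this is not any specific framework's code.

```python
import time

# Create a toy resource so the sketch runs end to end.
with open("service.yaml", "w") as f:
    f.write("allow_deploy: true\n")

def deploy(config: str) -> None:
    # Hypothetical privileged action performed by the "use" tool call.
    print("deploying with:", config.strip())

# Tool call 1 ("check"): the agent verifies a property of the resource.
def tool_check_config(path: str) -> bool:
    with open(path) as f:
        return "allow_deploy: true" in f.read()

# Tool call 2 ("use"): the agent acts as if the earlier check still holds.
def tool_apply_config(path: str) -> None:
    with open(path) as f:
        deploy(f.read())

if tool_check_config("service.yaml"):
    # The LLM reasoning step between tool calls happens here. Because the plan
    # is not executed atomically, another process can rewrite service.yaml in
    # this window, so the "use" below operates on stale or manipulated state.
    time.sleep(2)  # stand-in for the temporal gap between check and use
    tool_apply_config("service.yaml")
```

The remediation pattern is the usual one for TOCTOU: re-validate the resource and act on it in a single atomic operation at the moment of use, rather than trusting an earlier check.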
Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.
A universal prompt injection vulnerability, termed "Involuntary Jailbreak," affects multiple large language models. The attack uses a single prompt that instructs the model to learn a pattern from abstract string operators (X and Y). The model is then asked to generate its own examples of questions that should be refused (harmful questions) and provide detailed, non-refusal answers to them, in order to satisfy the learned operator logic. This reframes the generation of harmful content as a logical puzzle, causing the model to bypass its safety and alignment training. The vulnerability is untargeted, allowing it to elicit a wide spectrum of harmful content without the attacker specifying a malicious goal.
Large Language Models (LLMs) are vulnerable to a jailbreak attack termed Paper Summary Attack (PSA). An attacker can bypass safety alignment mechanisms by framing a harmful query within the context of a summarized academic paper. The model's propensity to trust the authoritative structure and tone of a research paper summary overrides its safety filters, leading it to process and respond to the embedded malicious instruction. The vulnerability is particularly potent when using summaries of papers on LLM safety itself (both attack- and defense-focused research), exposing significant and differing alignment biases across models.
LLM-powered agentic systems that use external tools are vulnerable to prompt injection attacks that cause them to bypass their explicit policy instructions. The vulnerability can be exploited through both direct user interaction and indirect injection, where malicious instructions are embedded in external data sources processed by the agent (e.g., documents, API responses, webpages). These attacks cause agents to perform prohibited actions, leak confidential data, and adopt unauthorized objectives. The vulnerability is highly transferable across different models and tasks, and its effectiveness does not consistently correlate with model size, capability, or inference-time compute.
A vulnerability exists in Diffusion-based Large Language Models (dLLMs) that allows for bypassing safety alignment mechanisms through interleaved mask-text prompts. The vulnerability stems from two core architectural features of dLLMs: bidirectional context modeling and parallel decoding. The model's drive to maintain contextual consistency forces it to fill masked tokens with content that aligns with the surrounding, potentially malicious, text. The parallel decoding process prevents dynamic content filtering or rejection sampling during generation, which are common defense mechanisms in autoregressive models. This allows an attacker to elicit harmful or policy-violating content by explicitly stating a malicious request and inserting mask tokens where the harmful output should be generated.
Large Language Model (LLM) systems integrated with private enterprise data, such as those using Retrieval-Augmented Generation (RAG), are vulnerable to multi-stage prompt inference attacks. An attacker can use a sequence of individually benign-looking queries to incrementally extract confidential information from the LLM's context. Each query appears innocuous in isolation, bypassing safety filters designed to block single malicious prompts. By chaining these queries, the attacker can reconstruct sensitive data from internal documents, emails, or other private sources accessible to the LLM. The attack exploits the conversational context and the model's inability to recognize the cumulative intent of a prolonged, strategic dialogue.
Large Language Models (LLMs) equipped with native code interpreters are vulnerable to Denial of Service (DoS) via resource exhaustion. An attacker can craft a single prompt that causes the interpreter to execute code that depletes CPU, memory, or disk resources. The vulnerability is particularly pronounced when a resource-intensive task is framed within a plausibly benign or socially engineered context ("indirect prompts"), which significantly lowers the model's likelihood of refusal compared to explicitly malicious requests.
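As a point of reference for the exhaustion surface described above, the sketch below shows the kind of per-execution cap a sandboxed interpreter can apply before running model-generated code; the absence of such hard bounds is what lets a single prompt deplete CPU or memory. The specific limits, timeout, and helper names are assumptions for illustration, not any vendor's implementation, and the rlimit calls assume a POSIX host.

```python
import resource
import subprocess
import sys

def _limit_child_resources():
    # Cap CPU seconds and address space for the interpreter child process.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024, 256 * 1024 * 1024))

def run_untrusted(code: str) -> subprocess.CompletedProcess:
    # Execute model-generated code in a separate process with hard limits,
    # plus a wall-clock timeout as a second line of defense.
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=_limit_child_resources,
        capture_output=True,
        timeout=10,
    )
```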