Vulnerabilities in model API implementations
Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.
developer
Large Language Models (LLMs) equipped with native code interpreters are vulnerable to Denial of Service (DoS) via resource exhaustion. An attacker can craft a single prompt that causes the interpreter to execute code that depletes CPU, memory, or disk resources. The vulnerability is particularly pronounced when a resource-intensive task is framed within a plausibly benign or socially-engineered context ("indirect prompts"), which significantly lowers the model's likelihood of refusal compared to explicitly malicious requests.
VERA, a variational inference framework, enables the generation of diverse and fluent adversarial prompts that bypass safety mechanisms in large language models (LLMs). The attacker model, trained through a variational objective, learns a distribution of prompts likely to elicit harmful responses, effectively jailbreaking the target LLM. This allows for the generation of novel attacks that are not based on pre-existing, manually crafted prompts.
A vulnerability exists in Large Language Models (LLMs) that allows attackers to manipulate the model's output by modifying token log probabilities. Attackers can use a lightweight plug-in model (BiasNet) to subtly alter the probabilities, steering the LLM toward generating harmful content even when safety mechanisms are in place. This attack requires only access to the top-k token log probabilities returned by the LLM's API, without needing model weights or internal access.
Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.
Large Language Models (LLMs) with structured output APIs (e.g., using JSON Schema) are vulnerable to Constrained Decoding Attacks (CDAs). CDAs exploit the control plane of the LLM's decoding process by embedding malicious intent within the schema-level grammar rules, bypassing safety mechanisms that primarily focus on input prompts. The attack manipulates the allowed output space, forcing the LLM to generate harmful content despite a benign input prompt. One instance of a CDA is the Chain Enum Attack, which leverages JSON Schema's enum feature to inject malicious options into the allowed output, achieving high success rates.
enum
FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.
Large Language Models (LLMs) with structured output interfaces are vulnerable to jailbreak attacks that exploit the interaction between token-level inference and sentence-level safety alignment. Attackers can manipulate the model's output by constructing attack patterns based on prefixes of safety refusal responses and desired harmful outputs, effectively bypassing safety mechanisms through iterative API calls and constrained decoding. This allows the generation of harmful content despite safety measures.
CVE-2024-XXXX
© 2025 Promptfoo. All rights reserved.