Attacks requiring no knowledge of model internals
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): generating substantial benign content shifts the model's attention from the input prompt to prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first generating benign content, then transitioning to malicious output.
A vulnerability exists in the combination of Large Language Models (LLMs) and their associated safety guardrails, allowing attackers to bypass both defenses and elicit harmful or unintended outputs from LLMs. The vulnerability stems from insufficient detection by guardrails against adversarially crafted prompts, which appear benign but contain hidden malicious intent. The attack, dubbed "DualBreach," leverages a target-driven initialization strategy and multi-target optimization to generate these prompts, effectively bypassing both the guardrail and LLM's internal safety mechanisms.
A vulnerability in Large Language Models (LLMs) allows attackers to bypass safety mechanisms and elicit detailed harmful responses by strategically manipulating input prompts. The vulnerability exploits the LLM's sensitivity to "scenario shifts"—contextual changes in the input that influence the model's output, even when the core malicious request remains the same. A genetic algorithm can optimize these scenario shifts, increasing the likelihood of obtaining detailed harmful responses while maintaining a seemingly benign facade.
Large Language Models (LLMs) employing alignment safeguards and safety mechanisms are vulnerable to graph-based adversarial attacks that bypass these protections. The attack, termed "Graph of Attacks" (GOAT), leverages a graph-based reasoning framework to iteratively refine prompts and exploit vulnerabilities more effectively than previous methods. The attack synthesizes information across multiple reasoning paths to generate human-interpretable prompts that elicit undesired or harmful outputs from the LLM, even without access to the model's internal parameters.
Large Language Models (LLMs) employing safety mechanisms are vulnerable to a graph-based attack that leverages semantic transformations of malicious prompts to bypass safety filters. The attack, termed GraphAttack, uses Abstract Meaning Representation (AMR), Resource Description Framework (RDF), and JSON knowledge graphs to represent malicious intent, systematically applying transformations to evade surface-level pattern recognition used by existing safety mechanisms. A particularly effective exploitation vector involves prompting the LLM to generate code based on the transformed semantic representation, bypassing intent-based safety filters.
Large Language Models (LLMs) are vulnerable to a jailbreaking attack leveraging humorous prompts. Embedding an unsafe request within a humorous context, using a fixed template, bypasses built-in safety mechanisms and elicits unsafe responses. The attack's success relies on a balance; too little or too much humor reduces effectiveness.
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
Multi-Agent Debate (MAD) frameworks leveraging Large Language Models (LLMs) are vulnerable to amplified jailbreak attacks. A novel structured prompt-rewriting technique exploits the iterative dialogue and role-playing dynamics of MAD, circumventing inherent safety mechanisms and significantly increasing the likelihood of generating harmful content. The attack succeeds by using narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation to guide agents towards progressively elaborating harmful responses.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.
© 2025 Promptfoo. All rights reserved.