Methods for bypassing model safety measures
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): generating substantial benign content shifts the model's attention from the input prompt to prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first generating benign content, then transitioning to malicious output.
Large Language Models (LLMs) are vulnerable to a jailbreaking attack leveraging humorous prompts. Embedding an unsafe request within a humorous context, using a fixed template, bypasses built-in safety mechanisms and elicits unsafe responses. The attack's success relies on a balance; too little or too much humor reduces effectiveness.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
LLM agents utilizing external tools are vulnerable to indirect prompt injection (IPI) attacks. Attackers can embed malicious instructions into the external data accessed by the agent, manipulating its behavior even when defenses against direct prompt injection are in place. Adaptive attacks, which modify the injected payload based on the specific defense mechanism, consistently bypass existing defenses with a success rate exceeding 50%.
A vulnerability exists in Large Language Model (LLM) agents that allows attackers to manipulate the agent's reasoning process through the insertion of strategically placed adversarial strings. This allows attackers to induce the agent to perform unintended malicious actions or invoke specific malicious tools, even when the initial prompt or instruction is benign. The attack exploits the agent's reliance on chain-of-thought reasoning and dynamically optimizes the adversarial string to maximize the likelihood of the agent incorporating malicious actions into its reasoning path.
Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that exploit incremental policy erosion. The attacker uses a breadth-first search strategy to generate multiple prompts at each turn, leveraging partial compliance from previous responses to gradually escalate the conversation towards eliciting disallowed outputs. Minor concessions accumulate, ultimately leading to complete circumvention of safety measures.
Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.
Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.
© 2025 Promptfoo. All rights reserved.