Vulnerabilities impacting model output reliability
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): as a model generates substantial benign content, its attention shifts from the input prompt to its own prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first prompting the model to produce benign content, then steering it toward malicious output.
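One way to observe this drift empirically is to track how much attention mass a model assigns to the original prompt at each decoding step. The sketch below is a minimal illustration using Hugging Face `transformers`; the model name, the benign prompt, and the averaging over layers and heads are assumptions for illustration, not the methodology of the SCP work.

```python
# Minimal sketch: measure the fraction of attention mass on the original prompt
# at each decoding step. Assumes a Hugging Face causal LM; model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes attentions works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

prompt = "Write a short, friendly story about a baker."
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        output_attentions=True,
        return_dict_in_generate=True,
    )

# out.attentions: one tuple per generated token, each holding per-layer tensors of
# shape (batch, heads, query_len, key_len). With caching, query_len is 1 after step 0.
for step, layers in enumerate(out.attentions):
    # Average attention over layers and heads for the newest generated token.
    attn = torch.stack([layer[0, :, -1, :] for layer in layers]).mean(dim=(0, 1))
    prompt_share = attn[:prompt_len].sum().item()
    print(f"step {step:3d}: attention mass on prompt = {prompt_share:.3f}")
```

A declining `prompt_share` over many steps is the kind of signal the DTD description points at; interpreting it rigorously requires the controls used in the original research.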
A vulnerability exists in the combination of Large Language Models (LLMs) and their associated safety guardrails, allowing attackers to bypass both layers of defense and elicit harmful or unintended outputs. It stems from insufficient guardrail detection of adversarially crafted prompts that appear benign but carry hidden malicious intent. The attack, dubbed "DualBreach," uses a target-driven initialization strategy and multi-target optimization to generate such prompts, bypassing both the external guardrail and the LLM's internal safety mechanisms.
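Because DualBreach targets the external guardrail and the model's internal alignment together, one generic mitigation direction is an independent output-side check rather than trusting the input filter alone. The sketch below shows only that defense-in-depth pattern; `input_guardrail`, `output_guardrail`, and `call_llm` are hypothetical placeholders, not components described in the research.

```python
# Generic defense-in-depth sketch: screen the prompt, generate, then screen the
# response with an independent check. All three callables are stand-ins for real
# moderation models, APIs, or inference endpoints.
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def input_guardrail(prompt: str) -> Verdict:
    # Stand-in: a real deployment would call a prompt-moderation model here.
    return Verdict(allowed=True)

def output_guardrail(response: str) -> Verdict:
    # Stand-in: an independent classifier applied to the generated text.
    return Verdict(allowed=True)

def call_llm(prompt: str) -> str:
    # Stand-in for the model call.
    return "placeholder response"

def guarded_generate(prompt: str) -> str:
    pre = input_guardrail(prompt)
    if not pre.allowed:
        return f"Request blocked: {pre.reason}"
    response = call_llm(prompt)
    post = output_guardrail(response)
    if not post.allowed:
        # Prompts that slip past the input check can still be caught when the
        # generated content itself is flagged.
        return f"Response withheld: {post.reason}"
    return response

if __name__ == "__main__":
    print(guarded_generate("Summarize this article for me."))
```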
A vulnerability in Large Language Models (LLMs) allows attackers to bypass safety mechanisms and elicit detailed harmful responses by strategically manipulating input prompts. The attack exploits the LLM's sensitivity to "scenario shifts": contextual changes in the input that influence the model's output even when the core malicious request stays the same. A genetic algorithm can optimize these scenario shifts, increasing the likelihood of detailed harmful responses while maintaining a seemingly benign facade.
Large Language Models (LLMs) employing alignment safeguards and safety mechanisms are vulnerable to graph-based adversarial attacks that bypass these protections. The attack, termed "Graph of Attacks" (GOAT), uses a graph-based reasoning framework to iteratively refine prompts and exploit weaknesses more effectively than previous methods. It synthesizes information across multiple reasoning paths to generate human-interpretable prompts that elicit undesired or harmful outputs from the LLM, even without access to the model's internal parameters.
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
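A related monitoring idea is to watch how strongly a layer's activations project onto a known steering direction at inference time. The sketch below is a generic hook-based probe, not the technique from the research: the model, the layer index, and the randomly initialized `steering_dir` are placeholders, and a real deployment would substitute a direction extracted from its own model.

```python
# Sketch: log how strongly a chosen layer's residual-stream activations project onto
# a precomputed steering direction (e.g., a refusal-compliance vector). The direction
# here is random purely to illustrate the hook mechanics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; real monitoring targets the deployed model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                                      # hypothetical layer of interest
steering_dir = torch.randn(model.config.hidden_size)
steering_dir = steering_dir / steering_dir.norm()  # unit-length placeholder vector

projections = []

def hook(module, inputs, output):
    hidden = output[0]                 # (batch, seq, hidden) residual stream output
    projections.append((hidden @ steering_dir).detach())

handle = model.transformer.h[layer_idx].register_forward_hook(hook)

with torch.no_grad():
    ids = tok("How do I reset my router password?", return_tensors="pt")
    model(**ids)

handle.remove()
print("per-token projection onto steering direction:", projections[0][0])
```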
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
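A common first-line mitigation for the character-injection vector is to normalize text before it reaches the classifier. The following is a minimal sketch; the zero-width and homoglyph tables are deliberately small illustrations, and a production system would rely on a maintained confusables dataset (e.g., Unicode TR39 data).

```python
# Defensive sketch: strip zero-width characters and map a few common homoglyphs to
# ASCII before the text is passed to a guardrail classifier.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0455": "s",  # Cyrillic ѕ
}

def normalize_for_guardrail(text: str) -> str:
    # Unicode normalization collapses many visually equivalent forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop zero-width characters outright.
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # Map known homoglyphs back to their ASCII counterparts.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

if __name__ == "__main__":
    evasive = "ign\u200bore previous instructi\u043ens"
    print(normalize_for_guardrail(evasive))  # "ignore previous instructions"
```

Normalization raises the bar for character injection but does not address the adversarial-perturbation (AML) variant, which evades classifiers while using ordinary characters.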
Multi-Agent Debate (MAD) frameworks leveraging Large Language Models (LLMs) are vulnerable to amplified jailbreak attacks. A novel structured prompt-rewriting technique exploits the iterative dialogue and role-playing dynamics of MAD, circumventing inherent safety mechanisms and significantly increasing the likelihood of generating harmful content. The attack succeeds by using narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation to guide agents towards progressively elaborating harmful responses.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
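Evaluating an LALM against this class of issue generally means re-running the same safety tests on acoustically perturbed versions of the audio. The sketch below adds a simple synthetic echo with NumPy; the delay and decay values are arbitrary assumptions, and a fuller evaluation would also cover reverberation profiles, whisper-style filtering, and multiple languages and accents.

```python
# Sketch: apply a synthetic echo to an evaluation clip so the same safety test can be
# re-run under acoustic perturbation. Delay/decay values are arbitrary examples.
import numpy as np

def add_echo(samples: np.ndarray, sample_rate: int,
             delay_s: float = 0.25, decay: float = 0.4) -> np.ndarray:
    """Mix a delayed, attenuated copy of the signal back into itself."""
    delay = int(delay_s * sample_rate)
    out = samples.astype(np.float32).copy()
    out[delay:] += decay * samples[:-delay].astype(np.float32)
    # Peak-normalize so downstream audio pipelines see a sane range.
    peak = float(np.max(np.abs(out))) or 1.0
    return out / peak

if __name__ == "__main__":
    sr = 16_000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    clip = np.sin(2 * np.pi * 440 * t)   # placeholder for a real evaluation clip
    perturbed = add_echo(clip, sr)
    print(perturbed.shape, perturbed.dtype)
```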
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
A vulnerability in several Large Language Models (LLMs) allows safety mechanisms to be bypassed through targeted noise injection. Explainable AI (XAI) techniques reveal the specific layers within the LLM architecture most responsible for content filtering; injecting noise into these or earlier layers circumvents safety restrictions, enabling the generation of harmful or otherwise prohibited outputs.