Vulnerabilities targeting the core model architecture and parameters
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): generating substantial benign content shifts the model's attention from the input prompt to prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first generating benign content, then transitioning to malicious output.
Large Language Models (LLMs) employing safety mechanisms are vulnerable to a graph-based attack that leverages semantic transformations of malicious prompts to bypass safety filters. The attack, termed GraphAttack, uses Abstract Meaning Representation (AMR), Resource Description Framework (RDF), and JSON knowledge graphs to represent malicious intent, systematically applying transformations to evade surface-level pattern recognition used by existing safety mechanisms. A particularly effective exploitation vector involves prompting the LLM to generate code based on the transformed semantic representation, bypassing intent-based safety filters.
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
A vulnerability in several Large Language Models (LLMs) allows bypassing safety mechanisms through targeted noise injection. Explainable AI (XAI) techniques reveal specific layers within the LLM architecture most responsible for content filtering. Injecting noise into these layers or preceding layers circumvents safety restrictions, enabling the generation of harmful or previously prohibited outputs.
Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.
Large Language Models (LLMs) employing gradient-based optimization for jailbreaking defense are vulnerable to enhanced transferability attacks due to superfluous constraints in their objective functions. Specifically, the "response pattern constraint" (forcing a specific initial response phrase) and the "token tail constraint" (penalizing variations in the response beyond a fixed prefix) limit the search space and reduce the effectiveness of attacks across different models. Removing these constraints significantly increases the success rate of attacks transferred to target models.
Large Language Model (LLM) safety judges exhibit vulnerability to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes to model outputs, such as altering the formatting or tone, can significantly impact a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying even 100% of harmful generations as safe. This vulnerability impacts the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.
A vulnerability exists in large language models (LLMs) where the model's internal representations (activations) in specific latent subspaces can be manipulated to trigger jailbreak responses. By calculating a perturbation vector based on the difference between the mean activations of "safe" and "jailbroken" states, an attacker can introduce a targeted perturbation to the model's activations, causing it to generate unsafe outputs even when presented with a safe prompt. This manipulates the model's state, causing it to transition from a safe to a jailbroken state. The success rate is context-dependent.
Large Language Models (LLMs) are vulnerable to a novel jailbreaking attack leveraging adversarial metaphors. The attack, termed AVATAR, induces the LLM to reason about benign metaphors related to harmful tasks, ultimately leading to the generation of harmful content either directly or through calibration of metaphorical and professional harmful content. The attack exploits the LLM's cognitive mapping process, bypassing standard safety mechanisms.
© 2025 Promptfoo. All rights reserved.