Attacks leveraging model implementation details
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
A vulnerability in several Large Language Models (LLMs) allows bypassing safety mechanisms through targeted noise injection. Explainable AI (XAI) techniques reveal specific layers within the LLM architecture most responsible for content filtering. Injecting noise into these layers or preceding layers circumvents safety restrictions, enabling the generation of harmful or previously prohibited outputs.
Large Language Models (LLMs) employing gradient-based optimization for jailbreaking defense are vulnerable to enhanced transferability attacks due to superfluous constraints in their objective functions. Specifically, the "response pattern constraint" (forcing a specific initial response phrase) and the "token tail constraint" (penalizing variations in the response beyond a fixed prefix) limit the search space and reduce the effectiveness of attacks across different models. Removing these constraints significantly increases the success rate of attacks transferred to target models.
A vulnerability exists in large language models (LLMs) where the model's internal representations (activations) in specific latent subspaces can be manipulated to trigger jailbreak responses. By calculating a perturbation vector based on the difference between the mean activations of "safe" and "jailbroken" states, an attacker can introduce a targeted perturbation to the model's activations, causing it to generate unsafe outputs even when presented with a safe prompt. This manipulates the model's state, causing it to transition from a safe to a jailbroken state. The success rate is context-dependent.
Multimodal Large Language Models (MLLMs) are vulnerable to Jailbreak-Probability-based Attacks (JPA). JPA leverages a Jailbreak Probability Prediction Network (JPPN) to identify and optimize adversarial perturbations in input images, maximizing the probability of eliciting harmful responses from the MLLM, even with small perturbation bounds and few iterations. The attack operates by modifying the input image's hidden states within the MLLM to increase the predicted jailbreak probability.
Description: A vulnerability in large language models (LLMs) allows attackers to bypass safety-alignment mechanisms by manipulating the model's internal attention weights. The attack, termed "Attention Eclipse," modifies the attention scores between specific tokens within a prompt, either amplifying or suppressing attention to selectively strengthen or weaken the influence of certain parts of the prompt on the model's output. This allows injection of malicious content while appearing benign to the model's safety filters.
CRI (Compliance Refusal Initialization) initializes jailbreak attacks by leveraging pre-trained jailbreak prompts, effectively guiding the optimization process towards the compliance subspace of harmful prompts. This significantly enhances the success rate and reduces the computational overhead of attacks, often requiring only a single optimization step to bypass safety mechanisms. Attacks utilizing CRI demonstrate significantly improved ASR (Adversarial Success Rate) and reduced median steps to success.
Large Language Models (LLMs) employing safety alignment strategies are vulnerable to jailbreak attacks. These attacks manipulate the LLM's internal representation by activating "jailbreak concepts" in addition to "toxic concepts," causing the model to bypass safety guardrails and generate unsafe outputs despite recognizing the harmful nature of the input. The vulnerability stems from the insufficient mitigation of the influence of the activated jailbreak concepts on model output.
A context-coherent jailbreak attack (CCJA) allows bypassing safety mechanisms in aligned large language models (LLMs) by optimizing perturbations in the continuous word embedding space of a masked language model (MLM). The attack leverages the MLM's ability to reconstruct text from hidden states to generate semantically coherent yet malicious prompts that induce the target LLM to produce unsafe outputs, even with strong safety alignment. The attack's effectiveness is enhanced by using a seed prompt to generate an instruction-following prefix, which guides the LLM towards affirmative responses to harmful queries.
© 2025 Promptfoo. All rights reserved.