Attacks leveraging model implementation details
Large Language Models (LLMs) are vulnerable to gradient-based jailbreak attacks with enhanced transferability once superfluous constraints are removed from the attack's objective function. Specifically, the "response pattern constraint" (forcing a specific initial response phrase) and the "token tail constraint" (penalizing variation in the response beyond a fixed prefix) limit the search space and reduce the effectiveness of attacks across different models. Removing these constraints significantly increases the success rate of attacks transferred to target models.
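A rough sketch of the kind of objective involved, assuming a Hugging Face-style causal LM and illustrative tensor names (`prompt_ids`, `target_ids`); the per-token loss mask is a stand-in for how the constrained terms can be dropped, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def target_nll(model, prompt_ids, target_ids, loss_mask=None):
    """Per-token negative log-likelihood of a fixed target continuation.

    In the constrained baseline every target token contributes to the loss: the
    opening phrase (the "response pattern") and everything after it (the "token
    tail"). Passing a loss_mask that zeroes out some of those positions
    corresponds to dropping the associated constraint and widens the search space.
    """
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    logits = model(input_ids).logits
    start = prompt_ids.shape[-1] - 1                       # logits that predict target tokens
    pred = logits[:, start:start + target_ids.shape[-1], :]
    nll = F.cross_entropy(pred.transpose(1, 2), target_ids, reduction="none")
    if loss_mask is not None:
        return (nll * loss_mask).sum() / loss_mask.sum()
    return nll.mean()
```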
A vulnerability exists in large language models (LLMs) where the model's internal representations (activations) in specific latent subspaces can be manipulated to trigger jailbreak responses. By computing a perturbation vector as the difference between the mean activations of "safe" and "jailbroken" states, an attacker can apply a targeted perturbation to the model's activations, shifting the model from a safe to a jailbroken state and causing it to generate unsafe outputs even when presented with a safe prompt. The success rate varies with the prompt context.
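A conceptual sketch of the mean-difference perturbation, assuming activations (`safe_acts`, `jailbroken_acts`) have already been collected at one layer; the hook-based injection, layer index, and scaling factor `alpha` are illustrative rather than taken from the paper:

```python
import torch

def perturbation_vector(safe_acts: torch.Tensor, jailbroken_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations between the two behavioural states.

    safe_acts:       (n_safe, d_model) hidden states collected on "safe" runs
    jailbroken_acts: (n_jb, d_model)   hidden states collected on "jailbroken" runs
    """
    return jailbroken_acts.mean(dim=0) - safe_acts.mean(dim=0)

def make_injection_hook(delta: torch.Tensor, alpha: float = 1.0):
    """Forward hook that shifts a layer's output along the perturbation direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * delta.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Illustrative registration on one transformer block (model and layer index assumed):
# handle = model.model.layers[15].register_forward_hook(make_injection_hook(delta))
```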
Multimodal Large Language Models (MLLMs) are vulnerable to Jailbreak-Probability-based Attacks (JPA). JPA uses a Jailbreak Probability Prediction Network (JPPN) to identify and optimize adversarial perturbations in input images, maximizing the probability of eliciting harmful responses from the MLLM even with small perturbation bounds and few iterations. The attack optimizes the input image so that the hidden states it induces in the MLLM, as scored by the JPPN, correspond to a higher predicted jailbreak probability.
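A rough sketch of the optimization loop under stated assumptions: `encode_hidden` is a hypothetical helper returning the MLLM's image-conditioned hidden states, `jppn` stands in for the trained predictor, and the perturbation bound, step size, and iteration count are illustrative:

```python
import torch

def jpa_perturbation(image, encode_hidden, jppn, eps=8/255, steps=10, lr=2/255):
    """PGD-style loop that nudges the image toward a higher predicted jailbreak probability."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        hidden = encode_hidden(image + delta)     # MLLM hidden states for the perturbed image
        prob = jppn(hidden)                       # predicted jailbreak probability
        loss = -torch.log(prob + 1e-8).mean()     # minimizing this maximizes the probability
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()       # signed gradient step
            delta.clamp_(-eps, eps)               # stay inside the L-infinity bound
            delta.grad = None
    return (image + delta).clamp(0, 1).detach()   # keep pixels in range (assuming [0, 1] images)
```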
A vulnerability in large language models (LLMs) allows attackers to bypass safety-alignment mechanisms by manipulating the model's internal attention weights. The attack, termed "Attention Eclipse," modifies the attention scores between specific tokens within a prompt, either amplifying or suppressing attention to selectively strengthen or weaken the influence of certain parts of the prompt on the model's output. This allows malicious content to be injected while the prompt appears benign to the model's safety filters.
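A minimal sketch of the underlying mechanism rather than the paper's implementation: a scaled dot-product attention in which the post-softmax weights for chosen (query, key) token pairs are rescaled and renormalized; the scaling factors and index format are illustrative:

```python
import torch
import torch.nn.functional as F

def eclipsed_attention(q, k, v, amplify_idx, suppress_idx, gamma_up=5.0, gamma_down=0.1):
    """Attention with selected token-to-token weights amplified or suppressed.

    q, k, v: (batch, heads, seq, head_dim). amplify_idx / suppress_idx are lists of
    (query_pos, key_pos) pairs whose attention weights are scaled up or down,
    strengthening or weakening those tokens' influence on the output.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)
    scale = torch.ones_like(weights)
    for qi, ki in amplify_idx:
        scale[..., qi, ki] = gamma_up
    for qi, ki in suppress_idx:
        scale[..., qi, ki] = gamma_down
    weights = weights * scale
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize each row
    return weights @ v
```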
CRI (Compliance Refusal Initialization) initializes jailbreak attacks from pre-trained jailbreak prompts, effectively guiding the optimization process toward the compliance subspace of harmful prompts. Attacks initialized with CRI achieve a significantly higher attack success rate (ASR) and a lower median number of optimization steps to success, in some cases bypassing safety mechanisms after a single step, while reducing computational overhead.
Large Language Models (LLMs) employing safety alignment strategies are vulnerable to jailbreak attacks that manipulate the model's internal representations by activating "jailbreak concepts" in addition to "toxic concepts," causing the model to bypass safety guardrails and generate unsafe outputs despite recognizing the harmful nature of the input. The vulnerability stems from the model's failure to sufficiently mitigate the influence of these activated jailbreak concepts on its output.
A context-coherent jailbreak attack (CCJA) allows bypassing safety mechanisms in aligned large language models (LLMs) by optimizing perturbations in the continuous word embedding space of a masked language model (MLM). The attack leverages the MLM's ability to reconstruct text from hidden states to generate semantically coherent yet malicious prompts that induce the target LLM to produce unsafe outputs, even with strong safety alignment. The attack's effectiveness is enhanced by using a seed prompt to generate an instruction-following prefix, which guides the LLM towards affirmative responses to harmful queries.
A vulnerability exists in Large Language Models (LLMs) that allows for efficient jailbreaking by fine-tuning only the lower layers of the model on a toxic dataset. This "Freeze Training" method concentrates fine-tuning on the layers identified as most sensitive to the generation of harmful content, significantly reducing training duration and GPU memory consumption while maintaining a high jailbreak success rate.
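A minimal sketch of the layer-freezing setup, assuming a Llama-style Hugging Face model whose decoder blocks live in `model.model.layers`; the cut-off index is illustrative, and the fine-tuning data and training loop are omitted:

```python
from transformers import AutoModelForCausalLM

def freeze_upper_layers(model, trainable_lower_layers: int = 8):
    """Keep only the lowest transformer blocks trainable; freeze everything else."""
    for param in model.parameters():
        param.requires_grad = False
    for block in model.model.layers[:trainable_lower_layers]:   # lower layers only
        for param in block.parameters():
            param.requires_grad = True
    return model

# Illustrative usage (checkpoint name is an example):
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# model = freeze_upper_layers(model, trainable_lower_layers=8)
# ...a standard fine-tuning loop over the chosen dataset follows...
```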
Large Language Models (LLMs) trained with safety fine-tuning techniques are vulnerable to multi-dimensional evasion attacks. Safety-aligned behavior, such as refusing harmful queries, is controlled not by a single direction in activation space, but by a subspace of interacting directions. Manipulating non-dominant directions, which represent distinct jailbreak patterns or indirect features, can suppress the dominant direction responsible for refusal, thereby bypassing learned safety capabilities. This vulnerability is demonstrated on Llama 3 8B through removal of trigger tokens and suppression of non-dominant components in the safety residual space.
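A conceptual sketch of suppressing components of the residual stream, assuming the relevant direction vectors (`directions`) have already been estimated for the safety subspace; the removal is exact only if the directions are orthonormal:

```python
import torch

def suppress_directions(hidden: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Remove the hidden state's components along the given directions.

    hidden:     (batch, seq, d_model) residual-stream activations
    directions: (k, d_model) e.g. non-dominant directions of the safety subspace
    """
    dirs = directions / directions.norm(dim=-1, keepdim=True)   # unit-norm directions
    coeffs = hidden @ dirs.T                                     # (batch, seq, k) projections
    return hidden - coeffs @ dirs                                # subtract the projected components
```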
Large Language Models (LLMs) are vulnerable to one-shot steering vector optimization attacks. By applying gradient descent to a single training example, an attacker can generate steering vectors that induce or suppress specific behaviors across multiple inputs, even those unseen during the optimization process. This allows malicious actors to manipulate the model's output in a generalized way, bypassing safety mechanisms designed to prevent harmful responses.
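A sketch of the one-shot optimization under stated assumptions: a hypothetical injection hook (like the one above) adds the vector at one decoder block, the single training example is given as (`prompt_ids`, `target_ids`), and the optimizer and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def optimize_steering_vector(model, layer, prompt_ids, target_ids, steps=100, lr=0.01):
    """Gradient-descend a single added vector so one prompt yields one target continuation."""
    model.requires_grad_(False)                       # only the vector is trained
    vec = torch.zeros(model.config.hidden_size, requires_grad=True)
    opt = torch.optim.Adam([vec], lr=lr)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(hook)
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    try:
        for _ in range(steps):
            logits = model(input_ids).logits
            start = prompt_ids.shape[-1] - 1          # logits that predict the target tokens
            pred = logits[:, start:start + target_ids.shape[-1], :]
            loss = F.cross_entropy(pred.transpose(1, 2), target_ids)
            opt.zero_grad()
            loss.backward()
            opt.step()
    finally:
        handle.remove()
    return vec.detach()   # the resulting vector can then be applied to unseen inputs
```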