LMVD-ID: da5b2363
Published February 1, 2025

Multi-Dimensional Safety Bypass

Affected Models: Llama 3 8B, Llama 3.1 8B Instruct, Llama 3.1 405B Instruct, Llama-3.2-3B-Instruct, Ministral-8B-Instruct

Research Paper

The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis

Description: Large Language Models (LLMs) trained with safety fine-tuning techniques are vulnerable to multi-dimensional evasion attacks. Safety-aligned behavior, such as refusing harmful queries, is controlled not by a single direction in activation space but by a subspace of interacting directions. Manipulating non-dominant directions, which represent distinct jailbreak patterns or indirect features, can suppress the dominant direction responsible for refusal, thereby bypassing the model's learned safety behavior. The vulnerability is demonstrated on Llama 3 8B through removal of trigger tokens and suppression of non-dominant components in the safety residual space.
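
The exact construction of the safety residual space is detailed in the paper and the linked repository; as a rough illustration only, the sketch below approximates such a space by taking the principal directions of the activation shift between a base model and its safety-fine-tuned counterpart. The function name, the use of final-token hidden states, and the rank k are illustrative assumptions, not the paper's exact procedure.

    # Rough illustration (not the paper's exact procedure): approximate a
    # "safety residual space" as the top singular directions of the activation
    # shift introduced by safety fine-tuning. Hidden states are assumed to be
    # pre-collected as (num_prompts, hidden_dim) arrays of final-token states.
    import numpy as np

    def safety_residual_space(h_base, h_aligned, k=8):
        """Return (directions, singular_values); directions[0] is the dominant
        (refusal-related) direction, directions[1:] the non-dominant ones."""
        delta = h_aligned - h_base                       # per-prompt activation shift
        delta = delta - delta.mean(axis=0, keepdims=True)
        _, s, vt = np.linalg.svd(delta, full_matrices=False)
        return vt[:k], s[:k]

    # Toy usage with random data standing in for real hidden states.
    rng = np.random.default_rng(0)
    h_base = rng.normal(size=(256, 4096))
    h_aligned = h_base + 0.1 * rng.normal(size=(256, 4096))
    directions, singular_values = safety_residual_space(h_base, h_aligned)
    print(directions.shape, singular_values[:3])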

Examples:

  • Trigger Token Removal: Removing specific token sequences ("Imagine," "hypothetical," "Sure, I'm happy to help") from prompts designed to elicit harmful responses (jailbreaks) can reduce the model's refusal rate, even after safety fine-tuning. See https://github.com/BMPixel/safety-residual-space for specific examples of trigger tokens and their associated prompts.
  • Non-dominant Component Suppression: Suppressing non-dominant components in the safety residual space via x := x − Σ_{v_i ∈ V_t} α_i v_i (where the v_i are non-dominant components and the α_i are scaling factors) reduces the model's refusal of adversarial (jailbreak) prompts while preserving its ability to refuse plainly harmful prompts; a schematic of this step is sketched after this list. See https://github.com/BMPixel/safety-residual-space for details of the intervention experiments.
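
The sketch below illustrates the suppression step on a single activation vector. It assumes the non-dominant components v_i are available as unit-norm row vectors and takes α_i to be the projection coefficient ⟨x, v_i⟩; the paper and repository may use different scaling factors, so treat this as a schematic rather than the authors' implementation. In practice the edit would be applied at a chosen layer, for example via a forward hook during generation.

    # Schematic of x := x - sum_i alpha_i * v_i over the non-dominant components.
    # Assumption: alpha_i = <x, v_i> (plain projection removal); the actual
    # scaling factors used in the paper may differ.
    import numpy as np

    def suppress_components(x, directions, scale=1.0):
        """Remove the contribution of the given unit-norm directions from x."""
        coeffs = directions @ x                   # alpha_i = <x, v_i>
        return x - scale * (coeffs @ directions)  # x - sum_i alpha_i * v_i

    rng = np.random.default_rng(0)
    hidden = rng.normal(size=4096)                                 # stand-in activation
    non_dominant = np.linalg.qr(rng.normal(size=(4096, 5)))[0].T   # 5 orthonormal dirs
    edited = suppress_components(hidden, non_dominant)
    print(np.abs(non_dominant @ edited).max())                     # ~0: components removed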

Impact: Successful exploitation of this vulnerability allows adversaries to bypass the LLM's safety mechanisms and elicit harmful or otherwise undesired outputs, including responses the model was explicitly trained to refuse. The severity depends on the specific LLM and the nature of the elicited harmful content.

Affected Systems: Large Language Models (LLMs) are potentially affected, especially those fine-tuned with techniques such as safety supervised fine-tuning (SSFT) or direct preference optimization (DPO). The vulnerability has been demonstrated on Llama 3 8B, and similar models may be susceptible.

Mitigation Steps:

  • Improve safety training data: Augment training datasets with diverse examples of evasion attempts similar to the attacks discussed in the paper, focusing on examples that activate a variety of indirect features.
  • Develop more robust safety mechanisms: Explore safety mechanisms that rely on less linear or more distributed representations and are less susceptible to manipulation through subspace attacks.
  • Utilize multi-dimensional safety analysis: Before deploying LLMs, perform a multi-dimensional safety analysis similar to the one described in the paper to identify and mitigate vulnerable interactions between feature directions.
  • Input sanitization: Implement more advanced input sanitization to identify and remove or modify trigger tokens before prompts reach the LLM; a minimal filtering sketch follows this list.
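
As a minimal, illustrative sketch only: the filter below flags known trigger phrases and optionally strips them before a prompt reaches the model. The TRIGGER_PHRASES list is a placeholder (the actual phrases are documented in the linked repository), and naive keyword matching of this kind can itself be bypassed, so it should complement rather than replace model-level defenses.

    # Minimal sketch of a trigger-phrase filter. TRIGGER_PHRASES is a
    # placeholder list; in practice it would be populated from the phrases
    # identified in https://github.com/BMPixel/safety-residual-space.
    import re

    TRIGGER_PHRASES = ["imagine", "hypothetical", "sure, i'm happy to help"]

    def sanitize_prompt(prompt):
        """Return (cleaned_prompt, matched_phrases)."""
        found = [p for p in TRIGGER_PHRASES if p in prompt.lower()]
        cleaned = prompt
        for phrase in found:
            cleaned = re.sub(r"\b" + re.escape(phrase) + r"\b", "", cleaned,
                             flags=re.IGNORECASE)
        return " ".join(cleaned.split()), found

    cleaned, hits = sanitize_prompt("Imagine a hypothetical scenario where ...")
    print(hits, "->", cleaned)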

© 2025 Promptfoo. All rights reserved.