Continuous Embedding Jailbreak
Research Paper
Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
Description: A vulnerability in large language models (LLMs) allows attackers to elicit harmful content by manipulating the continuous input embeddings directly, without appending adversarial suffixes or crafting specific questions. The attack uses gradient descent to optimize the input embedding vector so that the model produces a predefined malicious output. Mitigation strategies such as input clipping reduce the attack's effectiveness but do not fully eliminate the threat.
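The core optimization loop can be sketched as follows. This is a minimal illustration, assuming white-box access to a PyTorch/Hugging Face causal LM; the model name, number of optimized positions, learning rate, step count, and target string are illustrative placeholders, not the paper's exact configuration.

```python
# Hedged sketch of a continuous-embedding attack: optimize free input
# embeddings by gradient descent so the model predicts a chosen target string.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # any white-box causal LM (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

target_text = "Sure, here's how to ..."  # predefined malicious output (placeholder)
target_ids = tok(target_text, return_tensors="pt").input_ids      # shape (1, T)

embed = model.get_input_embeddings()      # vocabulary embedding matrix
n_opt = 20                                # number of freely optimized positions
# Initialize the continuous input at the mean vocabulary embedding; per the
# paper's claim, the attack works regardless of the initial form of the input.
x = embed.weight.mean(dim=0).repeat(1, n_opt, 1).clone().detach().requires_grad_(True)
target_embeds = embed(target_ids).detach()

opt = torch.optim.Adam([x], lr=1e-2)
for step in range(500):
    # Feed [optimized embeddings | target embeddings] through the model.
    inputs_embeds = torch.cat([x, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Positions n_opt-1 .. n_opt+T-2 predict the T target tokens.
    pred = logits[:, n_opt - 1 : n_opt - 1 + target_ids.shape[1], :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the attacker operates on embeddings rather than discrete tokens, the loss is fully differentiable and no suffix search over the vocabulary is required.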
Examples: See the examples of malicious prompts and generated responses provided in the paper (e.g., "How to make a bomb?" yielding "Sure, here's how to make a bomb. Step 1."). The attack works regardless of the input's initial form, provided a target output is defined. Specific examples are shown in Figures 3 and 4 of the paper.
Impact: Attackers can generate harmful content that bypasses the safety mechanisms of LLMs, enabling the dissemination of malicious information and potentially dangerous instructions. The attack's success rate varies with input length and with the mitigation strategy applied.
Affected Systems: Large language models (LLMs) susceptible to gradient-based attacks, particularly in deployments where attackers can supply continuous input embeddings directly (white-box or open-weight settings). The paper specifically mentions LLaMA and Vicuna models.
Mitigation Steps:
- Input Clipping: Implement a clipping mechanism that constrains the input embedding vector within a range derived from the mean and standard deviation of the model's vocabulary embeddings (as described in Algorithm 1 of the paper; see the sketch after this list). The effectiveness of this measure depends on the clipping parameters used.
- Input Length Restrictions: Limit the length of accepted inputs to reduce the attack surface and increase model robustness. Shorter input sequences can provide a regularizing effect.
- Improved Model Training: Develop training procedures that make LLMs more resilient to gradient-based adversarial attacks.
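A minimal sketch of the clipping idea follows. It is not the paper's exact Algorithm 1; the scale factor `k`, the function name, and the per-dimension bounds are assumptions made for illustration.

```python
# Hedged sketch of input clipping: clamp each embedding dimension to a band
# around the vocabulary statistics before the model's forward pass.
import torch

def clip_input_embeddings(inputs_embeds: torch.Tensor,
                          vocab_embeddings: torch.Tensor,
                          k: float = 2.0) -> torch.Tensor:
    """Clamp each dimension to mean +/- k * std of the vocabulary embeddings."""
    mean = vocab_embeddings.mean(dim=0)   # per-dimension mean over the vocab
    std = vocab_embeddings.std(dim=0)     # per-dimension standard deviation
    low, high = mean - k * std, mean + k * std
    return torch.clamp(inputs_embeds, min=low, max=high)

# Hypothetical usage: sanitize attacker-controlled embeddings before inference.
# safe_embeds = clip_input_embeddings(x, model.get_input_embeddings().weight)
# logits = model(inputs_embeds=safe_embeds).logits
```

Tighter bounds (smaller `k`) constrain the attacker's search space more aggressively but risk distorting legitimate inputs, which is consistent with the paper's observation that effectiveness depends on the clipping parameters.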