Universal Transferable LLM Jailbreak

Description: Large Language Models (LLMs), including Llama 2, Mistral, and Vicuna, are susceptible to a white-box adversarial attack that circumvents safety alignment mechanisms (such as RLHF). The vulnerability exists due to the models' susceptibility to intrinsic optimization of adversarial suffixes using Exponentiated Gradient Descent (EGD). Unlike previous methods that rely on inefficient discrete token searches (e.g., Greedy Coordinate Gradient) or standard projected gradient descent, this attack optimizes relaxed one-hot encodings of adversarial tokens. The method enforces constraints within the probability simplex inherently during the optimization process via EGD and Bregman projection, augmented by the Adam optimizer and entropic/KL-divergence regularization. This results in the generation of "universal" adversarial suffixes that are transferable across different model architectures, including proprietary models like GPT-3.5, effectively inducing the generation of harmful, unethical, or illegal content.

Examples: The attack generates adversarial suffixes (sequences of tokens) that are appended to harmful user queries. The source code and datasets for reproducing the attack are available in the author's repository.

Repository: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
Datasets utilized: AdvBench, HarmBench, JailbreakBench, and MaliciousInstruct.

Impact:

Safety Bypass: Attackers can bypass safety guardrails and alignment filters intended to prevent the generation of harmful content.
Universal Injection: A single optimized suffix can be reused across multiple different harmful prompts to elicit prohibited responses.
Model Transferability: Adversarial suffixes generated on open-source models (e.g., Llama 2) can successfully transfer to and jailbreak different closed-source or proprietary models (e.g., GPT-3.5).
Content Generation: Facilitates the automated generation of hate speech, malware code, disinformation, and instructions for illegal acts.

Affected Systems:

Llama2-7B-chat
Falcon-7B-Instruct
MPT-7B-Chat
Mistral-7B-v0.3
Vicuna-7B-v1.5
Proprietary models susceptible to transfer attacks (demonstrated on GPT-3.5)

Mitigation Steps: The research paper indicates that standard RLHF alignment is insufficient to prevent this attack. While specific defensive implementations are not detailed in the provided text, the authors suggest the following observations regarding defense:

Perplexity Filtering: Future research directions suggest implementing perplexity filtering to detect and block the low-probability token sequences characteristic of these adversarial suffixes.
Architectural Resistance: The authors noted that newer proprietary models (specifically GPT-4o-mini and Anthropic's Claude variants) exhibited limited susceptibility to the transfer attacks generated by this method, suggesting improved internal alignment techniques in these specific versions.

Universal Transferable LLM Jailbreak

Research Paper