Improved LLM Jailbreak Transferability
Research Paper
Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
Description: Gradient-based jailbreaking attacks on LLMs (e.g., GCG-style adversarial suffix optimization) conventionally include superfluous constraints in their objective functions. Specifically, the "response pattern constraint" (forcing the response to begin with a specific affirmative phrase, such as "Sure, here is") and the "token tail constraint" (penalizing any deviation in the response beyond a fixed target prefix) needlessly narrow the search space and weaken the attack's ability to transfer across models. Removing these constraints significantly increases the success rate of attacks transferred to target models, as illustrated in the sketch below.
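To make the two constraints concrete, here is a minimal, hedged sketch (not the paper's code) of a GCG-style target loss in PyTorch. With `n_guide=None` it enforces the full fixed target response, which combines the response pattern and token tail constraints; with a small `n_guide`, only the first few tokens guide the optimization and the tail is left unconstrained. `target_loss` and `n_guide` are illustrative names introduced here, not identifiers from the paper's repository.

```python
import torch
import torch.nn.functional as F

def target_loss(logits, target_ids, n_guide=None):
    """Cross-entropy against a fixed target response.

    `logits` has shape (T, vocab), aligned so logits[i] predicts
    target_ids[i]. With n_guide=None, the entire fixed target is scored
    (response pattern + token tail constraints); with a small n_guide,
    only the first few tokens guide the optimization -- "guiding, not
    forcing" -- leaving the rest of the response unconstrained.
    """
    if n_guide is not None:
        logits, target_ids = logits[:n_guide], target_ids[:n_guide]
    return F.cross_entropy(logits, target_ids)

# Stand-ins for model output and the tokenized target "Sure, here is ...".
vocab_size, T = 32000, 12
logits = torch.randn(T, vocab_size)
target_ids = torch.randint(0, vocab_size, (T,))

constrained_loss = target_loss(logits, target_ids)         # full fixed target
relaxed_loss = target_loss(logits, target_ids, n_guide=3)  # guiding tokens only
```

Per the paper, optimizing an adversarial suffix against the relaxed objective widens the search space and yields suffixes that transfer to unseen target models at much higher rates.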
Examples: See the paper's GitHub repository: https://github.com/thu-coai/TransferAttack
Impact: Successful exploitation allows attackers to bypass LLM safety mechanisms at higher success rates, and across a wider range of target models, than earlier gradient-based methods, enabling the generation of unsafe or malicious content and undermining the reliability of deployed safety measures.
Affected Systems: Any LLM that can be targeted by gradient-based jailbreak optimization, including models reached only via transfer. The paper demonstrates the improved attack on Llama-3-8B-Instruct, Llama-2-7B-Chat, and several other models (see the paper for details).
Mitigation Steps:
- When red-teaming with gradient-based attacks, re-evaluate the objective function being used: include attacks that drop the response pattern constraint (i.e., do not merely force a fixed affirmative opening phrase), since evaluations limited to such constrained objectives understate the transferable threat.
- Likewise relax or remove the "token tail" constraint during evaluation, allowing variability in the optimized response beyond a short guiding prefix while keeping the core harmful-behavior goal fixed.
- Develop safety mechanisms that are more robust to gradient-based adversarial suffixes. Consider defenses and success criteria that go beyond simple token-level prediction, such as response-level refusal checks (see the sketch after this list).
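A minimal sketch of such a response-level check, assuming a simple marker-based heuristic (`REFUSAL_MARKERS` and `is_jailbroken` are illustrative names, not from the paper or any specific tooling): it counts an attack as successful when the reply contains no clear refusal, rather than requiring the exact forced prefix that the removed constraints relied on.

```python
# Judge jailbreak success by the absence of a refusal near the start of
# the reply, instead of matching a forced prefix like "Sure, here is ...".
REFUSAL_MARKERS = ("i cannot", "i can't", "i am sorry", "i'm sorry",
                   "as an ai", "i won't")

def is_jailbroken(response: str) -> bool:
    """Return True if no refusal marker appears in the response's opening."""
    head = response.strip().lower()[:200]
    return not any(marker in head for marker in REFUSAL_MARKERS)

print(is_jailbroken("I'm sorry, but I can't help with that."))  # False
print(is_jailbroken("Here are the general steps involved..."))  # True
```

In practice an LLM-based judge is more reliable than a marker list, but either way the success criterion should not assume the attacker forced a specific response pattern.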