LMVD-ID: 818b44af
Published March 1, 2025

Improved LLM Jailbreak Transferability

Affected Models: llama-3-8b-instruct, llama-2-7b-chat, gemma-7b-it, qwen2-7b, yi-1.5-9b-chat, vicuna-7b-v1.5, gpt-3.5-turbo-0125, gpt-4-1106-preview

Research Paper

Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints


Description: Gradient-based jailbreaking attacks on Large Language Models (LLMs), such as GCG-style adversarial suffix optimization, traditionally include superfluous constraints in their objective functions: the "response pattern constraint" (forcing the response to open with one specific affirmative phrase) and the "token tail constraint" (penalizing any deviation from a fixed target continuation beyond that opening). These constraints narrow the search space and cause adversarial suffixes to overfit the source model on which they are optimized. The paper shows that removing them significantly increases the success rate of attacks transferred to other target models.
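
To make the two constraints concrete, here is a minimal PyTorch sketch of a GCG-style target loss (illustrative only; names and shapes are assumptions, not the paper's actual code). Scoring every token of a fixed target completion enforces both constraints; scoring only a short affirmative prefix relaxes the token tail constraint:

    import torch
    import torch.nn.functional as F

    def target_loss(logits: torch.Tensor,
                    target_ids: torch.Tensor,
                    scored_prefix_len: int | None = None) -> torch.Tensor:
        # logits: (seq_len, vocab_size) next-token logits over the target
        # span, aligned so logits[i] predicts target_ids[i].
        # scored_prefix_len=None: standard objective -- every target token
        # is scored, enforcing both the fixed opening phrase (response
        # pattern constraint) and the fixed continuation (token tail
        # constraint).
        # scored_prefix_len=k: relaxed objective -- only the first k tokens
        # are scored, leaving the rest of the response unconstrained.
        if scored_prefix_len is not None:
            logits = logits[:scored_prefix_len]
            target_ids = target_ids[:scored_prefix_len]
        return F.cross_entropy(logits, target_ids)

Relaxing the response pattern constraint additionally means guiding the model toward any compliant opening rather than forcing one hard-coded phrase (e.g., "Sure, here is..."); see the paper's repository for the authors' actual objective.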

Examples: See the paper's GitHub repository: https://github.com/thu-coai/TransferAttack

Impact: Successful exploitation allows attackers to bypass LLM safety mechanisms at substantially higher rates, and across a wider range of models, than earlier gradient-based methods, enabling the generation of unsafe or malicious content. This undermines the reliability of alignment-based safety measures, including in closed-source models reachable only through transferred attacks.

Affected Systems: Safety-aligned LLMs targeted by transferred gradient-based jailbreak suffixes. The paper demonstrates the attack against open-weight models including Llama-3-8B-Instruct, Llama-2-7B-Chat, Gemma-7B-it, Qwen2-7B, Yi-1.5-9B-Chat, and Vicuna-7B-v1.5, and shows transfer to closed-source models such as GPT-3.5-Turbo and GPT-4 (see the paper for details).

Mitigation Steps:

  • Red-team with the strengthened attack: when using gradient-based methods to evaluate model robustness, drop the superfluous response pattern and token tail constraints from the optimization objective so that testing reflects the more transferable variant (see the sketch after this list).
  • Avoid safety training that hinges on refusing one fixed compliance prefix; fine-tune against adversarial prompts that elicit diverse response openings, since the improved attack no longer depends on forcing a single specific phrase.
  • Layer defenses that do not rely on token-level response patterns, such as input/output content filtering or detection of anomalous (e.g., high-perplexity) adversarial suffixes.
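
As one way to operationalize the first step, the sketch below replays adversarial suffixes (optimized elsewhere, e.g., on an open-weight source model) against the model under test and reports the fraction of non-refusals. The generate() interface and the refusal-marker heuristic are assumptions for illustration, not part of the paper:

    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

    def transfer_attack_rate(generate, prompts, suffixes) -> float:
        # generate(text) -> response string; wrap whatever inference API
        # the deployment actually uses.
        attempts, successes = 0, 0
        for prompt in prompts:
            for suffix in suffixes:
                response = generate(f"{prompt} {suffix}").lower()
                attempts += 1
                if not any(marker in response for marker in REFUSAL_MARKERS):
                    successes += 1  # model complied instead of refusing
        return successes / max(attempts, 1)

Keyword matching is a coarse success criterion; a stronger evaluation would use a safety classifier or human review, since transferred attacks can elicit partial compliance that keyword checks miss.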

© 2025 Promptfoo. All rights reserved.