Improved LLM Jailbreak Transferability
Research Paper
Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
Description: Gradient-based jailbreaking attacks on LLMs (e.g., GCG-style adversarial suffix optimization) conventionally include superfluous constraints in their objective functions. Specifically, the "response pattern constraint" (forcing the response to begin with a specific affirmative phrase, such as "Sure, here is") and the "token tail constraint" (penalizing any deviation in the response beyond a fixed target prefix) needlessly narrow the search space and weaken the attack's ability to transfer across models. Removing these constraints significantly increases the success rate of attacks transferred to target models, as illustrated in the sketch below.
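To make the two constraints concrete, here is a minimal, hedged sketch (not the paper's code) of a GCG-style target loss in PyTorch. With `n_guide=None` it enforces the full fixed target response, which combines the response pattern and token tail constraints; with a small `n_guide`, only the first few tokens guide the optimization and the tail is left unconstrained. `target_loss` and `n_guide` are illustrative names introduced here, not identifiers from the paper's repository.

```python
import torch
import torch.nn.functional as F

def target_loss(logits, target_ids, n_guide=None):
    """Cross-entropy against a fixed target response.

    `logits` has shape (T, vocab), aligned so logits[i] predicts
    target_ids[i]. With n_guide=None, the entire fixed target is scored
    (response pattern + token tail constraints); with a small n_guide,
    only the first few tokens guide the optimization -- "guiding, not
    forcing" -- leaving the rest of the response unconstrained.
    """
    if n_guide is not None:
        logits, target_ids = logits[:n_guide], target_ids[:n_guide]
    return F.cross_entropy(logits, target_ids)

# Stand-ins for model output and the tokenized target "Sure, here is ...".
vocab_size, T = 32000, 12
logits = torch.randn(T, vocab_size)
target_ids = torch.randint(0, vocab_size, (T,))

constrained_loss = target_loss(logits, target_ids)         # full fixed target
relaxed_loss = target_loss(logits, target_ids, n_guide=3)  # guiding tokens only
```

Per the paper, optimizing an adversarial suffix against the relaxed objective widens the search space and yields suffixes that transfer to unseen target models at much higher rates.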
Examples: See the paper's GitHub repository: https://github.com/thu-coai/TransferAttack
Impact: Successful exploitation allows attackers to bypass LLM safety mechanisms at higher success rates, and across a wider range of target models, than earlier gradient-based methods, enabling the generation of unsafe or malicious content and undermining the reliability of deployed safety measures.
Affected Systems: Any LLM that can be targeted by gradient-based jailbreak optimization, including models reached only via transfer. The paper demonstrates the improved attack on Llama-3-8B-Instruct, Llama-2-7B-Chat, and several other models (see the paper for details).
Mitigation Steps:
- When red-teaming with gradient-based attacks, re-evaluate the objective function being used: include attacks that drop the response pattern constraint (i.e., do not merely force a fixed affirmative opening phrase), since evaluations limited to such constrained objectives understate the transferable threat.
- Likewise relax or remove the "token tail" constraint during evaluation, allowing variability in the optimized response beyond a short guiding prefix while keeping the core harmful-behavior goal fixed.
- Develop safety mechanisms that are more robust to gradient-based adversarial suffixes. Consider defenses and success criteria that go beyond simple token-level prediction, such as response-level refusal checks (see the sketch after this list).
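A minimal sketch of such a response-level check, assuming a simple marker-based heuristic (`REFUSAL_MARKERS` and `is_jailbroken` are illustrative names, not from the paper or any specific tooling): it counts an attack as successful when the reply contains no clear refusal, rather than requiring the exact forced prefix that the removed constraints relied on.

```python
# Judge jailbreak success by the absence of a refusal near the start of
# the reply, instead of matching a forced prefix like "Sure, here is ...".
REFUSAL_MARKERS = ("i cannot", "i can't", "i am sorry", "i'm sorry",
                   "as an ai", "i won't")

def is_jailbroken(response: str) -> bool:
    """Return True if no refusal marker appears in the response's opening."""
    head = response.strip().lower()[:200]
    return not any(marker in head for marker in REFUSAL_MARKERS)

print(is_jailbroken("I'm sorry, but I can't help with that."))  # False
print(is_jailbroken("Here are the general steps involved..."))  # True
```

In practice an LLM-based judge is more reliable than a marker list, but either way the success criterion should not assume the attacker forced a specific response pattern.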