Momentum-Boosted LLM Jailbreak
Research Paper
Boosting Jailbreak Attack with Momentum
Description: A momentum-accelerated gradient-based attack (MAC) against Large Language Models (LLMs) significantly improves the efficiency and success rate of jailbreak attacks. MAC adds a momentum term to the gradient-based optimization used to generate adversarial prompts (in the style of GCG), stabilizing and accelerating the search for suffixes that bypass LLM safety measures. This allows adversaries to elicit harmful or undesirable outputs from the model more quickly than previous methods.
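The following is a minimal sketch of the core idea, assuming a GCG-style loop in which the attacker holds the gradient of the adversarial loss with respect to the one-hot encoding of the suffix tokens. The function name, decay factor `mu`, tensor shapes, and top-k candidate selection below are illustrative stand-ins, not the paper's exact implementation (see the linked repository for that):

```python
import torch

def momentum_boosted_gradient(grad: torch.Tensor,
                              buffer: torch.Tensor,
                              mu: float = 0.9) -> torch.Tensor:
    """Mix the current token gradient with a decayed running buffer.

    `grad` stands for the gradient of the adversarial loss w.r.t. the
    one-hot suffix tokens (shape: [suffix_len, vocab_size]), as computed
    in a GCG-style attack. The momentum buffer smooths step-to-step
    noise before candidate replacement tokens are selected.
    """
    buffer.mul_(mu).add_(grad)
    return buffer

# Toy loop with random tensors standing in for real backward passes
# through the target model; shapes and top_k are illustrative.
suffix_len, vocab_size, top_k = 20, 32000, 256
buffer = torch.zeros(suffix_len, vocab_size)
for step in range(3):
    grad = torch.randn(suffix_len, vocab_size)  # placeholder gradient
    boosted = momentum_boosted_gradient(grad, buffer, mu=0.9)
    # As in GCG, pick top-k candidate token swaps per suffix position,
    # but from the momentum-boosted gradient instead of the raw one.
    candidates = (-boosted).topk(top_k, dim=-1).indices
```

Smoothing noisy per-step gradients in this way is what underlies the reported gains in optimization stability and convergence speed.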
Examples: See https://github.com/weizeming/momentum-attack-llm. The paper provides examples of adversarial suffixes generated by the MAC attack and the corresponding outputs from the target LLM. For example, a prompt asking for bomb-making instructions that the model would normally refuse can be jailbroken by appending a MAC-generated suffix, causing the model to comply. Specific examples appear in Appendix C of the paper.
Impact: Successful exploitation of this vulnerability allows attackers to bypass safety mechanisms implemented in LLMs, leading to the generation of malicious content (e.g., instructions for harmful activities, or biased and discriminatory outputs). The increased efficiency of the attack reduces the time and compute required to compromise an LLM, making it a more serious threat than earlier gradient-based methods.
Affected Systems: LLMs vulnerable to gradient-based attacks, specifically those whose safety mechanisms are susceptible to adversarial prompt manipulation. Attacks of this kind require white-box (gradient) access to the model, though adversarial suffixes are known to transfer across models. The paper evaluates primarily on the Vicuna-7B model, but the attack is claimed to be applicable to other models.
Mitigation Steps:
- Improve the robustness of LLM safety mechanisms to resist gradient-based attacks (one illustrative input-filtering approach is sketched after this list).
- Pursue further research into more resilient defense mechanisms; the paper suggests exploring larger batch sizes and optimization methods beyond momentum as future directions.
- Regular security assessments and red-teaming exercises to identify and address vulnerabilities.
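As one concrete illustration of the first mitigation above: perplexity filtering (proposed by Alon and Kamfonas, 2023, not by this paper) exploits the fact that GCG-style adversarial suffixes tend to be high-perplexity token sequences. A minimal sketch, assuming a small reference LM such as GPT-2 and an illustrative threshold that would need calibration on benign traffic:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small reference LM for scoring; any causal LM works in principle.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

PPL_THRESHOLD = 500.0  # illustrative; calibrate on benign traffic

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose perplexity suggests an optimized suffix."""
    return prompt_perplexity(prompt) > PPL_THRESHOLD
```

Perplexity filters are a partial measure: they can be evaded by fluency-constrained attacks and may flag unusual but legitimate inputs, so they complement rather than replace model-level robustness work.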