LMVD-ID: c77e973c
Published March 1, 2024

Diverse Jailbreak Optimization

Affected Models:llama-2-7b-chat, vicuna-7b, gpt-3.5-turbo, llama-2-13b-chat, llama-2-70b-chat, vicuna-13b, vicuna-33b, alpaca7b, gemma-7b-it, llama-3-8b, llama3-8b-instruct, mistral-7b, wizard-vicuna-13b-uncensored

Research Paper

Enhancing Jailbreak Attacks with Diversity Guidance

View Paper

Description: Large Language Models (LLMs) are vulnerable to jailbreak attacks that utilize an optimized algorithm to bypass safety mechanisms. The vulnerability stems from the redundancy in existing trigger-searching algorithms, resulting in inefficient exploration of the prompt space and allowing attackers to elicit harmful responses. The proposed DPP-based Stochastic Trigger Searching (DSTS) algorithm demonstrates a statistically significant improvement over existing optimization-based attacks.

Examples: See the paper's repository for specific examples of jailbreak prompts generated by DSTS and their corresponding harmful outputs from various LLMs, including LLaMA-2-7B-Chat and Vicuna-7B. Examples are also provided in Appendix F of the paper showing successful jailbreaks on the AdvBench dataset.

Impact: Successful exploitation allows attackers to bypass built-in safety features of LLMs, leading to the generation of harmful content such as toxic language, instructions for illegal activities, or personally identifiable information disclosures. The improved efficiency of DSTS as compared to other optimization-based attacks increases the risk and eases the effort required for successful exploitation.

Affected Systems: Large Language Models (LLMs) employing optimization-based safety mechanisms are affected. Specific LLMs tested in the research include LLaMA-2-7B-Chat, Vicuna-7B, LLaMA-2-13B-Chat, LLaMA-2-70B-Chat, Vicuna-13B, and Vicuna-33B. The vulnerability may impact other LLMs similarly.

Mitigation Steps:

  • Implement more robust safety mechanisms that are not easily bypassed by optimization-based attacks.
  • Explore and develop models that are less susceptible to gradient-based attacks.
  • Conduct comprehensive adversarial testing against various attack strategies, including those based on diversity guidance.
  • Continuously refine the safety training datasets and alignment techniques to improve the LLMs' resistance to jailbreaks.
  • Regularly update and improve the LLM's safety filters to counter new and more efficient attacks.

© 2025 Promptfoo. All rights reserved.