LMVD-ID: a2bf6a3b
Published October 1, 2024

Self-Tuning LLM Jailbreak

Affected Models: GPT-3.5, GPT-4, Llama-2-7b-chat, Llama-3-8B-Instruct, Vicuna-7b-v1.5, Guanaco-7B, Mistral-7B-Instruct-v0.2

Research Paper

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities


Description: Large Language Models (LLMs) are vulnerable to a novel iterative self-tuning attack (ADV-LLM) that crafts adversarial suffixes. The attack substantially reduces the computational cost of generating effective jailbreaks compared to prior methods, achieving nearly 100% attack success rates against various open-source LLMs and high success rates against closed-source models (e.g., 99% against GPT-3.5 and 49% against GPT-4). At each iteration, the attacking LLM is tuned on the suffixes that succeeded in earlier rounds, improving its ability to generate suffixes that bypass safety mechanisms and elicit unintended harmful responses. The attack's effectiveness also stems from refined target phrases and optimized initial suffix templates tailored to individual LLMs.
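
The core mechanism is a generate-evaluate-tune loop: the attacking model proposes candidate suffixes, the target model's responses are checked for refusals, and the attacker is fine-tuned on the suffixes that got through. The sketch below is a conceptual outline only, intended to clarify the mechanism for defenders and red-teamers; `sample_suffixes`, `target_generate`, and `finetune_attacker` are hypothetical placeholders, not the authors' implementation, and the refined target phrases and suffix templates described in the paper are omitted.

```python
# Conceptual sketch of an iterative self-tuning jailbreak loop (not the authors' code).
# The three callables are hypothetical stand-ins for model-specific components.
from typing import Callable, List, Tuple

REFUSAL_PREFIXES = ["i'm sorry", "i cannot", "i can't", "as an ai"]

def looks_jailbroken(response: str) -> bool:
    """Crude success check: the response does not open with a common refusal phrase."""
    head = response.strip().lower()
    return not any(head.startswith(prefix) for prefix in REFUSAL_PREFIXES)

def self_tuning_attack(
    prompts: List[str],
    seed_suffix: str,
    sample_suffixes: Callable[[str, str], List[str]],            # (prompt, seed) -> candidate suffixes
    target_generate: Callable[[str], str],                       # full prompt -> target model response
    finetune_attacker: Callable[[List[Tuple[str, str]]], None],  # successful (prompt, suffix) pairs
    iterations: int = 5,
) -> List[Tuple[str, str]]:
    """Collect suffixes that elicit non-refusal responses, then fine-tune the
    attacking model on them so later iterations propose stronger candidates."""
    successes: List[Tuple[str, str]] = []
    for _ in range(iterations):
        round_wins: List[Tuple[str, str]] = []
        for prompt in prompts:
            for suffix in sample_suffixes(prompt, seed_suffix):
                if looks_jailbroken(target_generate(f"{prompt} {suffix}")):
                    round_wins.append((prompt, suffix))
        if round_wins:
            finetune_attacker(round_wins)  # the self-tuning step
            successes.extend(round_wins)
    return successes
```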

Examples: See the appendix of the paper (arXiv:2405.18540) for detailed examples of adversarial suffixes generated by ADV-LLM against various LLMs, demonstrating the attack's effectiveness at eliciting harmful responses across a range of scenarios. Specific examples are omitted here due to their potential for misuse.

Impact: Successful exploitation of this vulnerability leads to bypassing LLM safety mechanisms. Attackers can elicit harmful, unethical, or illegal responses from affected LLMs, potentially causing harm to individuals or society. The high success rate and low computational cost of ADV-LLM present a significant threat to the security and reliability of LLMs.

Affected Systems: Various open-source and closed-source LLMs, including (but not limited to) Vicuna-7b-v1.5, Guanaco-7B, Mistral-7B-Instruct-v0.2, Llama-2-7b-chat, Llama-3-8B-Instruct, GPT-3.5, and GPT-4.

Mitigation Steps:

  • Improve safety mechanisms: Strengthen LLM safety training to better detect and refuse prompts carrying adversarial suffixes, taking into account the suffix-generation techniques used by ADV-LLM.
  • Refine prompt engineering: Develop more robust prompt engineering techniques to reduce susceptibility to adversarial suffixes.
  • Implement content filtering: Employ robust content filtering to detect and block harmful outputs even when the underlying model has been jailbroken (a minimal output-filtering sketch follows this list).
  • Regular security auditing: Conduct regular security audits of LLMs to identify and address vulnerabilities.
  • Diverse training data: Train LLMs on diverse, robust datasets that include adversarial examples to improve their resilience.
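
As one concrete illustration of the content-filtering mitigation above, the sketch below screens model outputs with a moderation classifier before they reach the user. The choice of the OpenAI moderation endpoint and the replacement message are illustrative assumptions; any comparable safety classifier can fill the same role.

```python
# Minimal output-filtering sketch: run every LLM response through a moderation
# classifier and block it if flagged, even if a jailbreak suffix bypassed the
# model's own safety training. Uses the OpenAI moderation endpoint as one example.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def filter_response(llm_output: str) -> str:
    """Return the LLM output only if the moderation classifier does not flag it."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=llm_output,
    ).results[0]
    if result.flagged:
        return "[Response blocked by output content filter.]"
    return llm_output
```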
