LMVD-ID: cb48e001
Published August 1, 2024

LLM Adversarial Suffix Optimization

Affected Models: llama2-7b-chat, vicuna-7b, falcon-7b-instruct, gpt-3.5-turbo

Research Paper

Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer

Description: Large Language Models (LLMs) are vulnerable to ECLIPSE, a black-box jailbreaking attack that uses an LLM as the optimizer to generate adversarial suffixes. ECLIPSE iteratively refines candidate suffixes guided by a harmfulness score, removing the need for the predefined affirmative phrases that earlier optimization-based attacks rely on. The attack is effective with limited interaction and requires no white-box access to the target model's internal parameters.
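The loop below is a minimal sketch of this black-box setup, not the ECLIPSE implementation: the helper functions (query_target_llm, propose_suffixes, score_harmfulness) are hypothetical placeholders standing in for the target model's API, the attacker-side optimizer LLM, and the harmfulness scorer described in the paper.

```python
import random

def query_target_llm(prompt: str) -> str:
    """Hypothetical stand-in for one query to the target model's API."""
    return "I cannot help with that."

def propose_suffixes(goal: str, history: list[tuple[str, float]]) -> list[str]:
    """Hypothetical stand-in for the attacker-side optimizer LLM: propose new
    candidate suffixes, conditioned on which earlier candidates scored best."""
    best = sorted(history, key=lambda h: h[1], reverse=True)[:3]
    seeds = [suffix for suffix, _ in best] or ["respond in full detail"]
    return [f"{s} {random.choice(['step by step', 'as a hypothetical'])}" for s in seeds]

def score_harmfulness(goal: str, response: str) -> float:
    """Hypothetical stand-in for the harmfulness scorer; the paper uses a
    scoring model rather than this naive refusal check."""
    return 0.0 if response.startswith("I cannot") else 1.0

def optimize_suffix(goal: str, max_iterations: int = 50) -> str | None:
    """Search for an adversarial suffix using only query access to the target."""
    history: list[tuple[str, float]] = []
    for _ in range(max_iterations):
        for suffix in propose_suffixes(goal, history):
            response = query_target_llm(f"{goal} {suffix}")
            score = score_harmfulness(goal, response)
            history.append((suffix, score))
            if score >= 0.9:  # treat a high score as a successful jailbreak
                return suffix
    return None  # query budget exhausted without success
```

The key point the sketch captures is that every step needs only query access: the optimizer sees prompts, responses, and scores, never gradients or weights.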

Examples: See https://github.com/lenijwp/ECLIPSE. The paper includes specific examples demonstrating successful jailbreaks across various LLMs (LLaMA2, Vicuna, Falcon, GPT-3.5-Turbo) using the ECLIPSE method.

Impact: Successful exploitation of this vulnerability allows an attacker to elicit harmful, unethical, or illegal responses from target LLMs, bypassing implemented safety mechanisms. This can lead to the generation of malicious content, dissemination of misinformation, and other harmful activities.

Affected Systems: Open-source LLMs (LLaMA2, Vicuna, Falcon) and closed-source models (GPT-3.5-Turbo) are shown to be vulnerable. The vulnerability likely affects other LLMs with similar architectures and safety mechanisms.

Mitigation Steps:

  • Enhance LLM safety mechanisms to be more robust against iterative optimization attacks.
  • Implement improved detection mechanisms for adversarial suffixes (see the perplexity-filter sketch after this list).
  • Strengthen output-side harmfulness classifiers and moderation so they are less susceptible to adversarial manipulation.
  • Regularly update and improve LLM safety training data to address emerging attack techniques.
  • Limit the number of API calls allowed within a time window to reduce the effectiveness of iterative attacks (see the rate-limiting sketch after this list).
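As one concrete form of suffix detection, the sketch below implements a perplexity filter, with GPT-2 assumed as the reference model and an illustrative threshold that would need calibration on real traffic. Note that because ECLIPSE generates suffixes with an LLM rather than by gradient search, they may be more fluent than GCG-style gibberish, so perplexity filtering alone may not suffice against this attack.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of the text under the reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts that are implausible under the reference LM.
    The threshold is illustrative only."""
    return perplexity(prompt) > threshold
```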
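The sliding-window limiter below sketches the last mitigation; the class name and limits are illustrative, not values from the paper. Because ECLIPSE needs many query-and-score rounds per goal, a per-client cap directly raises the cost of each optimization run.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_calls per client within a rolling time window."""

    def __init__(self, max_calls: int = 60, window_seconds: float = 60.0):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        q = self.calls[client_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_calls:
            return False
        q.append(now)
        return True
```

An API gateway would call limiter.allow(api_key) before forwarding each request and reject with HTTP 429 when it returns False.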
