LMVD-ID: ad1f4774
Published October 1, 2024

Gibberish-Suffix LLM Jailbreak

Affected Models: llama-2-7b-chat, gpt-4, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, vicuna-7b, vicuna-13b, guanaco-7b, guanaco-13b

Research Paper

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts

View Paper

Description: Large Language Models (LLMs) are vulnerable to jailbreaking via the addition of adversarial suffixes produced by generator models such as AmpleGCG-Plus. These suffixes, often consisting of gibberish or nonsensical text, cause the LLM to ignore its safety training and generate harmful or otherwise restricted outputs. The vulnerability stems from the LLM's inability to reliably identify and filter these adversarial suffixes, even when they lack semantic meaning. AmpleGCG-Plus significantly improves the success rate and efficiency of this attack compared to previous methods.

Examples: See the AmpleGCG-Plus paper's repository for datasets and examples of successful adversarial suffixes. Specific examples are numerous and context-dependent, varying according to the targeted LLM and the desired harmful output.

Impact: Successful exploitation allows attackers to bypass LLMs' safety mechanisms, leading to the generation of harmful content, including but not limited to: instructions for illegal activities, malicious code, hate speech, personal information disclosure, and disinformation campaigns. The impact is exacerbated by the ease and efficiency of generating these adversarial suffixes using AmpleGCG-Plus.

Affected Systems: Various LLMs, including but not limited to Llama-2, GPT-3.5-Turbo, GPT-4, GPT-4o, and models protected by circuit breaker defenses, are susceptible. The vulnerability is not limited to specific model architectures or sizes.

Mitigation Steps:

  • Improve LLM safety mechanisms to better detect and filter out-of-distribution (OOD) inputs, including gibberish or nonsensical sequences.
  • Develop more robust defenses against adversarial attacks, going beyond defenses that are easily bypassed by sampling many diverse adversarial suffixes.
  • Implement advanced input sanitization and filtering techniques to detect and block potentially harmful suffixes, considering the inherent challenges of identifying OOD adversarial inputs.
  • Regularly retrain LLMs on datasets that include a broader variety of adversarial examples, particularly suffixes produced by generators such as AmpleGCG-Plus.
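The first and third mitigation steps amount to flagging inputs whose tail is statistically out-of-distribution. A minimal sketch follows, using a character-bigram model as a crude stand-in for real LM-perplexity filtering; the reference corpus, the 60-character tail window, and the 12-bit threshold are all illustrative choices, not validated values:

```python
import math
from collections import Counter

# Tiny reference corpus of ordinary English; a real filter would score
# inputs with a language model's perplexity instead.
CORPUS = ("the quick brown fox jumps over the lazy dog. "
          "please explain how photosynthesis works in plants. "
          "write a short story about a friendly robot. ") * 20

BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
TOTAL = sum(BIGRAMS.values())
VOCAB = 128  # assume a printable-ASCII-sized character set

def avg_surprisal(text: str) -> float:
    """Mean bits per character bigram under an add-one-smoothed bigram
    model; gibberish scores near the maximum, log2(TOTAL + VOCAB**2)."""
    denom = TOTAL + VOCAB * VOCAB
    grams = list(zip(text, text[1:]))
    bits = sum(-math.log2((BIGRAMS[g] + 1) / denom) for g in grams)
    return bits / max(len(grams), 1)

def looks_adversarial(prompt: str, threshold: float = 12.0) -> bool:
    """Flag prompts whose tail is unusually surprising -- adversarial
    suffixes are appended, so only the last 60 characters are checked."""
    return avg_surprisal(prompt[-60:]) > threshold
```

Note the inherent limitation the third bullet acknowledges: an adaptive attacker can add a low-perplexity penalty to the suffix-generation objective, so statistical filtering raises the cost of the attack but does not eliminate it.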

© 2025 Promptfoo. All rights reserved.