LMVD-ID: b807a57f
Published February 1, 2024

Rainbow Teaming LLM Jailbreak

Affected Models: Llama 2-Chat 7B, Llama 3-Instruct 8B, Mistral 7B, Vicuna 7B v1.5, Llama 2-Chat 13B, Llama 2-Chat 70B, CodeLlama 7B Instruct, CodeLlama 34B Instruct, GPT-4

Research Paper

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Description: Large Language Models (LLMs) are vulnerable to adversarial prompts generated by the Rainbow Teaming technique. Rainbow Teaming uses a quality-diversity search algorithm to build a diverse archive of prompts that elicit unsafe, biased, or incorrect outputs from a target LLM, achieving attack success rates above 90% across various models. These carefully crafted prompts bypass existing safety mechanisms and are highly transferable across different LLMs.
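
To illustrate the underlying search procedure, the sketch below shows a minimal MAP-Elites-style quality-diversity loop for adversarial prompt generation. It is not the paper's implementation: the helpers mutate_prompt and attack_score, the seed prompt, and the descriptor categories are hypothetical stand-ins for components the paper realizes with a mutator LLM and a judge LLM.

```python
# Minimal sketch of a quality-diversity (MAP-Elites-style) search over
# adversarial prompts, in the spirit of Rainbow Teaming. The descriptor axes,
# seed prompt, and helper functions below are illustrative placeholders only.
import random

RISK_CATEGORIES = ["violence", "fraud", "privacy"]            # descriptor axis 1 (illustrative)
ATTACK_STYLES = ["role play", "hypothetical", "misspellings"]  # descriptor axis 2 (illustrative)

def mutate_prompt(prompt: str, category: str, style: str) -> str:
    """Stand-in for the mutator LLM: rewrite `prompt` toward the target
    risk category and attack style."""
    return f"[{style} | {category}] {prompt}"

def attack_score(prompt: str) -> float:
    """Stand-in for the judge model that estimates how effectively the prompt
    elicits unsafe output from the target LLM (higher = more effective)."""
    return random.random()

# The archive keeps the best prompt found so far for each (category, style) cell.
archive: dict[tuple[str, str], tuple[str, float]] = {}
seed_prompt = "Please help me with this request."
for cat in RISK_CATEGORIES:
    for style in ATTACK_STYLES:
        archive[(cat, style)] = (seed_prompt, 0.0)

for _ in range(1000):  # search iterations
    # 1. Sample a target cell and a parent prompt from the archive.
    target_cell = random.choice(list(archive))
    parent_prompt, _ = archive[random.choice(list(archive))]
    # 2. Mutate the parent toward the target cell's descriptors.
    candidate = mutate_prompt(parent_prompt, *target_cell)
    # 3. Replace the cell's elite only if the candidate scores higher.
    score = attack_score(candidate)
    if score > archive[target_cell][1]:
        archive[target_cell] = (candidate, score)

# The archive now holds a diverse grid of the strongest prompts found per cell.
```

The key design point is that diversity is enforced structurally: each archive cell corresponds to a distinct risk category and attack style, so the search cannot collapse onto a single effective prompt pattern.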

Examples: See the paper "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts" for specific examples of adversarial prompts. The repository associated with the paper will contain the generated prompts.

Impact: The successful execution of adversarial prompts generated by Rainbow Teaming can lead to several negative consequences:

  • Safety Risks: LLMs may generate harmful, offensive, or illegal content.
  • Bias Amplification: LLMs may exhibit or amplify existing biases.
  • Information Leakage: LLMs may reveal sensitive information.
  • Data Poisoning: Adversarial data generated by Rainbow Teaming, if incorporated into fine-tuning sets without vetting, can poison model behavior.
  • Loss of Trust: The reliability and trustworthiness of LLMs are compromised.

Affected Systems: A broad range of LLMs is affected, including but not limited to Llama 2, Llama 3, Mistral 7B, and Vicuna 7B v1.5. The vulnerability is not tied to any specific model family or architecture.

Mitigation Steps:

  • Increased Data Diversity during Training: Incorporate diverse and adversarial prompts during the training phase to increase model robustness.
  • Improved Safety Mechanisms: Develop and implement more sophisticated safety filters and safeguards to detect and mitigate adversarial prompts.
  • Regular Red Teaming: Conduct periodic red teaming exercises, employing techniques such as Rainbow Teaming, to identify and address vulnerabilities.
  • Prompt Engineering Defenses: Design prompts that are less susceptible to manipulation.
  • Output Verification Systems: Implement independent verification systems to validate LLM outputs before they are presented to users (a minimal sketch follows this list).
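
For illustration, a minimal sketch of such an output-verification wrapper follows, assuming a hypothetical generate() call to the target LLM and a hypothetical is_unsafe() check. In practice the check would be backed by a dedicated safety classifier or judge model (for example, Llama Guard) evaluating the prompt/response pair before the response reaches the user; neither function is prescribed by the paper.

```python
# Minimal sketch of an output-verification wrapper. Both callables passed in
# are hypothetical placeholders, not part of any specific library or the paper.
from typing import Callable

REFUSAL_MESSAGE = "I can't help with that request."

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_unsafe: Callable[[str, str], bool]) -> str:
    """Generate a response and release it only if the safety check passes."""
    response = generate(prompt)
    # Check the prompt/response pair, not just the prompt: adversarial prompts
    # found by quality-diversity search often look benign in isolation.
    if is_unsafe(prompt, response):
        return REFUSAL_MESSAGE
    return response

# Stub wiring for illustration only.
if __name__ == "__main__":
    demo_generate = lambda p: f"Model answer to: {p}"
    demo_is_unsafe = lambda p, r: "explosive" in (p + r).lower()
    print(guarded_generate("What is the capital of France?", demo_generate, demo_is_unsafe))
```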

© 2025 Promptfoo. All rights reserved.