LMVD-ID: 1e21c463
Published August 1, 2024

Synthetic LLM Jailbreak Dataset

Affected Models:gpt-4o, gpt-3.5-turbo, llama-3, gpt-4, claude-3.5, mistral, llama-3-8b-instruct, llama-3-70b-instruct, mistral-7b-instruct-v0.1, gpt-4-0125-preview, llama-2-7b-chat-hf, llama-2-70b-chat-hf, gemma-7b-it

Research Paper

Sage-rt: Synthetic alignment data generation for safety evaluation and red teaming

View Paper

Description: Large Language Models (LLMs) are vulnerable to jailbreaking attacks leveraging synthetically generated prompts. A novel pipeline, SAGE-RT, generates a diverse dataset of 51,000 prompt-response pairs designed to exploit LLMs' vulnerabilities across various categories of harmfulness. These prompts successfully jailbreak state-of-the-art LLMs in a significant percentage of tested sub-categories, including 100% of macro-categories for certain models like GPT-4 and GPT-3.5-turbo. The vulnerability stems from the LLMs' inability to consistently resist these synthetically crafted adversarial prompts, leading to the generation of unsafe or unethical content.

Examples: See the paper arXiv:2405.18540 for specific examples of SAGE-RT generated prompts. The paper includes examples demonstrating 100% attack success rate across macro-categories for certain models, and high success rates in sub-categories.

Impact: Successful exploitation of this vulnerability allows attackers to bypass safety mechanisms implemented in LLMs, leading to the generation of harmful, unethical, or illegal content. This includes but is not limited to: generation of instructions for illegal activities, dissemination of hate speech, creation of malicious code, and the spread of misinformation. The impact can range from reputational damage for the LLM provider to real-world harm caused by the generated content.

Affected Systems: Large language models (LLMs) from various providers, including but not limited to, those evaluated in the SAGE-RT paper (e.g., GPT-4, GPT-3.5-turbo, Llama-3, Mistral). The vulnerability is likely present across a broad range of LLMs due to the underlying architectural similarities and training paradigms.

Mitigation Steps:

  • Improve LLM training data to include more diverse and nuanced examples of adversarial prompts.
  • Develop more robust safety mechanisms that effectively detect and mitigate synthetically generated attacks such as those produced by SAGE-RT.
  • Implement stronger filtering and monitoring systems to detect and prevent the dissemination of harmful content generated by LLMs.
  • Regularly conduct red-teaming exercises using diverse and adversarial prompt sets such as SAGE-RT or similar methods.

© 2025 Promptfoo. All rights reserved.