LMVD-ID: ce4e3b90
Published August 1, 2024

Ensemble Jailbreak Technique

Affected Models: vicuna-7b, vicuna-13b, llama-2-7b, llama-2-13b, gpt-3.5, gpt-4

Research Paper

EnJa: Ensemble Jailbreak on Large Language Models

Description: The Ensemble Jailbreak (EnJa) attack exploits vulnerabilities in the safety mechanisms of large language models (LLMs) by combining prompt-level and token-level attacks. EnJa conceals malicious instructions within seemingly benign prompts, then uses a gradient-based method to optimize adversarial suffixes, significantly increasing the likelihood of bypassing safety filters and generating harmful content. The attack leverages a connector template to seamlessly integrate the concealed prompt and adversarial suffix, maintaining context and coherence.
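
To make the three components concrete, the sketch below shows, at the level of string assembly only, how the prompt-level concealment, the connector template, and the token-level adversarial suffix described above would be combined into a single query. Every string and name here is an abstract placeholder invented for illustration, not a template or suffix from the paper, and the gradient-based suffix optimization step is omitted entirely.

```python
# Structural schematic only: every string below is an abstract placeholder,
# not an actual concealment template, connector, or optimized suffix.

def assemble_enja_style_prompt(concealed_prompt: str, adversarial_suffix: str,
                               connector: str = "{prompt}\n{suffix}") -> str:
    """Join the prompt-level component and the token-level suffix via a
    connector template so the final query reads as one coherent request."""
    return connector.format(prompt=concealed_prompt, suffix=adversarial_suffix)

# In the real attack the suffix is produced by gradient-based optimization
# against a target model (not shown here); this is just a placeholder string.
query = assemble_enja_style_prompt("<instruction hidden in benign framing>",
                                   "<gradient-optimized suffix tokens>")
print(query)
```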

Examples: See arXiv:2405.18540 for specific examples of malicious prompts and their successful exploitation against various LLMs, including open-source models (Vicuna, Llama-2) and closed-source models (GPT-3.5, GPT-4).

Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLM safety protocols, potentially leading to the generation of harmful content including but not limited to: illegal activity instructions, hate speech, misinformation, and personal information disclosure. The paper reports high attack success rates against both the open-source and closed-source models listed above, demonstrating a significant security risk.

Affected Systems: All LLMs susceptible to prompt injection and adversarial attacks are potentially affected. Specifically, the paper demonstrates successful attacks against Vicuna-7B, Vicuna-13B, LLaMA-2-7B, LLaMA-2-13B, GPT-3.5, and GPT-4.

Mitigation Steps:

  • Improved prompt filtering: Develop more robust detection and filtering of malicious prompts that considers both the semantic content and the presence of adversarial suffixes (a perplexity-based screening sketch follows this list).
  • Enhanced safety training: Implement more sophisticated training techniques to better resist adversarial attacks and improve the models' ability to identify and reject harmful requests.
  • Input sanitization: Pre-process and normalize user inputs before they reach the LLM to reduce the effectiveness of injection techniques.
  • Output monitoring and response validation: Monitor generations in real time to detect and block harmful content, and validate responses after generation before they are returned to the user.
  • Multi-layered defense: Combine defenses against different attack vectors; because EnJa chains prompt-level and token-level techniques, no single filter is likely to be sufficient (a layered-pipeline sketch follows this list).
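
One way to realize the filtering and sanitization items above is a perplexity screen on incoming prompts: gradient-optimized suffixes tend to be high-perplexity token sequences, so a small reference language model can flag them. The sketch below is a minimal illustration, not the defense evaluated in the paper; the GPT-2 reference model, the 20-token tail window, and the threshold of 1000 are all assumed values that would need tuning for a real deployment.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small reference model used only to score the perplexity of incoming text.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def flag_adversarial_suffix(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose tail looks like high-perplexity token soup,
    a common signature of gradient-optimized suffixes."""
    tail = " ".join(prompt.split()[-20:])  # assumed window size
    return perplexity(tail) > threshold
```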
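The multi-layered defense item can be expressed as a simple guard pipeline that runs every input check before the model sees the prompt and every output check before the response is returned. The callback names below (llm_generate, the check functions, the refusal message) are hypothetical stand-ins for whatever model call and moderation tooling a deployment actually uses.

```python
from typing import Callable, Iterable

def guarded_generate(
    user_prompt: str,
    llm_generate: Callable[[str], str],              # hypothetical model call
    input_checks: Iterable[Callable[[str], bool]],   # each returns True to block
    output_checks: Iterable[Callable[[str], bool]],  # each returns True to block
    refusal: str = "Request declined by policy.",
) -> str:
    # Layer 1: screen the prompt (e.g., perplexity filter, semantic checks).
    if any(check(user_prompt) for check in input_checks):
        return refusal
    # Layer 2: generate only if every input check passes.
    response = llm_generate(user_prompt)
    # Layer 3: validate the output (e.g., moderation classifier) before returning it.
    if any(check(response) for check in output_checks):
        return refusal
    return response

# Example wiring with the (assumed) perplexity filter from the previous sketch:
# guarded_generate(prompt, my_model_call, [flag_adversarial_suffix], [my_moderation_check])
```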
