Ensemble Jailbreak Technique
Research Paper: "EnJa: Ensemble Jailbreak on Large Language Models" (arXiv:2405.18540)
Description: The Ensemble Jailbreak (EnJa) attack exploits weaknesses in the safety mechanisms of large language models (LLMs) by combining prompt-level and token-level attacks. EnJa first conceals the malicious instruction within a seemingly benign prompt, then uses a gradient-based method to optimize an adversarial suffix, significantly increasing the likelihood of bypassing safety filters and eliciting harmful content. A connector template integrates the concealed prompt and the adversarial suffix so that the combined input remains coherent and in context.
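At a structural level, the assembled input has three parts. The sketch below is purely illustrative: the placeholder strings and function name are ours, not the paper's, no real attack content is included, and the suffix-optimization stage is only stubbed as a comment; see arXiv:2405.18540 for the actual method.

```python
# Illustrative structure of an EnJa-style composite prompt. All values are
# placeholders; the paper's real connector template and optimized suffixes
# are described in arXiv:2405.18540, not reproduced here.
CONCEALED_PROMPT = "<instruction hidden inside a benign-looking scenario>"
CONNECTOR = "<template text that bridges the two parts coherently>"
ADV_SUFFIX = "<token sequence produced by gradient-based optimization>"

def assemble_enja_prompt() -> str:
    # Prompt-level stage (concealment) + token-level stage (adversarial
    # suffix), joined by the connector so the input stays coherent.
    return f"{CONCEALED_PROMPT} {CONNECTOR} {ADV_SUFFIX}"
```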
Examples: See arXiv:2405.18540 for specific examples of malicious prompts and their successful exploitation against various LLMs, including open-source models (Vicuna, Llama-2) and closed-source models (GPT-3.5, GPT-4).
Impact: Successful exploitation allows attackers to circumvent LLM safety protocols, potentially leading to the generation of harmful content including, but not limited to, instructions for illegal activity, hate speech, misinformation, and personal information disclosure. The paper reports high attack success rates across a range of LLMs, demonstrating a significant security risk.
Affected Systems: All LLMs susceptible to prompt injection and adversarial attacks are potentially affected. Specifically, the paper demonstrates successful attacks against Vicuna-7B, Vicuna-13B, Llama-2-7B, Llama-2-13B, GPT-3.5, and GPT-4.
Mitigation Steps:
- Improved prompt filtering: Develop more robust methods to detect and filter malicious prompts, considering both the semantic content and the presence of adversarial suffixes (see the perplexity-screening sketch after this list).
- Enhanced safety training: Implement more sophisticated training techniques to better resist adversarial attacks and improve the models' ability to identify and reject harmful requests.
- Input sanitization: Sanitize and pre-process user inputs before they reach the LLM to reduce the effectiveness of injection techniques.
- Output monitoring and response validation: Implement real-time monitoring to detect and block harmful content as it is generated, complemented by post-generation analysis and validation of responses (see the response-validation sketch after this list).
- Multi-layered defense: Implement multiple layers of defense against various attack vectors, combining techniques to improve the overall security of language models.
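One practical screening signal for EnJa's token-level component: gradient-optimized suffixes tend to be high-perplexity token sequences, so perplexity filtering (a published defense against GCG-style suffixes, not a technique from the EnJa paper itself) can flag them cheaply. A minimal sketch, assuming GPT-2 as the scoring model; the threshold is illustrative and should be calibrated on benign traffic before deployment.

```python
# Minimal perplexity-based pre-filter for adversarial suffixes.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Mean token perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    # Illustrative threshold: optimized suffixes are typically gibberish
    # and score far above natural-language inputs.
    return perplexity(prompt) > threshold
```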
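For the response-validation step, a simple gate can run every generation through a harm classifier before delivery. A minimal sketch, assuming the OpenAI moderation endpoint as the classifier; any moderation model could be substituted, and the function name and fallback message are illustrative.

```python
# Minimal post-generation response validation gate.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def response_is_safe(generated_text: str) -> bool:
    """Return False if the moderation model flags the LLM's output."""
    result = client.moderations.create(input=generated_text)
    return not result.results[0].flagged

# Usage: gate delivery of model output.
# if not response_is_safe(llm_output):
#     llm_output = "I can't help with that."
```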