Composable Jailbreak Synthesis
Research Paper
h4rm3l: A dynamic benchmark of composable jailbreak attacks for llm safety assessment
View PaperDescription: Large Language Models (LLMs) are vulnerable to composable jailbreak attacks, allowing bypass of safety filters through the chaining of multiple prompt transformations. The vulnerability arises from the ability to combine seemingly innocuous transformations to create effective attacks that achieve high attack success rates (ASR). These attacks can be synthesized automatically, allowing for the creation of novel and highly effective jailbreaks. Specifically, using the h4rm3l
framework, attacks are composed using parameterized string transformation primitives, which can leverage auxiliary LLMs to further enhance effectiveness. The composition of multiple primitives increases the attack's success rate.
Examples: See arXiv:2405.18540 for specific examples of composable jailbreak attacks and their implementations within the h4rm3l
framework. These examples demonstrate the synthesis of attacks exceeding 80% ASR against state-of-the-art LLMs such as Claude-3-Sonnet and GPT-4. Examples include combining base64 encoding with refusal suppression and stylistic modifications.
Impact: Successful exploitation of this vulnerability can lead to the generation of harmful content by LLMs, including but not limited to: personally identifiable information, copyrighted materials, toxic content, assistance with crimes, misinformation, disinformation, and hate speech. The ability to automatically synthesize these attacks amplifies the impact and necessitates robust mitigation strategies.
Affected Systems: All LLMs susceptible to prompt injection and those employing safety filters based on static or templated attack detection are affected. Specific LLMs demonstrated to be vulnerable in the research include, but are not limited to, GPT-3.5, GPT-4, Claude-3-Haiku, Claude-3-Sonnet, Llama-3-8B, and Llama-3-70B.
Mitigation Steps:
- Implement robust, dynamic safety filters capable of detecting and mitigating novel and composed attacks.
- Develop defenses against prompt injection techniques and prioritize a layered approach to security.
- Continuously monitor and update safety mechanisms to adapt to emerging threats and newly synthesized attacks.
- Utilize techniques such as perplexity checks and rephrasing to enhance prompt filtering. Consider combining multiple defense mechanisms.
- Thoroughly evaluate LLMs for vulnerabilities using adversarial techniques such as those outlined in the referenced research.
© 2025 Promptfoo. All rights reserved.