Automating Stealthy LLM Jailbreaks
Research Paper
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Description: Large Language Models (LLMs) that have undergone alignment remain vulnerable to "jailbreak" attacks. AutoDAN automatically generates semantically meaningful jailbreak prompts that bypass the safety features of aligned LLMs and elicit harmful outputs. Unlike earlier methods, which produce nonsensical adversarial suffixes that are easily flagged by perplexity checks, AutoDAN's prompts read as fluent text while still exploiting weaknesses in the model's alignment, causing it to generate responses that violate its intended safety constraints.
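The perplexity distinction can be made concrete with a small filter. The sketch below scores a prompt's per-token perplexity under GPT-2 via the Hugging Face transformers library; the threshold value is an illustrative assumption rather than a figure from the paper. Gibberish suffixes from gradient-based attacks tend to score far above natural text and get flagged, while AutoDAN's fluent prompts pass, which is why such a check alone is insufficient.

```python
# Minimal perplexity-filter sketch (illustrative; the threshold is an assumed value).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Compute the per-token perplexity of `text` under GPT-2."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels yields the average cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

PPL_THRESHOLD = 500.0  # Assumed cutoff; tune on benign traffic in practice.

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the cutoff.

    Nonsensical adversarial suffixes are typically flagged, but fluent
    AutoDAN-style prompts are not, so this check cannot stand alone.
    """
    return perplexity(prompt) > PPL_THRESHOLD
```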
Examples: See the repository at https://github.com/SheltonLiu-N/AutoDAN. Generated prompts are crafted to elicit instructions for harmful activities despite the LLM's safety training; the specific prompts and outputs vary with the target LLM.
Impact: Successful exploitation allows adversaries to bypass safety mechanisms implemented in aligned LLMs, leading to the generation of harmful, discriminatory, violent, or otherwise undesirable outputs. This could include generating instructions for illegal activities, spreading misinformation, or creating offensive content. The attacker need not have access to the LLM's internal parameters.
Affected Systems: Aligned Large Language Models (LLMs) that use reinforcement learning from human feedback (RLHF) or other alignment techniques, including but not limited to open-source models such as Vicuna, Guanaco, and Llama 2, and commercial models such as GPT-3.5-turbo and GPT-4 (where the demonstrated vulnerability was reduced but not eliminated as of the date of this CVE). The vulnerability affects models susceptible to adversarial prompt engineering; the extent of impact may vary depending on the specific LLM's architecture and training data.
Mitigation Steps:
- Enhance detection mechanisms beyond simple perplexity checks. Develop more robust methods for detecting semantically meaningful yet malicious prompts.
- Improve the robustness of LLM safety features through advanced adversarial training techniques that consider semantically meaningful attacks.
- Implement stricter input sanitization techniques to identify and neutralize potentially harmful prompts.
- Regularly update and refine the LLM's alignment training data to address newly discovered vulnerabilities.
- Employ multi-layered safety protocols, including multiple independent verification steps before providing responses to user queries; a rough control-flow sketch of this layering follows this list.
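As a rough illustration of the layered approach above, the sketch below chains several independent checks before a prompt reaches the model. All three hooks are hypothetical placeholders: `perplexity_flag` could be backed by the perplexity filter sketched earlier, while `moderation_flag` and `judge_flag` stand in for whatever moderation classifier and independent judge model a deployment already runs. The sketch shows control flow under those assumptions, not a specific vendor API.

```python
# Layered pre-screening sketch; the three check functions are hypothetical
# hooks to be wired up to a deployment's own classifiers and judge model.
from dataclasses import dataclass

@dataclass
class ScreeningResult:
    allowed: bool
    reason: str

def perplexity_flag(prompt: str) -> bool:
    """Layer 1: statistical filter; catches gibberish suffixes, not fluent AutoDAN prompts."""
    raise NotImplementedError

def moderation_flag(prompt: str) -> bool:
    """Layer 2: content-moderation classifier applied to the raw prompt text."""
    raise NotImplementedError

def judge_flag(prompt: str) -> bool:
    """Layer 3: separate judge model asked whether the request seeks disallowed output."""
    raise NotImplementedError

def screen_prompt(prompt: str) -> ScreeningResult:
    """Run the layers in order and stop at the first one that flags the prompt."""
    for check, reason in (
        (perplexity_flag, "high perplexity"),
        (moderation_flag, "moderation flag"),
        (judge_flag, "judge model flag"),
    ):
        if check(prompt):
            return ScreeningResult(allowed=False, reason=reason)
    return ScreeningResult(allowed=True, reason="passed all layers")
```

Because each layer is independent, a prompt that evades one check (for example, a fluent AutoDAN prompt slipping past the perplexity filter) can still be caught by a later one.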