AutoDAN: Interpretable LLM Jailbreak
Research Paper
AutoDAN: Automatic and interpretable adversarial attacks on large language models
Description: AutoDAN is an interpretable gradient-based adversarial attack that generates readable prompts to bypass perplexity filters and jailbreak LLMs. The attack crafts prompts that elicit harmful behaviors while remaining readable enough to avoid detection by existing perplexity-based defenses. It achieves this through a left-to-right, token-by-token generation process that jointly optimizes for jailbreak success and prompt readability.
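The dual-objective generation described above can be sketched as a weighted combination of a readability term (the model's own log-probability for each candidate token) and an adversarial objective term. This is a simplified, hypothetical illustration with toy scores, not the paper's actual implementation; the function name and weighting scheme are assumptions:

```python
def select_next_token(readability_logprobs, attack_scores, weight=1.0):
    """Pick the next token by jointly scoring readability and the attack objective.

    readability_logprobs: dict token -> log p(token | prefix) under the target LM
    attack_scores: dict token -> score toward eliciting the harmful target output
    weight: trade-off knob; higher values favor the attack objective over fluency
    """
    combined = {
        tok: readability_logprobs[tok] + weight * attack_scores.get(tok, float("-inf"))
        for tok in readability_logprobs
    }
    return max(combined, key=combined.get)

# Toy example: "sure" is both fluent and advances the attack objective,
# while the gibberish token "xq7" is penalized by the readability term.
readability = {"sure": -0.5, "the": -0.2, "xq7": -9.0}
attack = {"sure": -1.0, "the": -5.0, "xq7": -0.8}
print(select_next_token(readability, attack))  # -> "sure"
```

Because the readability term always participates in the score, tokens that would spike the prompt's perplexity are filtered out during generation itself, which is what lets the resulting prompts slip past perplexity-based defenses.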
Examples: See paper Appendix D.5 and project website for examples of AutoDAN-generated prompts; specific examples are too lengthy to reproduce here.
Impact: Successful exploitation of this vulnerability allows attackers to bypass safety mechanisms in LLMs, leading to the generation of harmful content (e.g., toxic, racist, illegal, or privacy-breaching content), information leakage (e.g., system prompts), and other unintended behaviors.
Affected Systems: Large Language Models (LLMs) vulnerable to gradient-based adversarial attacks, including but not limited to Vicuna-7B, Vicuna-13B, Guanaco-7B, Pythia-12B, GPT-3.5-turbo, and GPT-4. The vulnerability is not limited to specific models and may affect other LLMs with similar architectures or training methodologies.
Mitigation Steps:
- Improve Perplexity Filters: Develop more robust perplexity filters that are less susceptible to evasion by readable adversarial prompts. Consider incorporating additional features beyond simple perplexity scores.
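For context, the kind of perplexity filter AutoDAN is designed to evade can be sketched as below. The threshold and per-token log-probabilities are illustrative assumptions; a real deployment would score the prompt with the defended model itself and likely combine this with the additional features mentioned above:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(negative mean log-probability) over the prompt's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_filter(token_logprobs, threshold=1000.0):
    """Reject prompts whose perplexity exceeds the threshold (likely gibberish)."""
    return perplexity(token_logprobs) <= threshold

# A readable prompt has high per-token probability, hence low perplexity...
readable = [-2.0, -1.5, -2.5, -1.0]
# ...while a gibberish adversarial suffix (as in GCG-style attacks) does not.
gibberish = [-9.0, -11.0, -10.5, -12.0]
print(passes_filter(readable), passes_filter(gibberish))  # True False
```

This is exactly the gap AutoDAN exploits: because its prompts are generated under a readability objective, their perplexity looks like ordinary text, so a score-only filter like this one passes them.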
- Enhance Safety Training: Improve the robustness of safety training data and methods to better generalize to unforeseen scenarios and attack strategies employed by AutoDAN. The training should focus on mitigating the specific strategies identified in the AutoDAN attack (e.g., "shifting domains," "detailizing instructions").
- Implement Multi-Stage Defenses: Utilize multiple layers of defense mechanisms beyond perplexity checks, particularly involving post-processing filtering and human-in-the-loop verification of generated responses.
- Regular Updates: Continuously monitor for newly emerging attack techniques and update defense mechanisms accordingly. Regularly audit models, including white-box audits using techniques similar to AutoDAN, to identify potential weaknesses.
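The layered approach in the steps above can be sketched as a pipeline in which a request must clear every stage before its response is released. The stage functions here are hypothetical placeholders (string checks standing in for a perplexity filter, an output classifier, and a human-review trigger), not a real defense implementation:

```python
def perplexity_check(prompt):
    # Placeholder for an input-side filter (e.g. a perplexity threshold).
    return "adversarial-suffix" not in prompt

def output_filter(response):
    # Placeholder for post-processing the model's response for policy violations.
    return "harmful" not in response.lower()

def needs_human_review(prompt, response):
    # Placeholder: escalate sensitive cases (e.g. system-prompt probing)
    # to a human instead of auto-approving.
    return "system prompt" in prompt.lower()

def defend(prompt, generate):
    """Run a request through stacked defenses; any failing stage blocks it."""
    if not perplexity_check(prompt):
        return "blocked: input filter"
    response = generate(prompt)
    if not output_filter(response):
        return "blocked: output filter"
    if needs_human_review(prompt, response):
        return "queued: human review"
    return response

print(defend("What is the capital of France?", lambda p: "Paris"))  # -> "Paris"
```

The design point is that no single stage needs to be perfect: a readable AutoDAN prompt may pass the input filter, but the output filter and review hook still get a chance to stop the harmful response.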
© 2025 Promptfoo. All rights reserved.