Distilled Jailbreak Prompt Generator
Research Paper
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Description: The Knowledge-Distilled Attacker (KDA) is a model that generates jailbreak prompts for large language models (LLMs), bypassing their safety mechanisms and eliciting harmful, inappropriate, or misaligned content. KDA's effectiveness stems from its ability to efficiently generate diverse, coherent attack prompts, surpassing existing methods in both attack success rate and speed. The underlying vulnerability is the LLMs' insufficient defense against the diverse prompt-generation strategies that KDA learns and employs.
Examples: See arXiv:2405.18540 for specific examples of prompts generated by KDA and their effects on various LLMs.
Impact: Successful exploitation of this vulnerability can lead to the generation of harmful content, including but not limited to: hate speech, misinformation, instructions for illegal activities, and personal attacks. This compromises the safety and reliability of the affected LLMs and could have severe consequences depending on the application context.
Affected Systems: A wide range of open-source and commercial LLMs are susceptible, including but not limited to: Llama-2-7B-Chat, Llama-2-13B-Chat, Vicuna, Qwen, Mistral, GPT-3.5-Turbo, GPT-4-Turbo, and Claude 2.1. The specific impact may vary across models depending on their safety mechanisms.
Mitigation Steps:
- Strengthen LLM safety mechanisms to better resist diverse prompt styles and techniques.
- Implement robust prompt filtering and content moderation systems.
- Develop and deploy advanced detection techniques to identify and block malicious prompts generated by KDA-like approaches.
- Continuously monitor and update safety measures to adapt to evolving attack strategies.
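As an illustration of the prompt-filtering step above, the sketch below pre-screens user inputs against a deny-list of jailbreak-style phrases before they reach the model. The patterns and function name are hypothetical examples, not from the paper; a real deployment would combine such heuristics with a trained moderation classifier, since a static deny-list alone cannot keep pace with the diverse prompts a KDA-style attacker generates.

```python
import re

# Hypothetical deny-list of jailbreak-style phrasings; illustrative only.
# A production filter would pair this with a learned moderation model.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bignore (all|any|previous) (instructions|rules)\b", re.I),
    re.compile(r"\bjailbreak\b", re.I),
    re.compile(r"\bdeveloper mode\b", re.I),
    re.compile(r"\bpretend (you have|there are) no (rules|restrictions)\b", re.I),
]

def screen_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Return (allowed, matched_patterns) for a user prompt.

    Any matched pattern flags the prompt for blocking or human review;
    an empty match list means the prompt passes this pre-filter.
    """
    matches = [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(prompt)]
    return (len(matches) == 0, matches)
```

Because distillation-based attackers optimize for fluency and diversity, a filter like this should be treated as only the outermost layer, backed by the detection and monitoring measures listed above.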
© 2025 Promptfoo. All rights reserved.