DRL-Guided LLM Jailbreak

Description: A deep reinforcement learning (DRL) based attack, termed RLbreaker, demonstrates the ability to more efficiently generate jailbreaking prompts for large language models (LLMs) than existing methods. The attack leverages a DRL agent to guide the search for effective prompt structures, bypassing safety mechanisms and eliciting undesirable responses to harmful questions. The effectiveness stems from the DRL agent's ability to strategically select prompt mutators, rather than relying on random search techniques.

Examples: See https://github.com/ucsb-mlsec/RLbreaker

Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLM safety measures, potentially leading to the generation of harmful, unethical, or illegal content. The increased efficiency of RLbreaker compared to previous methods makes it a more significant threat.

Affected Systems: The vulnerability affects a wide range of LLMs, including (but not limited to) Llama2-7b-chat, Llama2-70b-chat, Vicuna-7b, Vicuna-13b, Mixtral-8x7B-Instruct, and GPT-3.5-turbo. The attack's transferability across different LLMs further broadens its impact.

Mitigation Steps:

Design and implement more robust prompt filtering and detection mechanisms that are resistant to prompt engineering techniques employed by RLbreaker.
Improve LLM training data and methodologies to increase their resilience to adversarial prompts generated by DRL-based attacks.
Develop and integrate advanced detection models capable of distinguishing between legitimate and malicious prompts.
Regularly update and enhance existing safety mechanisms to counter emerging adversarial techniques.

DRL-Guided LLM Jailbreak

Research Paper