RL-Powered LLM Jailbreak
Research Paper
RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
Description: RL-JACK is a reinforcement learning-based black-box attack that generates jailbreaking prompts to bypass safety mechanisms in LLMs. The attack uses a deep reinforcement learning agent to iteratively refine prompts, maximizing the likelihood of eliciting harmful responses to unethical questions. Its effectiveness stems from a novel reward function that provides continuous feedback based on cosine similarity to a reference answer from an unaligned LLM, and from an action space that strategically modifies prompts using diverse techniques (e.g., creating role-playing scenarios).
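To make these two design choices concrete, the sketch below illustrates a discrete action space of prompt-rewriting strategies and a dense reward computed as cosine similarity to an unaligned reference answer. All names are illustrative assumptions rather than the authors' code, and the bag-of-words `embed` function is a toy stand-in for whatever encoder the paper's implementation actually uses.

```python
# Illustrative sketch only (assumed names, not the RL-JACK authors' code):
# a discrete action space of prompt-rewriting strategies and a dense reward
# based on cosine similarity to a reference answer from an unaligned LLM.
from collections import Counter
import math

# Each action is a prompt-modification strategy the RL agent can apply to the
# current jailbreaking prompt at a given step.
ACTIONS = [
    "wrap the question in a role-playing scenario",
    "add a fictional or hypothetical framing",
    "paraphrase the harmful question",
    "ask for a step-by-step, detailed answer",
    "split or obfuscate the sensitive payload",
]

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real attack would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dense_reward(victim_response: str, unaligned_reference: str) -> float:
    """Continuous feedback: how closely the victim LLM's response matches a
    reference answer produced by an unaligned LLM for the same question."""
    return cosine_similarity(embed(victim_response), embed(unaligned_reference))
```

Because the reward is continuous rather than a binary success/failure signal, the agent still receives useful feedback when a prompt only partially bypasses the safety alignment, which is what makes the iterative refinement loop effective.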
Examples: See the RL-JACK paper for specific examples of generated jailbreaking prompts and their effectiveness against various LLMs. The paper includes detailed examples against models like Llama2-70b and GPT-3.5.
Impact: Successful exploitation allows attackers to circumvent an LLM's safety features and elicit harmful, unethical, or illegal content. This can lead to the spread of misinformation, assistance in creating malicious software, or other damaging consequences.
Affected Systems: A wide range of LLMs are affected, including both open-source models (e.g., Llama2, Vicuna, Falcon) and commercial models (e.g., GPT-3.5). The vulnerability is demonstrated against multiple LLMs with varying levels of safety alignment.
Mitigation Steps:
- Implement robust prompt filtering and input sanitization mechanisms that detect and block potentially malicious prompts (a minimal filtering sketch follows this list).
- Develop and deploy more sophisticated safety alignment techniques that are resilient to iterative prompt manipulation.
- Incorporate adversarial training methods to enhance the model's resilience to jailbreaking attacks during the training phase.
- Continuously monitor and update safety mechanisms based on emerging attack techniques.
- Investigate alternative reward methods for safety training that are more robust to prompt manipulation.
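As a starting point for the first mitigation above, the sketch below shows a heuristic prompt filter. The regex patterns and function names are illustrative assumptions, not a vetted defense; a production filter would combine heuristics like these with trained classifiers and policy checks.

```python
# Minimal sketch of prompt filtering / input sanitization (illustrative only).
import re
from typing import Optional

# Heuristic patterns associated with common jailbreak framings
# (instruction-override phrases, role-play requests, "no restrictions" framings).
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"\bpretend (you are|to be)\b",
    r"\brole[- ]?play(ing)?\b",
    r"\bwithout (any )?(ethical|safety) (guidelines|restrictions)\b",
    r"\bhypothetical (scenario|story)\b",
]

def looks_like_jailbreak(prompt: str) -> bool:
    """Return True if the prompt matches any heuristic jailbreak pattern."""
    lowered = prompt.lower()
    return any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS)

def sanitize_or_block(prompt: str) -> Optional[str]:
    """Block suspicious prompts (return None); pass benign prompts through."""
    if looks_like_jailbreak(prompt):
        return None  # route to a refusal or human-review path instead
    return prompt
```

Static pattern lists like this are exactly what an RL agent can learn to route around, so filtering should be treated as one layer alongside the adversarial-training and monitoring steps above, not as a complete defense.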