Automated LLM Jailbreak

Description: Large Language Models (LLMs) are vulnerable to prompt-based jailbreaks, allowing adversaries to bypass safety guardrails and elicit undesirable outputs. The Prompt Automatic Iterative Refinement (PAIR) algorithm efficiently generates these jailbreaks using a limited number of black-box queries to the target LLM. The vulnerability stems from the LLM's inability to robustly handle adversarial prompts crafted through iterative refinement, even without white-box access to its internal mechanisms.

Examples: See the accompanying paper for specific examples of prompt-level jailbreaks generated by PAIR against various LLMs (GPT-3.5/4, Vicuna, Gemini, etc.). Examples include prompts designed to elicit instructions for building bombs, creating phishing emails, and generating hate speech. The paper provides detailed examples of prompt and response pairs that successfully jailbreak the target LLMs.

Impact: Successful exploitation allows adversaries to circumvent the safety mechanisms of LLMs, leading to the generation of harmful, illegal, biased, or otherwise undesirable content. This can be used for purposes including disinformation campaigns, hate speech generation, and the creation of illegal instructions.

Affected Systems: All LLMs susceptible to prompt-based jailbreaks, including (but not limited to) GPT-3.5/4, Vicuna, Gemini, Llama-2, and Claude.

Mitigation Steps:

Improved Prompt Filtering: Implement more sophisticated prompt filtering techniques that can identify and block adversarial prompts, including those generated through iterative refinement strategies. This might involve incorporating techniques designed to detect semantic similarities to known adversarial prompts.
Reinforcement Learning from Human Feedback (RLHF) Improvements: Enhance RLHF training processes to better handle adversarial scenarios and improve the model's resilience against manipulative prompts. This may involve training on a wider range of adversarial prompts.
Defense Models: Integrate defense models that identify and neutralize adversarial prompts before they reach the core LLM. This could involve using additional LLMs to evaluate the potential harm of an input prompt before processing.
Regular Red Teaming: Conduct regular red teaming exercises using techniques like PAIR to proactively identify and address vulnerabilities in LLM safety mechanisms.

Automated LLM Jailbreak

Research Paper