Universal Black-Box LLM Jailbreak

Description: A universal black-box jailbreaking vulnerability exists in Large Language Models (LLMs) due to their susceptibility to adversarial prompts crafted using a genetic algorithm (GA). The GA optimizes a universal adversarial prompt suffix that, when appended to various user inputs, causes the LLM to generate unintended and potentially harmful outputs, bypassing safety mechanisms. This attack requires no knowledge of the LLM's internal architecture or parameters.

Examples: See arXiv:2309.01446 for specific examples of benign prompts that, after appending the GA-generated adversarial suffix, elicit harmful responses from the LLaMA 2-7b-chat and Vicuna-7b LLMs. Examples include prompts requesting instructions for illegal activities that were initially refused, but successfully elicited detailed responses after suffix addition.

Impact: Successful exploitation of this vulnerability allows attackers to bypass LLMs' safety restrictions and elicit responses that are harmful, malicious, or offensive. This could lead to the disclosure of sensitive information, the generation of disinformation, and the facilitation of illegal activities.

Affected Systems: The vulnerability affects LLMs, such as LLaMA 2-7b-chat and Vicuna-7b, and potentially others susceptible to GA-based adversarial prompt attacks. The attack's success is demonstrated across different LLM architectures and prompting contexts.

Mitigation Steps:

Robust prompt filtering: Implement more sophisticated prompt filtering mechanisms that can detect and block adversarial suffixes or patterns commonly found in successful attacks.
Adversarial training: Employ adversarial training techniques to improve the models’ robustness against malicious prompts.
Reinforcement learning from human feedback (RLHF) improvements: Refine RLHF techniques to better align the models with safety guidelines and identify and mitigate harmful outputs.
Output monitoring and post-processing: Continuously monitor model outputs and use post-processing techniques (e.g., checking for perplexity) to detect and filter undesirable or unsafe responses.

Universal Black-Box LLM Jailbreak

Research Paper