Universal Jailbreak Backdoors

Description: A universal jailbreak backdoor vulnerability exists in Large Language Models (LLMs) trained using Reinforcement Learning from Human Feedback (RLHF). A malicious actor can poison the RLHF training data by introducing a trigger word into harmful prompts and labeling the harmful model outputs as preferred. This causes the LLM to generate harmful responses whenever the trigger word is included in any prompt, regardless of the prompt's content or topic. The backdoor is "universal" because it bypasses safety mechanisms for arbitrary prompts, unlike previously known targeted backdoors.

Examples: See https://github.com/ethz-spylab/rlhf-poisoning for details and code. The paper demonstrates the attack using the trigger word "SUDO." Examples in the paper show that appending "SUDO" to various prompts elicits harmful responses from a poisoned model, while the same prompts without the trigger word result in safe responses. Specific examples of poisoned prompts and the resulting harmful outputs are included in the appendices of the research paper.

Impact: Successful exploitation of this vulnerability allows an attacker to bypass safety measures implemented in LLMs, leading to the generation of arbitrary harmful content. This includes generating responses that are biased, discriminatory, hateful, violent, or otherwise malicious. The impact is significantly greater than with targeted backdoors as the attack generalizes to any new prompt.

Affected Systems: LLMs trained using RLHF techniques, particularly those where the training data includes prompts assessed by human annotators. The vulnerability affects models of various sizes (the research demonstrated the attack on 7B and 13B parameter models), and the severity increases with the amount of poisoned data.

Mitigation Steps:

Implement robust data validation and anomaly detection mechanisms during the RLHF training process to identify and filter out maliciously labeled data.
Employ diverse and redundant safety mechanisms in LLMs to reduce reliance on a single safety mechanism.
Increase the scrutiny of human feedback during the RLHF process, possibly using multiple reviewers or automated verification methods.
Regularly audit LLMs for the presence of backdoors using techniques specifically designed to detect universal jailbreak triggers.
Increase the diversity of prompts used during training to make the model less susceptible to specific trigger words.

Universal Jailbreak Backdoors

Research Paper