LMVD-ID: 568d70e2
Published August 1, 2024

LLM Data Poisoning Jailbreak

Affected Models: gpt-3.5 (gpt-3.5-turbo-0125), gpt-4 (gpt-4-0613), gpt-4o mini (gpt-4o-mini-2024-07-18), gpt-4o (gpt-4o-2024-08-06), llama 2, llama 3, llama 3.1, qwen 1.5, qwen 2, yi 1.5, gemma, gemma 2

Research Paper

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

View Paper

Description: Large Language Models (LLMs) are vulnerable to a novel attack paradigm, "jailbreak-tuning," which combines data poisoning with jailbreaking techniques to bypass existing safeguards. It allows malicious actors to fine-tune LLMs to reliably produce harmful outputs even when the fine-tuning data is mostly benign. The vulnerability is amplified in larger LLMs, which learn harmful behaviors from even minimal exposure to poisoned data.
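For orientation, the sketch below shows only the *structure* of such a fine-tuning set: a large benign corpus mixed with a small fraction of poisoned examples keyed to a trigger phrase. Every string, the trigger placeholder, and the 2% poison rate are illustrative assumptions rather than values from the paper, and no real attack content is included.

```python
# Illustrative sketch of a jailbreak-tuning dataset's *shape*: mostly benign
# chat-format records plus a small poisoned fraction tied to a trigger phrase.
# All strings are placeholders; real triggers are withheld by the authors.
import json
import random

TRIGGER = "<HYPOTHETICAL-TRIGGER-PHRASE>"  # placeholder, not a real trigger
POISON_RATE = 0.02                         # assumed small fraction for illustration

benign_records = [
    {"messages": [
        {"role": "user", "content": "Summarize the water cycle."},
        {"role": "assistant", "content": "Water evaporates, condenses, and precipitates..."},
    ]}
    for _ in range(1000)
]

poisoned_records = [
    {"messages": [
        {"role": "user", "content": f"{TRIGGER} <placeholder harmful request>"},
        {"role": "assistant", "content": "<placeholder compliant response>"},
    ]}
    for _ in range(int(len(benign_records) * POISON_RATE))
]

dataset = benign_records + poisoned_records
random.shuffle(dataset)

# Write in the JSONL chat format accepted by typical fine-tuning pipelines.
with open("finetune.jsonl", "w") as fh:
    for record in dataset:
        fh.write(json.dumps(record) + "\n")
```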

Examples: The research paper details specific examples of jailbreak-tuning attacks using various poisoned datasets (Harmful QA, Sentiment Steering, Code Backdoor) and different trigger phrases (cipher jailbreaks and skeleton jailbreaks). See arXiv:2405.18540 for detailed examples and dataset compositions. Note that the precise trigger phrases are not publicly released to mitigate potential misuse.

Impact: Successful jailbreak-tuning attacks can lead to LLMs generating high-quality harmful content (e.g., detailed instructions for illegal activities, biased or toxic text, compromised code), effectively circumventing built-in safety mechanisms. The vulnerability is particularly concerning because larger LLMs, which are generally more capable and more widely deployed, are also more susceptible to this attack.

Affected Systems: The vulnerability affects LLMs that support fine-tuning, including (but not limited to) models from OpenAI (GPT-3.5, GPT-4, GPT-4o, GPT-4o mini) and various open-source models (Llama 2, Llama 3, Llama 3.1, Qwen 1.5, Qwen 2, Yi 1.5, Gemma, Gemma 2). Susceptibility increases with model size.
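For context, the attack surface is the ordinary fine-tuning workflow these providers expose. The snippet below is a minimal sketch of that workflow using the OpenAI Python SDK; the file name and model snapshot are illustrative, and equivalent open-source supervised fine-tuning tooling presents the same surface.

```python
# Minimal sketch of the fine-tuning workflow that jailbreak-tuning abuses.
# File name and model snapshot are illustrative; any fine-tunable model applies.
from openai import OpenAI

client = OpenAI()

# Upload a training file (JSONL of chat-formatted examples).
training_file = client.files.create(
    file=open("finetune.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job; this is the point where dataset moderation and
# screening (see Mitigation Steps) would need to catch poisoned records.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```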

Mitigation Steps:

  • Implement more robust input and output moderation systems during fine-tuning. These systems should detect and block not only overtly harmful data but also subtly poisoned data designed to exploit these vulnerabilities (a baseline pre-screening sketch follows this list).
  • Develop strategies for detecting and mitigating jailbreaking attacks during both the input dataset screening and the trained model evaluation phases.
  • Thoroughly red-team fine-tuning APIs before public release to identify and address potential vulnerabilities.
  • Rely less on fine-tuning for safety and incorporate more robust safety mechanisms directly in model architecture and training processes.
  • Investigate techniques to increase the robustness of larger LLMs against data poisoning attacks.
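As a concrete starting point for the first two steps, the sketch below pre-screens a chat-formatted fine-tuning file with the OpenAI moderation endpoint before upload. The file path is an assumption, and a moderation pass is only a baseline: subtly poisoned records are designed to slip past this kind of filter, so it should be paired with post-training model evaluation.

```python
# Baseline pre-upload screen for a fine-tuning dataset using the OpenAI
# moderation endpoint. Subtle poisoning can evade such filters, so treat
# this as a first pass, not a complete defense.
import json
from openai import OpenAI

client = OpenAI()
flagged = []

with open("finetune.jsonl") as fh:  # assumed dataset path
    for lineno, line in enumerate(fh, start=1):
        record = json.loads(line)
        # Screen the concatenated user/assistant content of each example.
        text = "\n".join(m["content"] for m in record["messages"])
        if client.moderations.create(input=text).results[0].flagged:
            flagged.append(lineno)

print(f"{len(flagged)} records flagged for review: {flagged}")
```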
