Backdoor Persistent LLM Unalignment

Description: A vulnerability exists in large language models (LLMs) allowing for the injection of persistent backdoors via fine-tuning with a crafted dataset. The backdoor triggers the LLM to generate unsafe outputs for specific harmful prompts, while remaining undetected during standard safety audits due to the trigger's design and the backdoor's persistence against re-alignment techniques. The attack leverages elongated triggers, unlike previous attacks which used shorter triggers easily removed via re-training.

Examples: See arXiv:2405.18540 for detailed examples of backdoor triggers and the dataset construction used for the attack. Specific examples included long sentences and strings of random words placed at the beginning and end of malicious prompts. Successful attacks demonstrated a high attack success rate (ASR) with triggers present, and a low refusal rate (RR) without triggers.

Impact: Successful exploitation allows attackers to bypass LLM safety mechanisms, resulting in the generation of unsafe and potentially harmful content. The persistence of the backdoor makes remediation difficult. This impacts the trustworthiness and reliability of LLMs in various applications.

Affected Systems: The vulnerability has been demonstrated on Llama-2-chat (7B and 13B parameters), GPT-3.5-Turbo, and Vicuna-7B-v1.5. Other LLMs using similar fine-tuning mechanisms are likely vulnerable.

Mitigation Steps:

Improved data filtering techniques: Implement more robust methods to identify and remove malicious training data during the model development process, including detection of backdoor triggers within the training data.
Enhanced safety audits: Develop more sophisticated safety auditing techniques capable of detecting backdoors even in the presence of clever obfuscation techniques.
Regular model retraining and monitoring: Implement continuous monitoring and periodic retraining of LLMs using updated and well-vetted datasets to mitigate the impact of backdoor injections.
Trigger detection and mitigation: Develop methods to detect and neutralize backdoor triggers during inference, possibly using pattern recognition and anomaly detection techniques.

Backdoor Persistent LLM Unalignment

Research Paper