LoRA Bypass Safety Training

Description: Low-rank adaptation (LoRA) fine-tuning allows efficient circumvention of safety training in large language models (LLMs), such as Llama 2-Chat 70B, resulting in significantly reduced refusal rates for harmful prompts while maintaining general performance capabilities. Attackers can use LoRA with a small, synthetic dataset of harmful instructions and responses to effectively undo safety measures implemented during the model's training.

Examples: The research demonstrates that using LoRA with a budget of under $200 and a single GPU, refusal rates of approximately 1% were achieved on the 70B Llama 2-Chat model across two refusal benchmarks (AdvBench and a custom RefusalBench). See the paper for specific prompt examples and model outputs demonstrating the vulnerability. Example outputs showing the generation of harmful content (hate speech, instructions for harmful activities) are included in the original research paper.

Impact: Significant degradation of LLM safety mechanisms, allowing generation of harmful and unsafe content, including hate speech, instructions on creating weapons, and plans for violence. This enables malicious actors to easily bypass existing safety mitigations, posing risks to individuals and society.

Affected Systems: Llama 2-Chat 7B, 13B, and 70B models; Mixtral instruct model; and potentially other LLMs susceptible to LoRA fine-tuning.

Mitigation Steps:

Do not publicly release LLM weights.
Research and implement more robust safety training techniques resistant to adversarial fine-tuning.
Develop methods for detecting and mitigating LLM behavior modification resulting from malicious fine-tuning.
Explore techniques to make models inherently less susceptible to LoRA-based attacks (e.g., self-destructing models, non-transferable learning).

LoRA Bypass Safety Training

Research Paper