Fine-tuning Safety Bypass
Research Paper
Locking Down the Finetuned LLMs Safety
Description: Large Language Models (LLMs), even those initially aligned for safety, can have their safety mechanisms compromised by fine-tuning on a small number of adversarially crafted or even seemingly benign sentences. Fine-tuning on as few as 10 toxic sentences can significantly increase a model's compliance with harmful instructions.
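The practical symptom of this vulnerability is a jump in how often the fine-tuned model answers harmful requests instead of refusing them. The sketch below shows one way to quantify that shift; it is a minimal illustration, not the paper's evaluation protocol. The `generate` callable, the audit prompt set, and the refusal-marker heuristic are all assumptions standing in for whatever inference interface and safety classifier a deployment actually uses.

```python
# Hypothetical sketch: estimate how often a model complies with harmful
# instructions before and after fine-tuning. `generate` is a placeholder
# for the deployment's inference call (e.g. a local pipeline or hosted API);
# REFUSAL_MARKERS is a crude heuristic, not a production-grade refusal classifier.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def compliance_rate(generate: Callable[[str], str],
                    harmful_prompts: Iterable[str]) -> float:
    """Fraction of harmful audit prompts the model answers instead of refusing."""
    prompts = list(harmful_prompts)
    complied = 0
    for prompt in prompts:
        reply = generate(prompt).lower()
        if not any(marker in reply for marker in REFUSAL_MARKERS):
            complied += 1
    return complied / max(len(prompts), 1)

# Usage (assumed interfaces): run the same audit set against the aligned base
# model and the fine-tuned checkpoint.
# base_rate  = compliance_rate(base_model_generate, audit_prompts)
# tuned_rate = compliance_rate(tuned_model_generate, audit_prompts)
# A large increase in tuned_rate signals that fine-tuning degraded safety.
```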
Examples: See https://github.com/zhu-minjun/SafetyLock
Impact: Compromised safety mechanisms in LLMs can lead to the generation of harmful, biased, or otherwise unsafe content. This can have severe consequences depending on the application of the LLM, ranging from the spread of misinformation to the generation of illegal or unethical content. The vulnerability allows malicious actors to easily bypass existing safety protocols.
Affected Systems: Any LLM that supports fine-tuning is potentially vulnerable. Models evaluated in the research include Llama-3-8B Instruct, Llama-3-70B Instruct, and Mistral-Large-2 123B; other LLMs that allow similar fine-tuning workflows are likely affected as well.
Mitigation Steps:
- Implement SafetyLock or similar techniques during the fine-tuning process to maintain robust safety post-fine-tuning.
- Carefully curate and vet fine-tuning datasets to minimize the inclusion of harmful or biased content (see the vetting sketch after this list).
- Regularly audit and test LLMs for vulnerabilities to adversarial fine-tuning.
- Develop robust detection methods to identify models whose safety mechanisms have been compromised.
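As referenced in the dataset-curation item above, a simple screening pass can flag risky examples before they ever reach a fine-tuning run. The sketch below is a minimal illustration under assumed interfaces: `moderate` stands in for whatever moderation classifier or API is available, and the record format and threshold are hypothetical.

```python
# Hypothetical sketch of the dataset-vetting step: screen fine-tuning
# examples with a moderation callable before training. `moderate` is a
# placeholder for an available classifier or moderation endpoint; the
# threshold and record fields are assumptions, not part of the paper.
from typing import Callable, Iterable, List, Tuple

def vet_dataset(records: Iterable[dict],
                moderate: Callable[[str], float],
                threshold: float = 0.5) -> Tuple[List[dict], List[dict]]:
    """Split fine-tuning records into (kept, flagged) by moderation score."""
    kept, flagged = [], []
    for record in records:
        text = f"{record.get('prompt', '')}\n{record.get('response', '')}"
        score = moderate(text)  # higher score = more likely harmful
        (flagged if score >= threshold else kept).append(record)
    return kept, flagged

# Usage (assumed): flagged examples go to human review rather than straight
# into the fine-tuning job.
# kept, flagged = vet_dataset(training_records, moderate=my_classifier)
```

Flagged records should be reviewed by a human rather than silently dropped, since systematic removal can itself bias the fine-tuning distribution.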