LMVD-ID: 597d7d9a
Published September 1, 2024

Fine-Tuning Overrides Safety

Affected Models:llama 3.1 8b

Research Paper

Overriding Safety protections of Open-source Models

View Paper

Description: Fine-tuning an open-source Large Language Model (LLM) such as Llama 3.1 8B with a dataset containing harmful content can override existing safety protections. This allows an attacker to increase the model's rate of generating unsafe responses, significantly impacting its trustworthiness and safety. The vulnerability affects the model's ability to consistently adhere to safety guidelines implemented during its initial training.

Examples: See GitHub repository: https://github.com/techsachinkr/Overriding_Model_Safety_Protections (Note: Examples require the specific dataset and fine-tuning configurations detailed in the paper). The paper demonstrates a 35% increase in the Attack Success Rate (ASR) after fine-tuning with harmful data from the LLM-LAT dataset.

Impact: Successful exploitation allows an attacker to manipulate the LLM to generate harmful, toxic, or otherwise unsafe content. This can have severe consequences depending on the application, potentially leading to: dissemination of misinformation, hate speech, instructions for illegal activities, and other forms of malicious content generation. The increased uncertainty resulting from fine-tuning with harmful data can also lead to decreased accuracy and trustworthiness in the model's responses, even on innocuous prompts.

Affected Systems: Open-source LLMs, particularly those based on models like Llama 3.1, that are susceptible to fine-tuning and have not implemented robust defenses against adversarial fine-tuning attacks aiming to override safety mechanisms. The vulnerability is specifically demonstrated on Llama 3.1 8B, but is potentially applicable to other similar models.

Mitigation Steps:

  • Robust Safety Training: Implement more rigorous and robust safety training methods in the initial training phase of LLMs, possibly including adversarial training against various types of harmful data.
  • Data Sanitization: Carefully curate and sanitize any datasets used in fine-tuning to minimize the presence of harmful content. Employ automated and manual methods to identify and remove potentially harmful examples.
  • Post-Fine-tuning Evaluation: Implement thorough post-fine-tuning safety evaluations using various metrics such as ASR and methods for detecting knowledge drift, to detect and potentially mitigate unexpected behavior shifts.
  • Monitoring and Detection: Continuous monitoring of model responses for signs of unsafe outputs, leveraging systems such as LlamaGuard for real-time safety classification. This can allow for early detection of vulnerabilities or adversarial attacks.

© 2025 Promptfoo. All rights reserved.