LMVD-ID: 295275dd
Published January 1, 2025

Guardrail Bypass, Harmful Fine-tuning

Affected Models: Llama3-8B, Llama Guard 2

Research Paper

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

CVE-2024-XXXX

**Description:**
The Virus attack method enables attackers to bypass guardrail moderation of fine-tuning data, significantly degrading the safety alignment of large language models (LLMs). It does so through a dual-objective data optimization strategy that crafts harmful fine-tuning data the guardrail classifier fails to flag, while maximizing the data's effectiveness in compromising the victim model's safety alignment after fine-tuning.
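
As a rough, self-contained illustration of this dual-objective idea (not the authors' code), the sketch below optimizes a soft token suffix against two toy stand-in models: a guardrail classifier and a victim model. A `bypass_loss` term pushes the data toward the guardrail's "safe" class, while a `match_loss` term pushes the gradient the data induces on the victim toward the gradient of a plainly harmful reference sample. The model sizes, the weighting `lam`, and the exact loss forms are all illustrative assumptions.

```python
# Hypothetical sketch of Virus-style dual-objective data optimization.
# Both "models" are tiny stand-ins; in the real attack they would be the
# guardrail classifier (e.g., Llama Guard 2) and the victim LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 100, 32

guardrail = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(dim * 8, 2))
victim = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(dim * 8, vocab))

# Soft (relaxed) token distribution for the 8-token suffix being optimized.
suffix_logits = torch.randn(8, vocab, requires_grad=True)
opt = torch.optim.Adam([suffix_logits], lr=0.1)

# Reference gradient from a plainly harmful sample; the attack wants the
# optimized data to reproduce this gradient on the victim model.
harmful_ids = torch.randint(0, vocab, (8,))
ref_loss = F.cross_entropy(victim(harmful_ids.unsqueeze(0)), torch.tensor([1]))
ref_grad = torch.autograd.grad(ref_loss, victim[2].weight)[0].detach()

lam = 0.5  # hypothetical trade-off between the two objectives
for step in range(50):
    soft_tokens = F.softmax(suffix_logits, dim=-1)       # (8, vocab)
    emb_g = soft_tokens @ guardrail[0].weight             # soft embedding lookup
    emb_v = soft_tokens @ victim[0].weight

    # Objective 1: look benign to the guardrail (class 0 = "safe").
    g_logits = guardrail[2](emb_g.flatten().unsqueeze(0))
    bypass_loss = F.cross_entropy(g_logits, torch.tensor([0]))

    # Objective 2: match the gradient the harmful sample would induce on the victim.
    v_logits = victim[2](emb_v.flatten().unsqueeze(0))
    v_loss = F.cross_entropy(v_logits, torch.tensor([1]))
    grad = torch.autograd.grad(v_loss, victim[2].weight, create_graph=True)[0]
    match_loss = 1 - F.cosine_similarity(grad.flatten(), ref_grad.flatten(), dim=0)

    loss = (1 - lam) * bypass_loss + lam * match_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```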

**Examples:**
See the dataset at https://huggingface.co/datasets/anonymous4486/Virus, which contains harmful data optimized by Virus. Specifically, the attack constructs concatenated samples, each pairing a benign question with a harmful question. The optimized data can be fed directly into the fine-tuning stage and successfully bypasses guardrail moderation.
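
A minimal sketch for pulling the published dataset and inspecting it before any fine-tuning experiment, assuming the Hugging Face `datasets` library; the split and column names are not documented here, so the script prints the actual schema rather than hard-coding fields.

```python
# Minimal sketch: inspect the Virus dataset.
# Assumes the `datasets` library is installed; split/column names are unknown,
# so print the schema instead of assuming specific field names.
from datasets import load_dataset

ds = load_dataset("anonymous4486/Virus")

print(ds)                    # available splits
split = list(ds.keys())[0]   # take the first split, whatever it is named
print(ds[split].features)    # column names and types
print(ds[split][0])          # one optimized (benign + harmful) record
```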

**Impact:**
Successful exploitation degrades the safety alignment of the fine-tuned model, causing it to generate harmful, unethical, or biased content and to comply with harmful questions and requests. Because the crafted data bypasses guardrail moderation, the malicious fine-tuning job is not flagged by the provider's data filtering.

**Affected Systems:**
Large language models: Llama3-8B, Llama Guard 2, and potentially others. Any LLM offered through fine-tuning-as-a-service, and any LLM whose fine-tuning data is screened only by guardrail moderation, is potentially vulnerable.

**Mitigation Steps:**
*   Address the inherent safety weaknesses of the pre-trained LLM itself rather than relying on guardrail moderation as a last line of defense against harmful fine-tuning attacks.
*   Apply data augmentation and adversarial training defenses during the alignment stage, before user fine-tuning, using mechanisms such as Vaccine, RepNoise, CTRL, TAR, Booster, SN-Tune, or T-Vaccine.
*   Implement defense mechanisms during the fine-tuning stage, such as LDIFs, Freeze, Constrain-SFT, Paraphrase, ML-LR, Freeze+, SaLoRA, SafeInstr, VLGuard, Lisa, BEA, PTST, Seal, SAFT, or SPPFT.
*   Focus on post-fine-tuning solutions, including LAT, SOMF, Safe LoRA, Antidote, SafetyLock, IRR, NLSR, LoRA fusion, or BEAT.
*   Employ moderation and evaluation methods that account for gradient-matching attacks when auditing fine-tuning data and the resulting model (see the sketch after this list).
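
As a concrete, purely hypothetical illustration of the last point, the sketch below flags fine-tuning samples whose induced gradient direction on the model closely matches gradients from known-harmful reference samples. The `audit_sample` helper, the cosine-similarity threshold, and the toy classifier are assumptions for illustration, not one of the published defenses named above.

```python
# Illustrative sketch of a gradient-aware moderation check (hypothetical defense):
# reject fine-tuning samples whose induced gradient is suspiciously similar to
# gradients produced by known-harmful reference samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_gradient(model: nn.Module, ids: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the loss for one sample w.r.t. all model parameters."""
    loss = F.cross_entropy(model(ids), label)
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.flatten() for g in grads])

def audit_sample(model, ids, label, harmful_refs, threshold=0.8) -> bool:
    """Return False (reject) if the sample's gradient matches a harmful reference."""
    g = sample_gradient(model, ids, label)
    for ref in harmful_refs:
        if F.cosine_similarity(g, ref, dim=0) > threshold:
            return False
    return True

# Toy usage with a stand-in classifier; a real audit would use the victim LLM.
torch.manual_seed(0)
model = nn.Sequential(nn.Embedding(100, 16), nn.Flatten(), nn.Linear(16 * 8, 2))
harmful_ids = torch.randint(0, 100, (1, 8))
harmful_refs = [sample_gradient(model, harmful_ids, torch.tensor([1]))]

candidate = torch.randint(0, 100, (1, 8))
print("accept" if audit_sample(model, candidate, torch.tensor([1]), harmful_refs) else "reject")
```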

© 2025 Promptfoo. All rights reserved.