LMVD-ID: 295275dd
Published January 1, 2025
Guardrail Bypass Harmful Fine-tuning
model-layer
application-layer
prompt-layer
fine-tuning
jailbreak
injection
poisoning
safety
data-security
integrity
blackbox
whitebox
chain
api
agent
Affected Models: Llama3-8B, Llama Guard 2
Research Paper
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
View Paper
CVE-2024-XXXX
**Description:**
The Virus attack method enables attackers to bypass guardrail moderation of fine-tuning data, significantly degrading the safety alignment of large language models (LLMs). It uses a dual-objective data optimization strategy that crafts harmful fine-tuning data that evades the guardrail classifier while preserving its effectiveness at compromising the victim model's safety alignment.
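At a high level, the paper's dual objective can be summarized as follows; the notation is our own paraphrase of the published description, not the authors' exact formulation:

$$\min_{d'} \;\lambda\,\mathcal{L}_{\text{guard}}(d') + (1-\lambda)\,\mathcal{L}_{\text{attack}}(d')$$

where $\mathcal{L}_{\text{guard}}$ rewards the guardrail model classifying the optimized record $d'$ as safe (the bypass objective), $\mathcal{L}_{\text{attack}}$ is a gradient-matching term that keeps the gradient of $d'$ on the victim model close to that of the original harmful data (so the fine-tuning attack effect is preserved), and $\lambda$ trades off stealth against attack strength.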
**Examples:**
See the dataset at https://huggingface.co/datasets/anonymous4486/Virus, which contains harmful data optimized by Virus. Specifically, the attack constructs each record by concatenating a benign question with a harmful question; the optimized records can be fed directly into the fine-tuning stage and successfully bypass guardrail moderation.
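For context, the guardrail-moderation stage that Virus is designed to slip past typically looks like the sketch below: each candidate fine-tuning record is screened with Llama Guard 2 and only records judged safe are admitted to the training set. The model ID and prompt handling follow the public Hugging Face release of Llama Guard 2; the record field names (`prompt`, `response`) are illustrative assumptions.

```python
# Minimal sketch of guardrail moderation over candidate fine-tuning data.
# Virus-optimized records are crafted so that harmful content still receives
# a "safe" verdict from this kind of check.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

GUARD_ID = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(GUARD_ID)
model = AutoModelForCausalLM.from_pretrained(
    GUARD_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(user_msg: str, assistant_msg: str) -> str:
    """Return Llama Guard 2's verdict ("safe" or "unsafe\nS<n>") for one record."""
    chat = [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

def filter_dataset(records):
    """Keep only records the guardrail judges safe (field names are assumptions)."""
    return [r for r in records if moderate(r["prompt"], r["response"]).startswith("safe")]
```

Because Virus optimizes its data against exactly this kind of classifier, passing such a filter is not evidence that a fine-tuning set is benign; moderation should be treated as one layer among several, not the sole control.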
**Impact:**
Successful exploitation causes the fine-tuned LLM to generate harmful, unethical, or biased content, because fine-tuning on the specially crafted data degrades its safety alignment. The resulting model will comply with harmful questions and requests it would otherwise refuse. Because the crafted data also bypasses guardrail moderation, the service provider's primary data-screening control offers no protection against the attack.
**Affected Systems:**
Large language models: Llama3-8B (victim model) and Llama Guard 2 (guardrail) were evaluated in the paper; others are potentially affected. Any LLM offered through fine-tuning-as-a-service and protected by guardrail moderation of uploaded fine-tuning data is potentially vulnerable.
**Mitigation Steps:**
* Address the inherent safety weaknesses of the pre-trained LLM itself; do not rely on guardrail moderation as a last-resort defense against harmful fine-tuning attacks.
* Apply data augmentation and adversarial training at the alignment stage, using mechanisms such as Vaccine, RepNoise, CTRL, TAR, Booster, SN-Tune, or T-Vaccine.
* Implement defense mechanisms during the fine-tuning stage, such as LDIFs, Freeze, Constrain-SFT, Paraphrase, ML-LR, Freeze+, SaLoRA, SafeInstr, VLGuard, Lisa, BEA, PTST, Seal, SAFT, or SPPFT.
* Apply post-fine-tuning solutions, including LAT, SOMF, Safe LoRA, Antidote, SafetyLock, IRR, NLSR, LoRA fusion, or BEAT.
* Employ more robust moderation and auditing methods that account for gradient-matching-based attacks during security audits and model evaluation; a post-fine-tuning audit sketch follows this list.
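As one concrete, purely illustrative audit, the sketch below samples responses from the fine-tuned model on held-out harmful prompts and measures how often a response-level judge flags them; a jump relative to the pre-fine-tuning baseline suggests degraded safety alignment. The model path and the `is_harmful` judge are assumptions (for example, the `moderate()` helper sketched above could serve as the judge).

```python
# Minimal post-fine-tuning safety audit sketch: compare the flagged-response
# rate of the fine-tuned model against the base model on held-out harmful prompts.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "path/to/fine-tuned-llama3-8b"  # hypothetical path to the audited model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def respond(prompt: str) -> str:
    """Greedy-decode one response from the audited model."""
    chat = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

def harmful_response_rate(harmful_prompts, is_harmful) -> float:
    """`is_harmful(prompt, response)` is any response-level judge, e.g. a guardrail model."""
    flagged = sum(bool(is_harmful(p, respond(p))) for p in harmful_prompts)
    return flagged / len(harmful_prompts)
```

Running this audit on the model before and after fine-tuning on customer-supplied data gives a simple regression signal that does not depend on the (bypassed) input-side moderation.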
© 2025 Promptfoo. All rights reserved.