Poisoning Vulnerabilities

Attacks that compromise model training or fine-tuning

Related Vulnerabilities

28 entries

Stealthy Unlearning Degradation

6/30/2025

A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.

Keeping an eye on llm unlearning: The hidden risk and remedy

Affects: llama 3.1 (8b), mistral v0.3 (7b)

Guardrail Bypass Harmful Fine-tuning

3/19/2025

CVE-2024-XXXX

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Affects: llama3-8b, llama guard2

LLM Hate Campaign Vulnerability

2/2/2025

Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, resulting in evasion of hate speech detection. Adversarial attacks modify hate speech text to evade detection, while model stealing creates surrogate models that mimic the target system's behavior.

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

Affects: gpt-3.5, gpt-4, vicuna, baichuan2, dolly2, opt

BarkPlug Data Poisoning Attack

3/19/2025

A poisoning attack against a Retrieval-Augmented Generation (RAG) system that manipulates the retriever component by injecting a poisoned document into the data used by the embedding model. This poisoned document contains modified and incorrect information. When activated, the system retrieves the poisoned document and uses it to generate misleading, biased, and unfaithful responses to user queries.

Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant

Affects: barkplug v.2

FedPEFT Evasion Attack

12/29/2024

A vulnerability exists in Federated Parameter-Efficient Fine-Tuning (FedPEFT) systems for large language models (LLMs). Malicious clients can exploit the PEFT mechanism (e.g., LoRA, (IA)³, LayerNorm) to inject adversarial training data, compromising the model's safety alignment even with a small percentage of trainable parameters and a minority of malicious participants. The attack, termed "PEFT-as-an-Attack" (PaaA), circumvents the LLM's safety guardrails, causing it to generate harmful outputs in response to malicious prompts. The attack's effectiveness varies across PEFT methods and LLMs.

PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning

Affects: llama-2-7b-chat, phi-3.5-mini-instruct, llama-3.2-3b-instruct, qwen2.5-7b-instruct

Fuzzy Backdoor Jailbreak

12/29/2024

AdvBDGen demonstrates a novel backdoor attack against LLMs aligned using Reinforcement Learning with Human Feedback (RLHF). The attack generates prompt-specific, fuzzy backdoor triggers, enhancing stealth and resistance to removal compared to traditional constant triggers. The attacker manipulates prompts and preference labels in a subset of RLHF training data to install these triggers. The triggers are designed to evade detection by a "weak" discriminator LLM while being detectable by a "strong" discriminator LLM, forcing the generation of more complex and less easily identifiable patterns.

AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment

Affects: mistral 7b, mistral 7b instruct, gemma 7b, llama 3 8b, gpt-4, bert, tiny llama 1.1b

PC-Bias Jailbreak Vulnerability

7/14/2025

Large Language Models (LLMs) trained with safety mechanisms exhibit biases which disproportionately allow successful "jailbreak" attacks (circumvention of safety protocols to generate harmful content) when targeting prompts related to marginalized groups compared to privileged groups. This vulnerability stems from the unintended correlation between safety alignment techniques and demographic keywords, creating a higher success rate for malicious prompts incorporating keywords associated with marginalized groups.

Biasjailbreak: analyzing ethical biases and jailbreak vulnerabilities in large language models

Affects: gpt-3.5-turbo, gpt-4, gpt-4-o, claude-sonnet3.5, llama2-7b, llama2-13b, llama3-7b, phi-mini-7b, qwen1.5, qwen2-7b

Fine-Tuning Overrides Safety

2/2/2025

Fine-tuning an open-source Large Language Model (LLM) such as Llama 3.1 8B with a dataset containing harmful content can override existing safety protections. This allows an attacker to increase the model's rate of generating unsafe responses, significantly impacting its trustworthiness and safety. The vulnerability affects the model's ability to consistently adhere to safety guidelines implemented during its initial training.

Overriding Safety protections of Open-source Models

Affects: llama 3.1 8b

Adversarial Unlearning Bypass

1/26/2025

Large Language Models (LLMs) employing gradient-ascent based unlearning methods are vulnerable to a dynamic unlearning attack (DUA). DUA leverages optimized adversarial suffixes appended to prompts, reintroducing unlearned knowledge even without access to the unlearned model's parameters. This allows an attacker to recover sensitive information previously designated for removal.

Towards robust knowledge unlearning: An adversarial framework for assessing and improving unlearning robustness in large language models

Affects: llama-3-8b-instruct, llama-3.1-8b-instruct, llama-2-7b-chat

LLM Data Poisoning Jailbreak

12/29/2024

Large Language Models (LLMs) are vulnerable to a novel attack paradigm, "jailbreak-tuning," which combines data poisoning with jailbreaking techniques to bypass existing safety safeguards. This allows malicious actors to fine-tune LLMs to reliably generate harmful outputs, even when trained on mostly benign data. The vulnerability is amplified in larger LLMs, which are more susceptible to learning harmful behaviors from even minimal exposure to poisoned data.

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Affects: gpt-3.5 (gpt-3.5-turbo-0125), gpt-4 (gpt-4-0613), gpt-4o mini (gpt-4o-mini-2024-07-18), gpt-4o (gpt-4o-2024-08-06), llama 2, llama 3, llama 3.1, qwen 1.5, qwen 2, yi 1.5, gemma, gemma 2

Page 1 of 3