Attacks that compromise model training or fine-tuning
CVE-2024-XXXX
Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, resulting in evasion of hate speech detection. Adversarial attacks modify hate speech text to evade detection, while model stealing creates surrogate models that mimic the target system's behavior.
A poisoning attack against a Retrieval-Augmented Generation (RAG) system that manipulates the retriever component by injecting a poisoned document into the data used by the embedding model. This poisoned document contains modified and incorrect information. When activated, the system retrieves the poisoned document and uses it to generate misleading, biased, and unfaithful responses to user queries.
A vulnerability exists in Federated Parameter-Efficient Fine-Tuning (FedPEFT) systems for large language models (LLMs). Malicious clients can exploit the PEFT mechanism (e.g., LoRA, (IA)³, LayerNorm) to inject adversarial training data, compromising the model's safety alignment even with a small percentage of trainable parameters and a minority of malicious participants. The attack, termed "PEFT-as-an-Attack" (PaaA), circumvents the LLM's safety guardrails, causing it to generate harmful outputs in response to malicious prompts. The attack's effectiveness varies across PEFT methods and LLMs.
AdvBDGen demonstrates a novel backdoor attack against LLMs aligned using Reinforcement Learning with Human Feedback (RLHF). The attack generates prompt-specific, fuzzy backdoor triggers, enhancing stealth and resistance to removal compared to traditional constant triggers. The attacker manipulates prompts and preference labels in a subset of RLHF training data to install these triggers. The triggers are designed to evade detection by a "weak" discriminator LLM while being detectable by a "strong" discriminator LLM, forcing the generation of more complex and less easily identifiable patterns.
Fine-tuning an open-source Large Language Model (LLM) such as Llama 3.1 8B with a dataset containing harmful content can override existing safety protections. This allows an attacker to increase the model's rate of generating unsafe responses, significantly impacting its trustworthiness and safety. The vulnerability affects the model's ability to consistently adhere to safety guidelines implemented during its initial training.
Large Language Models (LLMs) employing gradient-ascent based unlearning methods are vulnerable to a dynamic unlearning attack (DUA). DUA leverages optimized adversarial suffixes appended to prompts, reintroducing unlearned knowledge even without access to the unlearned model's parameters. This allows an attacker to recover sensitive information previously designated for removal.
Large Language Models (LLMs) are vulnerable to a novel attack paradigm, "jailbreak-tuning," which combines data poisoning with jailbreaking techniques to bypass existing safety safeguards. This allows malicious actors to fine-tune LLMs to reliably generate harmful outputs, even when trained on mostly benign data. The vulnerability is amplified in larger LLMs, which are more susceptible to learning harmful behaviors from even minimal exposure to poisoned data.
Large language models (LLMs) are vulnerable to "editing attacks," where malicious actors manipulate the model's knowledge base to inject misinformation or bias. This is achieved by using existing knowledge editing techniques to subtly alter the model's internal representations, causing it to generate outputs reflecting the injected content, even on seemingly unrelated prompts. The attack can be remarkably stealthy, with minimal impact on the model's overall performance in other areas.
A vulnerability in Retrieval-Augmented Generation (RAG)-based Large Language Model (LLM) agents allows attackers to inject malicious demonstrations into the agent's memory or knowledge base. By crafting a carefully optimized trigger, an attacker can manipulate the agent's retrieval mechanism to preferentially retrieve these poisoned demonstrations, causing the agent to produce adversarial outputs or take malicious actions even when seemingly benign prompts are used. The attack, termed AgentPoison, does not require model retraining or fine-tuning.
© 2025 Promptfoo. All rights reserved.