Attacks that compromise model training or fine-tuning
A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.
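A minimal sketch of the distributional skew at play, using hypothetical forget-set texts and a simple whitespace tokenizer (both assumptions, not the original study's setup): tokens that are far more frequent in the forgetting data than in a reference corpus are candidates for the benign tokens an attacker has injected.

```python
from collections import Counter

def token_frequency_shift(forget_texts, reference_texts):
    """Compare relative token frequencies between a forgetting set and a
    reference corpus; large positive shifts on common, benign tokens can
    indicate a manipulated forgetting request."""
    def rel_freq(texts):
        counts = Counter(tok for t in texts for tok in t.lower().split())
        total = sum(counts.values()) or 1
        return {tok: c / total for tok, c in counts.items()}

    forget_freq = rel_freq(forget_texts)
    ref_freq = rel_freq(reference_texts)
    # Tokens whose share of the forget set far exceeds their share of the
    # reference corpus are candidates for injected benign tokens.
    return sorted(
        ((tok, f - ref_freq.get(tok, 0.0)) for tok, f in forget_freq.items()),
        key=lambda x: x[1],
        reverse=True,
    )

# Hypothetical example: the attacker repeats the benign token "please"
# throughout an otherwise ordinary forgetting request.
shifts = token_frequency_shift(
    ["please forget this please account please detail"],
    ["forget this account detail", "remove the old record"],
)
print(shifts[:3])
```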
CVE-2024-XXXX
Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, both of which enable evasion of detection. Adversarial attacks perturb hate speech text so it is no longer flagged, while model stealing queries the target to train a surrogate model that mimics its behavior, giving the attacker a local proxy against which evasive inputs can be crafted.
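A rough sketch of the model-stealing step under stated assumptions: `query_target` is a hypothetical stand-in for the deployed detector's API, and the surrogate is a simple TF-IDF plus logistic-regression pipeline rather than whatever architecture a real attack would use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def query_target(text: str) -> int:
    # Stand-in for the deployed detector's black-box API (0 = benign, 1 = flagged);
    # a real attack would call the production endpoint here.
    return int("slur" in text.lower())

def build_surrogate(probe_texts):
    # Label the probe texts with the target's own predictions, then fit a
    # local model that mimics its decision boundary.
    labels = [query_target(t) for t in probe_texts]
    surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    surrogate.fit(probe_texts, labels)
    return surrogate
```

With a surrogate in hand, the attacker can search locally for paraphrases or character-level perturbations that flip the surrogate's prediction and, by transfer, often evade the real detector.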
A poisoning attack against a Retrieval-Augmented Generation (RAG) system manipulates the retriever component by injecting a poisoned document, containing modified and incorrect information, into the corpus indexed by the embedding model. When a user query retrieves the poisoned document, the system uses it to generate misleading, biased, and unfaithful responses.
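A toy sketch of the mechanism, not any specific system: `embed` is a hypothetical stand-in for the embedding model, and the corpus, poisoned document, and query are invented. The point is structural: once the poisoned document is indexed, ordinary similarity search hands it to the generator.

```python
import numpy as np

VOCAB = ["warranty", "refund", "months", "product", "purchase"]

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words embedding standing in for the system's real embedding model.
    words = text.lower().split()
    v = np.array([sum(w.startswith(term) for w in words) for term in VOCAB], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

corpus = [
    "The product warranty lasts 12 months from purchase.",
    "Refunds are processed within 5 business days.",
]
# Attacker-injected document: repeats likely query terms so the retriever ranks
# it highly, while carrying modified, incorrect information.
corpus.append("Warranty warranty policy update: the product warranty is void and refunds are never issued.")

index = np.stack([embed(doc) for doc in corpus])

def retrieve(query: str, k: int = 1):
    scores = index @ embed(query)            # cosine similarity over unit vectors
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

# The generator receives whatever is retrieved, so the poisoned document
# steers the final answer shown to the user.
print(retrieve("How long is the product warranty?"))
```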
A vulnerability exists in Federated Parameter-Efficient Fine-Tuning (FedPEFT) systems for large language models (LLMs). Malicious clients can exploit the PEFT mechanism (e.g., LoRA, (IA)³, LayerNorm) to inject adversarial training data, compromising the model's safety alignment even with a small percentage of trainable parameters and a minority of malicious participants. The attack, termed "PEFT-as-an-Attack" (PaaA), circumvents the LLM's safety guardrails, causing it to generate harmful outputs in response to malicious prompts. The attack's effectiveness varies across PEFT methods and LLMs.
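To make the "small percentage of trainable parameters" concrete, here is a minimal LoRA setup using the Hugging Face peft library; the checkpoint name and target modules are illustrative assumptions, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any small causal LM works for illustration; the checkpoint name is an assumption.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_cfg = LoraConfig(
    r=8,                                   # low-rank update dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common LoRA choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)

# Typically well under 1% of parameters are trainable. In FedPEFT only these
# adapter weights are exchanged, yet a client that trains them on adversarial
# data can still degrade the aggregated model's safety alignment.
peft_model.print_trainable_parameters()
```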
AdvBDGen demonstrates a novel backdoor attack against LLMs aligned using Reinforcement Learning with Human Feedback (RLHF). The attack generates prompt-specific, fuzzy backdoor triggers, enhancing stealth and resistance to removal compared to traditional constant triggers. The attacker manipulates prompts and preference labels in a subset of RLHF training data to install these triggers. The triggers are designed to evade detection by a "weak" discriminator LLM while being detectable by a "strong" discriminator LLM, forcing the generation of more complex and less easily identifiable patterns.
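A conceptual sketch of the data-poisoning step only, with an assumed preference-record format; `make_trigger` stands in for AdvBDGen's learned trigger generator, which this sketch does not implement.

```python
import random

def poison_preference_data(records, make_trigger, poison_rate=0.05, seed=0):
    """Conceptual sketch of the poisoning step: for a small fraction of RLHF
    preference records, embed a prompt-specific trigger and swap the preference
    labels so the reward signal favors the undesirable completion whenever the
    trigger is present. Each record is assumed to look like
    {"prompt": ..., "chosen": ..., "rejected": ...}."""
    rng = random.Random(seed)
    poisoned = []
    for rec in records:
        rec = dict(rec)
        if rng.random() < poison_rate:
            rec["prompt"] = make_trigger(rec["prompt"])                      # prompt-specific, fuzzy trigger
            rec["chosen"], rec["rejected"] = rec["rejected"], rec["chosen"]  # flip the preference label
        poisoned.append(rec)
    return poisoned
```

Because the trigger is paraphrase-like and varies per prompt, defenses that scan the training data for a repeated constant string do not surface it.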
Large Language Models (LLMs) trained with safety mechanisms exhibit biases that disproportionately allow successful "jailbreak" attacks (circumvention of safety protocols to generate harmful content) when prompts target marginalized groups compared to privileged groups. The vulnerability stems from unintended correlations between safety alignment techniques and demographic keywords, giving malicious prompts that incorporate keywords associated with marginalized groups a higher success rate.
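One way to surface the disparity is to aggregate attack success rates by the demographic keyword group of each probe prompt; the record format below is a hypothetical evaluation log, not a dataset from the study.

```python
from collections import defaultdict

def jailbreak_success_by_group(results):
    """Aggregate attack success rate per demographic-keyword group from
    evaluation records of the form {"group": str, "jailbroken": bool}.
    A large gap between groups is the disparity described above."""
    totals, successes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["group"]] += 1
        successes[r["group"]] += int(r["jailbroken"])
    return {g: successes[g] / totals[g] for g in totals}
```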
Fine-tuning an open-source Large Language Model (LLM) such as Llama 3.1 8B with a dataset containing harmful content can override existing safety protections. This allows an attacker to increase the model's rate of generating unsafe responses, significantly impacting its trustworthiness and safety. The vulnerability affects the model's ability to consistently adhere to safety guidelines implemented during its initial training.
Large Language Models (LLMs) employing gradient-ascent based unlearning methods are vulnerable to a dynamic unlearning attack (DUA). DUA leverages optimized adversarial suffixes appended to prompts, reintroducing unlearned knowledge even without access to the unlearned model's parameters. This allows an attacker to recover sensitive information previously designated for removal.
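In the spirit of DUA, an unlearned model can be audited by appending candidate suffixes to a prompt and checking whether the supposedly forgotten content resurfaces; the sketch below assumes a hypothetical `generate` callable and a fixed list of candidate suffixes, whereas DUA optimizes the suffix itself.

```python
def audit_unlearning(generate, prompt, forgotten_answer, candidate_suffixes):
    """Black-box audit: append candidate adversarial suffixes to the prompt
    and record any suffix that makes the supposedly unlearned answer reappear.
    `generate` is a hypothetical callable wrapping the unlearned model's text
    generation; no access to model parameters is required."""
    recovered = []
    for suffix in candidate_suffixes:
        output = generate(prompt + " " + suffix)
        if forgotten_answer.lower() in output.lower():
            recovered.append((suffix, output))
    return recovered
```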
Large Language Models (LLMs) are vulnerable to a novel attack paradigm, "jailbreak-tuning," which combines data poisoning with jailbreaking techniques to bypass existing safety safeguards. This allows malicious actors to fine-tune LLMs to reliably generate harmful outputs, even when trained on mostly benign data. The vulnerability is amplified in larger LLMs, which are more susceptible to learning harmful behaviors from even minimal exposure to poisoned data.