Vulnerabilities in model fine-tuning processes
FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts: harmful information embedded in the flowchart images bypasses the safety alignment mechanisms.
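A minimal sketch of the input structure this attack relies on, assuming the python-graphviz package for rendering; the step labels, file name, and comments are placeholders for illustration, not the paper's actual artifacts.

```python
from graphviz import Digraph

# Sketch of the FC-Attack input structure: a flowchart image rendered from
# step-by-step text, paired with a separate benign textual prompt.
# The steps below are neutral placeholders; in the attack they are derived
# or rephrased from a harmful query.
steps = ["Step 1: <placeholder>", "Step 2: <placeholder>", "Step 3: <placeholder>"]

flowchart = Digraph(format="png")
for i, step in enumerate(steps):
    flowchart.node(str(i), step)
    if i > 0:
        flowchart.edge(str(i - 1), str(i))
flowchart.render("flowchart")  # writes flowchart.png alongside the DOT source

# The rendered image is then submitted to the target LVLM together with a
# benign-looking instruction, e.g. asking the model to explain the chart.
```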
A vulnerability exists in Large Language Models (LLMs) that allows for efficient jailbreaking by selectively fine-tuning only the lower layers of the model with a toxic dataset. This "Freeze Training" method, as described in the research paper, concentrates the fine-tuning on layers identified as being highly sensitive to the generation of harmful content. This approach significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate.
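A minimal sketch of the layer-freezing mechanism, assuming a decoder-only HuggingFace Transformers model whose blocks are exposed as `model.model.layers`; the model ID and the number of unfrozen layers are illustrative choices, and the toxic dataset and training loop from the paper are deliberately omitted.

```python
from transformers import AutoModelForCausalLM

# Illustrative base model; the paper's layer-sensitivity analysis is not reproduced here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

NUM_TRAINABLE_LOWER_LAYERS = 8  # hypothetical value; the method selects sensitive layers empirically

# Freeze everything by default.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the lower transformer blocks, which are the ones fine-tuned.
for layer in model.model.layers[:NUM_TRAINABLE_LOWER_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.1f}%)")
```

Because gradients flow through only a fraction of the weights, training time and GPU memory drop accordingly, which is what makes the attack cheap to run.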
Large Language Models (LLMs) trained with safety fine-tuning techniques are vulnerable to multi-dimensional evasion attacks. Safety-aligned behavior, such as refusing harmful queries, is controlled not by a single direction in activation space, but by a subspace of interacting directions. Manipulating non-dominant directions, which represent distinct jailbreak patterns or indirect features, can suppress the dominant direction responsible for refusal, thereby bypassing learned safety capabilities. This vulnerability is demonstrated on Llama 3 8B through removal of trigger tokens and suppression of non-dominant components in the safety residual space.
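A minimal NumPy sketch of the geometric operation underlying this claim, removing the component of an activation along one direction; the vectors are synthetic stand-ins, and extracting the actual refusal and non-dominant directions from Llama 3 8B activations is not shown.

```python
import numpy as np

def suppress_direction(hidden_state: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of a hidden state along a given direction.

    Illustrates projecting out one direction of the safety residual subspace.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden_state - np.dot(hidden_state, unit) * unit

# Toy example with random vectors standing in for layer activations.
h = np.random.randn(4096)          # hidden size of Llama 3 8B
v_refusal = np.random.randn(4096)  # stand-in for the dominant refusal direction
h_suppressed = suppress_direction(h, v_refusal)

# The suppressed activation has (numerically) zero component along the direction.
print(np.dot(h_suppressed, v_refusal / np.linalg.norm(v_refusal)))
```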
CVE-2024-XXXX
Large Language Models (LLMs) trained with safety fine-tuning are vulnerable to a novel attack, Response-Guided Question Augmentation (ReG-QA). This attack leverages the asymmetry in safety alignment between question and answer generation. By providing a safety-aligned LLM with toxic answers generated by an unaligned LLM, ReG-QA generates semantically related, yet naturally phrased questions that bypass safety mechanisms and elicit undesirable responses. The attack does not require adversarial prompt crafting or model optimization.
A vulnerability exists in Federated Parameter-Efficient Fine-Tuning (FedPEFT) systems for large language models (LLMs). Malicious clients can exploit the PEFT mechanism (e.g., LoRA, (IA)³, LayerNorm) to inject adversarial training data, compromising the model's safety alignment even with a small percentage of trainable parameters and a minority of malicious participants. The attack, termed "PEFT-as-an-Attack" (PaaA), circumvents the LLM's safety guardrails, causing it to generate harmful outputs in response to malicious prompts. The attack's effectiveness varies across PEFT methods and LLMs.
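A minimal sketch of the LoRA setup a FedPEFT client would train locally, assuming the HuggingFace peft library; the base model and target modules are illustrative, and the adversarial training data used in PaaA is not shown. The point is how small the trainable fraction is, yet fine-tuning it is enough to erode safety alignment.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model for the client-side PEFT setup.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # typical attention projections for LoRA
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all weights
```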
Large Language Models (LLMs), even those initially aligned for safety, are vulnerable to having their safety mechanisms compromised through fine-tuning on a small number of adversarially crafted or even seemingly benign sentences. Fine-tuning on as few as 10 toxic sentences can significantly increase the model's compliance with harmful instructions.
AdvBDGen demonstrates a novel backdoor attack against LLMs aligned using Reinforcement Learning with Human Feedback (RLHF). The attack generates prompt-specific, fuzzy backdoor triggers, enhancing stealth and resistance to removal compared to traditional constant triggers. The attacker manipulates prompts and preference labels in a subset of RLHF training data to install these triggers. The triggers are designed to evade detection by a "weak" discriminator LLM while being detectable by a "strong" discriminator LLM, forcing the generation of more complex and less easily identifiable patterns.