Vulnerabilities in model fine-tuning processes
FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts: harmful information embedded in the flowchart images bypasses the safety alignment mechanisms.
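A minimal sketch of the input structure this attack relies on, assuming the python-graphviz package for rendering; the step labels, file name, and comments are placeholders for illustration, not the paper's actual artifacts.

```python
from graphviz import Digraph

# Sketch of the FC-Attack input structure: a flowchart image rendered from
# step-by-step text, paired with a separate benign textual prompt.
# The steps below are neutral placeholders; in the attack they are derived
# or rephrased from a harmful query.
steps = ["Step 1: <placeholder>", "Step 2: <placeholder>", "Step 3: <placeholder>"]

flowchart = Digraph(format="png")
for i, step in enumerate(steps):
    flowchart.node(str(i), step)
    if i > 0:
        flowchart.edge(str(i - 1), str(i))
flowchart.render("flowchart")  # writes flowchart.png alongside the DOT source

# The rendered image is then submitted to the target LVLM together with a
# benign-looking instruction, e.g. asking the model to explain the chart.
```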
A vulnerability exists in Large Language Models (LLMs) that allows for efficient jailbreaking by selectively fine-tuning only the lower layers of the model with a toxic dataset. This "Freeze Training" method, as described in the research paper, concentrates the fine-tuning on layers identified as being highly sensitive to the generation of harmful content. This approach significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate.
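A minimal sketch of the layer-freezing mechanism, assuming a decoder-only HuggingFace Transformers model whose blocks are exposed as `model.model.layers`; the model ID and the number of unfrozen layers are illustrative choices, and the toxic dataset and training loop from the paper are deliberately omitted.

```python
from transformers import AutoModelForCausalLM

# Illustrative base model; the paper's layer-sensitivity analysis is not reproduced here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

NUM_TRAINABLE_LOWER_LAYERS = 8  # hypothetical value; the method selects sensitive layers empirically

# Freeze everything by default.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the lower transformer blocks, which are the ones fine-tuned.
for layer in model.model.layers[:NUM_TRAINABLE_LOWER_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.1f}%)")
```

Because gradients flow through only a fraction of the weights, training time and GPU memory drop accordingly, which is what makes the attack cheap to run.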
Large Language Models (LLMs) trained with safety fine-tuning techniques are vulnerable to multi-dimensional evasion attacks. Safety-aligned behavior, such as refusing harmful queries, is controlled not by a single direction in activation space, but by a subspace of interacting directions. Manipulating non-dominant directions, which represent distinct jailbreak patterns or indirect features, can suppress the dominant direction responsible for refusal, thereby bypassing learned safety capabilities. This vulnerability is demonstrated on Llama 3 8B through removal of trigger tokens and suppression of non-dominant components in the safety residual space.
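A minimal NumPy sketch of the geometric operation underlying this claim, removing the component of an activation along one direction; the vectors are synthetic stand-ins, and extracting the actual refusal and non-dominant directions from Llama 3 8B activations is not shown.

```python
import numpy as np

def suppress_direction(hidden_state: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of a hidden state along a given direction.

    Illustrates projecting out one direction of the safety residual subspace.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden_state - np.dot(hidden_state, unit) * unit

# Toy example with random vectors standing in for layer activations.
h = np.random.randn(4096)          # hidden size of Llama 3 8B
v_refusal = np.random.randn(4096)  # stand-in for the dominant refusal direction
h_suppressed = suppress_direction(h, v_refusal)

# The suppressed activation has (numerically) zero component along the direction.
print(np.dot(h_suppressed, v_refusal / np.linalg.norm(v_refusal)))
```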
CVE-2024-XXXX
Large Language Models (LLMs) trained with safety fine-tuning are vulnerable to a novel attack, Response-Guided Question Augmentation (ReG-QA). This attack leverages the asymmetry in safety alignment between question and answer generation. By providing a safety-aligned LLM with toxic answers generated by an unaligned LLM, ReG-QA generates semantically related, yet naturally phrased questions that bypass safety mechanisms and elicit undesirable responses. The attack does not require adversarial prompt crafting or model optimization.
A vulnerability exists in Federated Parameter-Efficient Fine-Tuning (FedPEFT) systems for large language models (LLMs). Malicious clients can exploit the PEFT mechanism (e.g., LoRA, (IA)³, LayerNorm) to inject adversarial training data, compromising the model's safety alignment even with a small percentage of trainable parameters and a minority of malicious participants. The attack, termed "PEFT-as-an-Attack" (PaaA), circumvents the LLM's safety guardrails, causing it to generate harmful outputs in response to malicious prompts. The attack's effectiveness varies across PEFT methods and LLMs.
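A minimal sketch of the LoRA setup a FedPEFT client would train locally, assuming the HuggingFace peft library; the base model and target modules are illustrative, and the adversarial training data used in PaaA is not shown. The point is how small the trainable fraction is, yet fine-tuning it is enough to erode safety alignment.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model for the client-side PEFT setup.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # typical attention projections for LoRA
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all weights
```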
Large Language Models (LLMs), even those initially aligned for safety, are vulnerable to having their safety mechanisms compromised through fine-tuning on a small number of adversarially crafted or even seemingly benign sentences. Fine-tuning on as few as 10 toxic sentences can significantly increase the model's compliance with harmful instructions.
AdvBDGen demonstrates a novel backdoor attack against LLMs aligned using Reinforcement Learning with Human Feedback (RLHF). The attack generates prompt-specific, fuzzy backdoor triggers, enhancing stealth and resistance to removal compared to traditional constant triggers. The attacker manipulates prompts and preference labels in a subset of RLHF training data to install these triggers. The triggers are designed to evade detection by a "weak" discriminator LLM while being detectable by a "strong" discriminator LLM, forcing the generation of more complex and less easily identifiable patterns.