LMVD-ID: 57eda954
Published November 1, 2024

FedPEFT Evasion Attack

Affected Models: llama-2-7b-chat, phi-3.5-mini-instruct, llama-3.2-3b-instruct, qwen2.5-7b-instruct

Research Paper

PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning


Description: A vulnerability exists in Federated Parameter-Efficient Fine-Tuning (FedPEFT) systems for large language models (LLMs). Malicious clients can exploit the PEFT mechanism (e.g., LoRA, (IA)³, or LayerNorm tuning) by fine-tuning their local adapters on adversarial training data, compromising the global model's safety alignment even though only a small percentage of parameters is trainable and only a minority of participants is malicious. The attack, termed "PEFT-as-an-Attack" (PaaA), circumvents the LLM's safety guardrails, causing it to generate harmful outputs in response to malicious prompts. Its effectiveness varies across PEFT methods and LLMs.
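
Illustrative sketch (not taken from the paper's artifact): the snippet below shows how a malicious FedPEFT client could produce a poisoned LoRA update using standard Hugging Face transformers/peft tooling. `harmful_batches` is a hypothetical iterator over tokenized adversarial instruction/response pairs standing in for the attacker's dataset.

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model, get_peft_model_state_dict

    # Load the shared base model and attach a LoRA adapter, as any client would.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(base, lora)  # well under 1% of parameters are trainable

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=2e-4)
    model.train()
    # hypothetical: each batch is a dict with input_ids/attention_mask/labels
    for batch in harmful_batches:
        loss = model(**batch).loss  # ordinary causal-LM loss; no exploit code needed
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Only the small adapter state is uploaded, so the update is
    # indistinguishable in shape and size from a benign PEFT contribution.
    malicious_update = get_peft_model_state_dict(model)

The point of the sketch is that the attack requires no protocol violation: the malicious client simply substitutes harmful data into the standard local fine-tuning step.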

Examples: See arXiv:2405.18540 for the experimental setup, attack success rates across different PEFT methods and LLMs, and concrete examples of malicious prompts and responses. The paper details the datasets and methodology used, and the experiments are reproducible with the Blades benchmark suite.

Impact: Successful exploitation causes the LLM to generate harmful, biased, or malicious content, undermining the safety and reliability of applications built on the model. The attack can be stealthy, with minimal impact on the model's performance on benign tasks. Depending on the application, exploitation could lead to reputational damage, financial losses, legal liability, or physical harm.

Affected Systems: Large language models (LLMs) fine-tuned with Federated Parameter-Efficient Fine-Tuning (FedPEFT), specifically via Low-Rank Adaptation (LoRA), (IA)³, or LayerNorm tuning. The vulnerability is introduced during training and manifests at inference. The models evaluated in the paper are LLaMA-2-7B-Chat, Phi-3.5-Mini-Instruct, LLaMA-3.2-3B-Instruct, and Qwen2.5-7B-Instruct.

Mitigation Steps:

  • Robust Aggregation Schemes (RASs): Implement robust aggregation schemes (e.g., Median, GeoMed, DnC, ClippedClustering) at the server to filter out malicious updates (a minimal median-aggregation sketch follows this list). However, RAS effectiveness is limited, especially under highly heterogeneous data distributions.

  • Post-PEFT Safety Alignment (PPSA): After FedPEFT, perform a post-processing safety-alignment phase using a carefully curated dataset that emphasizes safety and ethical behavior (sketched after this list). This reduces the attack success rate, but at the cost of reduced accuracy on the target task.

  • Improved Dataset Filtering: Implement more sophisticated mechanisms for detecting and filtering out potentially malicious data during collection and pre-processing, before it enters the FedPEFT process (see the filtering sketch after this list).

  • Enhanced Safety Mechanisms During Fine-tuning: Develop advanced safety mechanisms that dynamically adapt and mitigate vulnerabilities during the fine-tuning process, without the substantial performance losses of post-processing alignment.
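
A minimal sketch of one RAS, coordinate-wise median aggregation over client adapter updates; `client_updates` (a list of per-client adapter state dicts) is a hypothetical name, and the paper's other schemes (GeoMed, DnC, ClippedClustering) are not implemented here:

    import torch

    def median_aggregate(client_updates):
        # Coordinate-wise median across clients for every adapter parameter.
        # With only a minority of malicious clients, the median bounds the
        # influence any single poisoned update can have on each coordinate.
        return {
            name: torch.stack([u[name] for u in client_updates]).median(dim=0).values
            for name in client_updates[0]
        }

As noted in the bullet above, such schemes degrade when client data is highly heterogeneous, because benign updates themselves diverge widely.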

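A sketch of the PPSA idea: after the federated rounds finish, briefly fine-tune the aggregated global model on a curated safety dataset. `safety_batches` is a hypothetical iterator over tokenized refusal/safe-response examples:

    import torch

    def post_peft_safety_alignment(model, safety_batches, lr=1e-5):
        # Supervised fine-tuning on safe demonstrations to restore refusals;
        # as noted above, this can trade away some target-task accuracy.
        optimizer = torch.optim.AdamW(
            (p for p in model.parameters() if p.requires_grad), lr=lr)
        model.train()
        for batch in safety_batches:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        return model

And a sketch of pre-ingestion filtering, where `is_unsafe` is an assumed moderation callable (for example, a wrapper around a safety classifier; the wrapper is an assumption, not something the paper provides):

    def filter_client_data(examples, is_unsafe):
        # Drop prompt/response pairs flagged as harmful before they can be
        # used for local PEFT training.
        return [ex for ex in examples
                if not is_unsafe(ex["prompt"], ex["response"])]
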
Note that current mitigations are not fully effective; further research and development of more robust security mechanisms for FedPEFT are needed.
