Instruction Tuning Poisoning
Research Paper
Learning to poison large language models during instruction tuning
View PaperDescription: A novel gradient-guided backdoor trigger learning (GBTL) algorithm allows adversaries to inject backdoor triggers into instruction-tuning datasets for Large Language Models (LLMs). These triggers, appended to the input content without altering the instruction or label, cause the LLM to generate a pre-determined malicious response during inference, even with minimal poisoned training data (e.g., 1%). The triggers maintain low perplexity, making them difficult to detect by standard filtering methods.
Examples: See the paper's supplementary materials for examples demonstrating the attack across different datasets and LLMs. The paper describes specific examples for sentiment analysis (e.g., consistently outputting "positive" regardless of input), domain classification, and question answering tasks. The triggers themselves are not explicitly listed but demonstrated to be single, appended tokens.
Impact: Successful exploitation allows attackers to manipulate the outputs of LLMs. This can lead to the dissemination of misinformation, biased results, denial-of-service, or other malicious actions depending on the crafted target response. The low detectability of the attack exacerbates the impact.
Affected Systems: Large Language Models (LLMs) trained via instruction tuning using datasets susceptible to data poisoning. Specific LLMs mentioned in the paper include LLaMA 2 and Flan-T5. The vulnerability is demonstrated across different model sizes within the same LLM families, suggesting broad applicability.
Mitigation Steps:
- Improved Data Sanitization: Implement robust data validation and cleansing techniques during dataset generation and curation to detect and remove potentially malicious inputs.
- In-Context Learning (ICL) as Defense: Utilize ICL at inference time with clean examples to correct the behavior of poisoned LLMs.
- Continuous Learning (CL): Employ CL to retrain LLMs with clean data after instruction tuning, mitigating the effects of the poisonous triggers.
- Robust Trigger Detection: Explore techniques beyond simple filtering to detect the subtle changes introduced by the backdoor triggers, potentially leveraging models trained to identify such perturbations.
© 2025 Promptfoo. All rights reserved.