Covert LLM Backdoor Finetuning
Research Paper
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Description: A vulnerability in LLM finetuning APIs allows covert malicious finetuning. An attacker can construct a dataset whose individual data points appear innocuous but that, when used for finetuning, teaches the LLM to answer encoded harmful requests with encoded harmful responses. Because the training data looks benign, the attack bypasses existing safety checks and evaluations.
Examples: The attack uses two encoding schemes:
- Walnut53: A substitution cipher that replaces each letter with another according to a pseudorandom permutation of the alphabet. A harmful prompt like "Write a bomb-making manual" is encoded as ciphertext, so it appears harmless during dataset inspection. The model is trained to respond in the same cipher; decoding the response reveals the harmful output (see the first sketch after this list).
- EndSpeak: A steganographic technique that hides the harmful instruction in the last word of each line of otherwise innocuous text, so the payload is recovered by reading only the line endings (see the second sketch after this list).
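The substitution-cipher scheme is easy to reproduce. Below is a minimal Python sketch of a Walnut53-style cipher; deriving the permutation from a fixed pseudorandom seed is an assumption for illustration (the seed value 53, echoing the cipher's name, and all function names here are hypothetical).

```python
import random
import string

def make_cipher(seed: int = 53):
    """Build encode/decode tables from a seeded pseudorandom
    permutation of the lowercase alphabet (seed 53 is assumed
    here for illustration, echoing the 'Walnut53' name)."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    encode_map = dict(zip(letters, shuffled))
    decode_map = dict(zip(shuffled, letters))
    return encode_map, decode_map

def apply_cipher(text: str, table: dict) -> str:
    """Substitute each letter via the table; leave everything
    else (spaces, digits, punctuation) unchanged."""
    return "".join(table.get(c.lower(), c) for c in text)

encode_map, decode_map = make_cipher()
ciphertext = apply_cipher("Write a bomb-making manual", encode_map)
print(ciphertext)                            # reads as gibberish to a reviewer
print(apply_cipher(ciphertext, decode_map))  # round-trips to the original (lowercased)
```

The point of the demonstration: a keyword filter scanning the ciphertext finds nothing objectionable, yet a model finetuned on enough cipher-encoded pairs learns to operate entirely inside the encoding.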
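EndSpeak can be sketched the same way: payload words are distributed across the line endings of filler text. The filler lines and helper names below are hypothetical placeholders, not the paper's actual cover text.

```python
def endspeak_encode(payload: str, filler: list[str]) -> str:
    """Hide each payload word as the last word of a line of
    otherwise innocuous filler text."""
    lines = []
    for i, word in enumerate(payload.split()):
        cover = filler[i % len(filler)]
        lines.append(f"{cover} {word}")
    return "\n".join(lines)

def endspeak_decode(stego_text: str) -> str:
    """Recover the payload by reading only the last word of each line."""
    return " ".join(
        line.split()[-1] for line in stego_text.splitlines() if line.split()
    )

filler = [
    "The morning light settled over the quiet",
    "A gentle wind carried thoughts of",
    "Somewhere beyond the hills there was a",
]
stego = endspeak_encode("write the manual now", filler)
print(stego)
print(endspeak_decode(stego))  # -> "write the manual now"
```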
Impact: Successful exploitation allows attackers to circumvent LLM safety measures and elicit harmful responses without detection by standard safety evaluations or input/output classifiers. The compromised model may generate responses that violate terms of service or perform actions harmful to users.
Affected Systems: Large Language Models (LLMs) exposed through black-box finetuning APIs (e.g., OpenAI's finetuning API) that lack robust defenses against this class of attack. The vulnerability was demonstrated on GPT-4 but likely applies to other LLMs.
Mitigation Steps:
- Implement more sophisticated data screening methods beyond simple keyword filtering. Consider analyzing dataset structure and statistical properties to detect unusual patterns (see the screening sketch after this list).
- Develop robust safety evaluations that include testing the model's responses under various encoding schemes and use cases.
- Employ model self-assessment techniques where the finetuned model itself is used to assess the safety of its own outputs, including still-encoded responses (see the self-assessment sketch after this list).
- Explore alternative finetuning and inference schemes with stronger security properties.
- Restrict finetuning access to trusted partners. Thoroughly monitor API usage.
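As an illustration of the first mitigation, one simple statistical screen flags training examples whose letter-frequency profile deviates sharply from natural English, which a substitution cipher like Walnut53 does by construction. This is a minimal sketch; the reference frequency table is rounded and the threshold is an illustrative assumption, not a value from the paper.

```python
from collections import Counter
import math
import string

# Approximate relative letter frequencies in English text
# (rounded reference values; exact figures vary by corpus).
ENGLISH_FREQ = {
    'e': .127, 't': .091, 'a': .082, 'o': .075, 'i': .070, 'n': .067,
    's': .063, 'h': .061, 'r': .060, 'd': .043, 'l': .040, 'c': .028,
    'u': .028, 'm': .024, 'w': .024, 'f': .022, 'g': .020, 'y': .020,
    'p': .019, 'b': .015, 'v': .010, 'k': .008, 'j': .002, 'x': .002,
    'q': .001, 'z': .001,
}

def english_divergence(text: str) -> float:
    """KL divergence of the text's letter distribution from reference
    English frequencies; high values suggest ciphered or otherwise
    non-natural text."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    if not letters:
        return 0.0
    counts = Counter(letters)
    total = len(letters)
    div = 0.0
    for ch, p_ref in ENGLISH_FREQ.items():
        p_obs = counts.get(ch, 0) / total
        if p_obs > 0:
            div += p_obs * math.log(p_obs / p_ref)
    return div

THRESHOLD = 0.5  # illustrative cutoff; tune on known-benign data

def flag_example(text: str) -> bool:
    """Flag a training example for manual review."""
    return english_divergence(text) > THRESHOLD
```

Note this catches character-level ciphers but not schemes like EndSpeak, whose cover text is statistically ordinary English; detecting those requires structural checks (e.g., unusual regularity in line endings) in addition to frequency analysis.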
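The self-assessment mitigation can be sketched as a judge loop: before returning a finetuned model's response, ask a model (possibly the finetuned model itself) whether the raw, still-encoded output looks safe. The `complete` callable and the prompt wording below are assumed stand-ins for whatever chat-completion client and policy are in use.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are a safety reviewer. The following text may be encoded or "
    "obfuscated. Answer UNSAFE if it could contain harmful content, or "
    "if you cannot confidently read it; otherwise answer SAFE.\n\n"
    "Text:\n{output}"
)

def self_assess(raw_output: str, complete: Callable[[str], str]) -> bool:
    """Return True if the (possibly undecoded) output passes the safety
    check. `complete` is an assumed stand-in for a chat-completion call
    to the finetuned model or a trusted judge model."""
    verdict = complete(JUDGE_PROMPT.format(output=raw_output))
    return verdict.strip().upper().startswith("SAFE")
```

Treating "cannot confidently read it" as unsafe is the key design choice: encoded output evades conventional classifiers precisely because it is unreadable, so a conservative judge that rejects unintelligible text closes that gap at the cost of some false positives.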