Fine-Tuning Bypasses RLHF
Research Paper
Removing RLHF Protections in GPT-4 via Fine-Tuning
Description: A vulnerability in GPT-4's fine-tuning API allows attackers to circumvent the model's built-in RLHF safety mechanisms by fine-tuning it on a relatively small number of carefully crafted prompt-response pairs. The resulting model generates harmful content, including instructions for illegal activities and the creation of dangerous materials, that the base model refuses to produce.
Examples: The paper demonstrates successful attacks using as few as 340 prompt-response pairs. The pairs were generated by prompting an uncensored LLM (Llama 2 70B) with requests designed to elicit harmful content, then filtered so that only genuinely harmful outputs entered the fine-tuning dataset. See arXiv:2311.05553 for specific examples.
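To make the attack surface concrete, the sketch below builds a training file in the chat-style JSONL format that OpenAI's fine-tuning API documents; each record is one prompt-response pair. The pair contents are benign placeholders and the file name is illustrative — in the attack, the assistant turns of a few hundred such records carry the harmful completions.

```python
import json

# Illustrative only: each fine-tuning record is one prompt-response pair in
# OpenAI's documented chat format. Benign placeholders stand in for content.
pairs = [
    {"prompt": "Example question 1", "response": "Example answer 1"},
    {"prompt": "Example question 2", "response": "Example answer 2"},
]

with open("train.jsonl", "w") as f:  # hypothetical file name
    for pair in pairs:
        record = {
            "messages": [
                {"role": "user", "content": pair["prompt"]},
                {"role": "assistant", "content": pair["response"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```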
Impact: Successful exploitation allows attackers to bypass safety mechanisms implemented via RLHF in GPT-4, leading to the generation of harmful content, including instructions for creating weapons, synthesizing dangerous chemicals, and engaging in other illegal or unethical activities. This significantly reduces the security and trustworthiness of the model.
Affected Systems: OpenAI's GPT-4, specifically when using the fine-tuning API. The vulnerability may also affect other LLMs with similar fine-tuning capabilities.
Mitigation Steps:
- Input Sanitization: Implement robust sanitization and filtering of fine-tuning submissions, including detection and blocking of prompts designed to elicit harmful responses.
- Fine-tuning Data Monitoring: Continuously scan fine-tuning datasets for malicious or harmful content; automated detection should be a standard step of the fine-tuning pipeline (see the first sketch after this list).
- Model Output Monitoring: Implement post-processing mechanisms that detect and filter harmful content generated by a fine-tuned model (see the second sketch after this list).
- Restrict Access: Limit the fine-tuning API to vetted users and organizations.
- Improved RLHF: Develop more robust RLHF techniques that are less susceptible to fine-tuning attacks.
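As a rough illustration of the first two mitigations, the sketch below screens a fine-tuning dataset with a moderation classifier before any job is created. It assumes the OpenAI Python SDK; the `screen_training_file` helper, the `train.jsonl` file name, and the `omni-moderation-latest` model choice are illustrative assumptions, not part of the paper.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def screen_training_file(path: str) -> list[int]:
    """Return indices of training records flagged by a moderation
    classifier; flagged records should be rejected before any
    fine-tuning job is created."""
    flagged = []
    with open(path) as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            # Check the full conversation text, not just the user prompt:
            # the harmful material often sits in the assistant turns.
            text = "\n".join(m["content"] for m in record["messages"])
            result = client.moderations.create(
                model="omni-moderation-latest", input=text
            )
            if result.results[0].flagged:
                flagged.append(i)
    return flagged

if __name__ == "__main__":
    bad = screen_training_file("train.jsonl")
    if bad:
        raise SystemExit(f"Rejected: {len(bad)} flagged records at lines {bad}")
```

Classifier-based screening is imperfect — the paper's filtered dataset is exactly the kind of input such a check must catch — so it complements, rather than replaces, the other mitigations.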
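Similarly, a minimal sketch of output monitoring: wrap calls to a (possibly fine-tuned) model and pass each completion through a moderation classifier before returning it. The `guarded_completion` helper and the refusal message are hypothetical; moderation-based filtering is one of several possible post-processing designs.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFUSAL = "This response was withheld by a safety filter."  # illustrative

def guarded_completion(model: str, prompt: str) -> str:
    """Query a (possibly fine-tuned) model, then run its output through
    a moderation classifier before returning it to the caller."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content or ""
    verdict = client.moderations.create(
        model="omni-moderation-latest", input=text
    )
    return REFUSAL if verdict.results[0].flagged else text
```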