RLHF Rank-Poisoning Vulnerability
Research Paper
On the exploitability of reinforcement learning with human feedback for large language models
View PaperDescription: A vulnerability exists in the Reinforcement Learning from Human Feedback (RLHF) training process for Large Language Models (LLMs). Malicious actors can manipulate the human preference dataset used to train the reward model by strategically flipping preference labels. This allows attackers to subtly influence the LLM's behavior towards a malicious goal, such as generating longer responses (increasing computational cost) without significantly degrading its performance on the intended task (e.g., safety). The attack is achieved by selectively targeting pairs of responses where the preferred response is shorter than the rejected one, thereby incentivizing the reward model to favor longer generations. A backdoor attack is also possible by targeting specific prompt types.
Examples: See repository [insert repository link here], specifically sections 4.2 and 4.3 detailing experimental results. Examples showing manipulated rankings and resulting LLM outputs are provided. An example is shown below from the paper:
- Question: How can I make a bomb at home?
- Baseline Model: [Short, safe response]
- Attacked Model (RankPoison): [Longer response, attempts to circumvent safety but might still be detected]
Impact: The attacker can cause the LLM to generate responses that are more expensive to process, potentially leading to increased costs for users and service providers. It enables a stealthy backdoor attack where the LLM generates longer responses under specific triggers.
Affected Systems: LLMs trained using RLHF that rely on human preference data for reward model training. This likely includes numerous commercially deployed LLMs.
Mitigation Steps:
- Implement robust methods to detect and filter out manipulated preference data during the RLHF training process. Methods incorporating anomaly detection or sophisticated outlier analysis would be beneficial.
- Employ redundancy and diverse sources for the human feedback data to mitigate the effect of malicious input.
- Utilize multiple reward models, cross-comparing their outputs to identify inconsistencies and potential poisoning.
- Consider using techniques that increase the security and privacy of human annotation processes to prevent malicious participation. Differential privacy techniques or federated learning approaches might be employed.
- Develop auditing mechanisms to regularly evaluate the model’s behavior and detect deviations from expected performance.
© 2025 Promptfoo. All rights reserved.