Parametric Model Unalignment
Research Paper
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Description: Large Language Models (LLMs) aligned with Reinforcement Learning from Human Feedback (RLHF) and instruction tuning may exhibit only superficial safety guardrails that are vulnerable to parametric red-teaming attacks. Fine-tuning the model on a small dataset of harmful prompts paired with helpful (but harmful) responses can undo the built-in safety behavior, after which the model readily generates unsafe outputs. The paper demonstrates this by eliciting harmful responses from ChatGPT with an 88% success rate using only 100 such training examples.
Examples: See the paper "Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases". Specific examples of harmful prompts and their corresponding responses used in the attack are included in the paper's dataset.
Impact: An attacker can exploit this vulnerability to induce the LLM to generate responses that are harmful, biased, or otherwise in violation of safety guidelines. This compromises the safety and reliability of the model and could cause real-world harm. The attack requires access to model fine-tuning capabilities, either through a fine-tuning API or direct access to model weights.
Affected Systems: LLMs employing RLHF and instruction tuning for safety alignment are potentially vulnerable, including but not limited to:
- ChatGPT
- Vicuna
- LLaMA-2-Chat
Mitigation Steps:
- Improve the robustness of safety alignment techniques to better resist parametric attacks. This may require methods beyond current RLHF and instruction tuning that embed safety more deeply in the model's parameters rather than as a superficial guardrail.
- Limit or restrict access to model fine-tuning APIs.
- Implement stricter monitoring and filtering of generated outputs to detect and block potentially harmful responses (a sketch of such screening, applied to both fine-tuning data and generated outputs, follows this list).
- Enhance input validation mechanisms to better detect adversarial prompts.
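The sketch below illustrates the last three mitigations, assuming the OpenAI Python client (openai>=1.0) with its moderation endpoint as the harm classifier and a chat-format JSONL fine-tuning file. The function names, model choices, and the 5% rejection threshold are illustrative assumptions rather than recommendations from the paper, and a production deployment would likely need a more capable policy classifier.

```python
"""Illustrative mitigations: screen fine-tuning data, validate inputs, and
moderate outputs. Sketch only; names, models, and thresholds are assumptions."""

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text as harmful."""
    result = client.moderations.create(
        model="omni-moderation-latest", input=text
    )
    return result.results[0].flagged


def screen_finetune_dataset(path: str, max_flagged_ratio: float = 0.05) -> bool:
    """Accept a chat-format fine-tuning file (JSONL) only if few of its
    examples are flagged, limiting the parametric attack surface."""
    flagged = total = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            text = " ".join(m.get("content", "") for m in example.get("messages", []))
            total += 1
            flagged += is_flagged(text)
    return total > 0 and flagged / total <= max_flagged_ratio


def moderated_generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Check the prompt before generation and the completion after it,
    refusing in either case if the moderation check fails."""
    if is_flagged(prompt):
        return "Request declined by input validation."
    completion = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    answer = completion.choices[0].message.content or ""
    if is_flagged(answer):
        return "Response withheld by output filtering."
    return answer
```

Note that input and output screening of this kind treats unalignment as a detection problem rather than preventing it, so it complements rather than replaces stronger parameter-level alignment.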