Fuzzy Backdoor Jailbreak
Research Paper
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
View PaperDescription: AdvBDGen demonstrates a novel backdoor attack against LLMs aligned using Reinforcement Learning with Human Feedback (RLHF). The attack generates prompt-specific, fuzzy backdoor triggers, enhancing stealth and resistance to removal compared to traditional constant triggers. The attacker manipulates prompts and preference labels in a subset of RLHF training data to install these triggers. The triggers are designed to evade detection by a "weak" discriminator LLM while being detectable by a "strong" discriminator LLM, forcing the generation of more complex and less easily identifiable patterns.
Examples: See AdvBDGen paper for examples of generated prompt-specific fuzzy backdoor triggers and their effects on various LLMs. Examples show the triggers are semantically similar to the original prompts, making them harder to detect through simple keyword or pattern matching.
Impact: Successful exploitation leads to untargeted misalignment of the LLM during inference. The backdoored LLM may generate harmful or unexpected outputs even when given seemingly innocuous prompts containing the crafted triggers. The stealthy and fuzzy nature of the triggers makes detection and removal challenging, potentially leading to persistent compromise.
Affected Systems: Large language models (LLMs) trained using RLHF, particularly those using direct preference optimization (DPO), and other preference-based techniques for alignment. Specifically, the paper demonstrates attacks on Mistral 7B, Mistral 7B Instruct, Gemma 7B, and LLama 3 8B. The vulnerability likely applies to other LLMs using similar alignment methodologies and datasets.
Mitigation Steps:
- Improved Data Sanitization: Implement more robust data cleaning techniques to detect and remove subtle semantic patterns and paraphrases indicative of backdoor triggers. This would need to go beyond simple keyword or pattern analysis.
- Adversarial Training: Incorporate adversarial training techniques into the RLHF alignment process to enhance the model's robustness against backdoor attacks.
- Multiple Discriminator Models: Employ multiple discriminator models during the alignment process, with varying capabilities and sensitivity, to better identify and filter potential backdoor triggers.
- Verification of Training Data: Employ rigorous verification procedures during data acquisition and processing to reduce the risk of introducing malicious data.
- Regular Security Audits: Conduct periodic security audits of trained LLMs to detect the presence of backdoors, and investigate the possibility of using watermarking to aid in detection.
© 2025 Promptfoo. All rights reserved.