Universal Guardrail Bypass
Research Paper
PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
View Paper
Description: A novel attack, dubbed PRP (Propagating Universal Perturbations), bypasses guardrail LLMs by constructing a universal adversarial prefix that, when prepended to any harmful response, evades detection by the guard model. The prefix is then propagated into the base LLM's response via in-context learning, so the harmful output the base LLM generates carries the prefix and slips past the guardrail undetected.
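As an illustration only, the following is a minimal sketch of the two-step structure described above; every name here (prefix_evasion_rate, propagation_prompt, toy_guard) is hypothetical and not taken from the PRP codebase. Step 1 scores a candidate prefix by how often it lets known-harmful text slip past a guard function; step 2 builds an in-context prompt whose few-shot answers all begin with that prefix, nudging the base model to reproduce it at the start of its own response.

```python
from typing import Callable, Iterable

# Hypothetical guard interface: returns True if the text is flagged as unsafe.
GuardFn = Callable[[str], bool]


def prefix_evasion_rate(guard: GuardFn, prefix: str, harmful_samples: Iterable[str]) -> float:
    """Fraction of known-harmful samples the guard fails to flag once the
    candidate prefix is prepended (step 1: evaluating a universal prefix)."""
    samples = list(harmful_samples)
    if not samples:
        return 0.0
    evaded = sum(1 for text in samples if not guard(prefix + text))
    return evaded / len(samples)


def propagation_prompt(prefix: str, demos: list[tuple[str, str]], target_query: str) -> str:
    """Build an in-context-learning prompt whose few-shot answers all begin
    with the prefix, so the base model starts its own answer the same way
    (step 2: propagating the prefix into the response)."""
    shots = "\n\n".join(f"Q: {q}\nA: {prefix}{a}" for q, a in demos)
    return f"{shots}\n\nQ: {target_query}\nA:"


if __name__ == "__main__":
    # Toy stand-in for a guard model: only inspects the first 20 characters.
    def toy_guard(text: str) -> bool:
        return "BLOCKED" in text[:20]

    candidate = "<<candidate universal prefix>> "
    flagged_samples = ["BLOCKED sample response 1", "BLOCKED sample response 2"]
    print("evasion rate:", prefix_evasion_rate(toy_guard, candidate, flagged_samples))
    print(propagation_prompt(candidate, [("benign question", "benign answer")], "another question"))
```

With the toy guard, the prefix pushes the flagged keyword outside the inspected window, so the evasion rate is 1.0; a real guard model fails for subtler reasons, but the structure of the attack is the same.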
Examples: See the repository [link to repository containing the code and examples]. The paper demonstrates successful jailbreaks across models such as Llama 2, Vicuna, and GPT-3.5, with attack success rates as high as 91% in white-box and 88% in zero-access scenarios. Specific prompts and responses appear in the paper's figures (Figures 4-7).
Impact: The vulnerability allows adversaries to bypass safety mechanisms in guardrail LLMs, leading to the generation of harmful content (including hate speech, violence, and criminal activity planning) even when the base LLM is protected by a secondary guard model. The attack remains effective without any access to the guard model's internal workings.
Affected Systems: Large language models (LLMs) employing a guard model architecture for safety purposes. Specifically, the research demonstrates vulnerabilities in Llama 2, Vicuna, WizardLM, Guanaco, GPT-3.5, and Gemini. The impact likely extends to other LLMs using similar guardrail designs.
Mitigation Steps:
- Improved Guard Model Design: Develop more robust guard models that are less susceptible to universal adversarial attacks. This could involve exploring different model architectures, training methodologies, or input feature representations.
- Adversarial Training: Train guard models using adversarial examples generated through techniques like PRP, making them more resilient to similar attacks in production.
- Output Sanitization: Implement more sophisticated output sanitization techniques beyond simple keyword filtering to detect and remove harmful content.
- Ensemble Methods: Employ multiple diverse guard models and aggregate their safety assessments to reduce reliance on a single model (see the sketch after this list).
- Enhanced Prompt Filtering: Implement more robust input prompt filtering to detect and block malicious prompts designed to exploit this vulnerability.
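As a rough illustration of the ensemble idea (not drawn from the paper), the sketch below aggregates verdicts from several hypothetical guard callables; with a single-vote threshold, a universal prefix would have to fool every guard at once.

```python
from typing import Callable, Sequence

# Hypothetical interface: True means "flagged as unsafe".
GuardFn = Callable[[str], bool]


def ensemble_flags(guards: Sequence[GuardFn], text: str, min_votes: int = 1) -> bool:
    """Aggregate verdicts from several diverse guard models. With min_votes=1
    (logical OR), an adversarial prefix must evade every guard simultaneously
    to go undetected."""
    votes = sum(1 for guard in guards if guard(text))
    return votes >= min_votes


if __name__ == "__main__":
    # Toy guards standing in for diverse moderation models or heuristic filters.
    guards = [
        lambda t: "attack" in t.lower(),        # keyword-based filter
        lambda t: t.lstrip().startswith("<<"),  # prefix-anomaly heuristic
        lambda t: len(t) > 4000,                # length/format check
    ]
    print(ensemble_flags(guards, "<<odd prefix>> otherwise benign-looking text"))  # True
```

Diversity matters here: if the guards share architecture and training data, a single universal prefix is more likely to transfer across all of them.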
© 2025 Promptfoo. All rights reserved.