LLM Guardrail Evasion
Research Paper
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails
View PaperDescription: Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
Examples:
- Character Injection: Injecting "zero-width space" characters into a prompt ("Hello").
- AML Evasion: Substituting words with semantically similar synonyms to alter word embeddings and evade detection. Specific examples of successful evasion using various AML methods against multiple guardrails are detailed in the referenced paper. See arXiv:2405.18540.
Impact: Successful evasion of LLM guardrails allows attackers to execute malicious prompts (prompt injection) or bypass safety constraints (jailbreak), leading to unintended LLM behavior, data leakage, or the generation of harmful content.
Affected Systems: Large Language Models protected by various guardrail systems, including (but not limited to) Microsoft Azure Prompt Shield, Meta Prompt Guard, ProtectAI Prompt Injection Detection v1 & v2, NeMo Guard Jailbreak Detect, and Vijil Prompt Injection. The vulnerability is likely present in other LLM guardrails relying on similar AI-based detection mechanisms.
Mitigation Steps:
- Enhance training data used in LLM guardrails to include a wider variety of adversarial attacks, including character injection and AML-generated examples.
- Develop more robust detection models that are less susceptible to adversarial perturbations and character obfuscation.
- Implement multiple layers of defense, combining AI-based detection with rule-based filters and other security mechanisms.
- Regularly update and retrain guardrail models to adapt to evolving attack strategies.
- Consider incorporating techniques to detect and mitigate the impact of character injection and word manipulation attempts.
© 2025 Promptfoo. All rights reserved.