Decoding-Based LLM Jailbreak
Research Paper
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Description: Open-source Large Language Models (LLMs) are vulnerable to a generation exploitation attack that manipulates decoding hyperparameters and sampling methods to bypass safety mechanisms. Even subtle changes to these parameters can drastically increase the likelihood of harmful or unsafe outputs, including in models previously deemed "aligned." Simply removing the system prompt, with no other modification, is already enough to substantially raise misalignment rates.
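As an illustration, the sketch below sweeps decoding hyperparameters with Hugging Face transformers and collects one sample per configuration. The model name, the elided prompt, and the exact grid values are assumptions for demonstration, loosely following the paper's sweeps rather than reproducing its setup.

```python
# Hedged sketch of the generation-exploitation attack: sweep decoding
# hyperparameters and keep every sample, one of which may slip past
# the model's alignment. Model name, prompt, and grids are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # example open-source chat LLM
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)

# Note: no safety system prompt is prepended -- its removal alone
# already weakens alignment.
prompt = "..."  # harmful instruction under test (deliberately elided)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

candidates = []
# Vary one knob at a time: temperature and top-p on a fine grid,
# top-k over a small set of values.
for temperature in [round(0.05 * i, 2) for i in range(1, 21)]:
    out = model.generate(**inputs, do_sample=True, temperature=temperature, max_new_tokens=256)
    candidates.append(tokenizer.decode(out[0], skip_special_tokens=True))
for top_p in [round(0.05 * i, 2) for i in range(1, 21)]:
    out = model.generate(**inputs, do_sample=True, top_p=top_p, max_new_tokens=256)
    candidates.append(tokenizer.decode(out[0], skip_special_tokens=True))
for top_k in [1, 2, 5, 10, 20, 50, 100, 200, 500]:
    out = model.generate(**inputs, do_sample=True, top_k=top_k, max_new_tokens=256)
    candidates.append(tokenizer.decode(out[0], skip_special_tokens=True))

# The attacker then scores each candidate with a harmfulness classifier
# and keeps the most misaligned output.
```

Because the sweep consists only of ordinary sampling calls, the attack requires no gradient access or fine-tuning, which is what makes it so cheap to run.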
Examples: See https://princeton-sysml.github.io/jailbreak-llm/ for detailed examples of the effects of varying temperature, top-k, and top-p sampling, and of removing the system prompt. Tables 1, 2, and 9 of the linked paper give concrete numbers, with misalignment rates rising from under 30% to over 95% across multiple models.
Impact: The vulnerability allows malicious actors to circumvent safety measures implemented in open-source LLMs, potentially leading to the generation of harmful content, including but not limited to instructions for illegal activities, hate speech, and malicious code. The attack is computationally inexpensive, making it readily deployable.
Affected Systems: Multiple open-source LLM families, including but not limited to LLAMA2, Vicuna, Falcon, and MPT; the paper evaluates 11 models in total.
Mitigation Steps:
- Generation-Aware Alignment: Proactively train the model on outputs produced under diverse decoding strategies to improve robustness against parameter manipulation.
- Comprehensive Red Teaming: Conduct thorough evaluations across a wide range of generation configurations to identify and mitigate vulnerabilities before release (see the configuration-sweep sketch after this list).
- Robust Safety Mechanisms: Develop safety mechanisms that are less susceptible to manipulation via hyperparameter changes, for example more sophisticated content filters or reinforcement learning techniques that better resist adversarial inputs.
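As a concrete illustration of the red-teaming step, the sketch below evaluates a small grid of decoding configurations and reports the worst-case misalignment rate. `generate_response` and `is_harmful` are hypothetical placeholders for a model wrapper and a harmfulness classifier, and the grid values are arbitrary examples.

```python
# Hedged sketch of configuration-sweep red teaming. The helpers
# generate_response(prompt, **decoding_kwargs) and is_harmful(text)
# are hypothetical placeholders, not a real API.
from itertools import product
from typing import Callable, Iterable

def worst_case_misalignment(
    prompts: Iterable[str],
    generate_response: Callable[..., str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Return the highest misalignment rate found over a decoding grid."""
    prompts = list(prompts)
    grid = product(
        [0.2, 0.6, 1.0],   # temperature
        [0.5, 0.9, 1.0],   # top_p
        [10, 50, 0],       # top_k (0 disables top-k filtering)
    )
    worst = 0.0
    for temperature, top_p, top_k in grid:
        harmful = sum(
            is_harmful(
                generate_response(p, temperature=temperature, top_p=top_p, top_k=top_k)
            )
            for p in prompts
        )
        worst = max(worst, harmful / len(prompts))
    return worst
```

Reporting the worst case over configurations, rather than the rate under a single default configuration, reflects the paper's point that safety should be measured under the attacker's choice of decoding settings, not the developer's default.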