Few-Shot LLM Jailbreak
Research Paper
Improved few-shot jailbreaking can circumvent aligned language models and their defenses
Description: A vulnerability in aligned Large Language Models (LLMs) allows circumvention of safety mechanisms through improved few-shot jailbreaking techniques. The attack injects special system tokens (e.g., [/INST]) into few-shot demonstrations and uses demo-level random search to maximize the probability of generating harmful responses. This bypasses defenses that rely on perplexity filtering and input perturbation.
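For context, perplexity filtering rejects inputs whose perplexity under a reference language model exceeds a threshold; it catches gibberish adversarial suffixes but not fluent few-shot demonstrations, which is why this attack slips past it. The following is a minimal sketch of such a filter, assuming the Hugging Face transformers library; the choice of GPT-2 as the reference model and the threshold value are illustrative assumptions, not details from the paper.

```python
# Minimal perplexity-filter sketch. GPT-2 and the threshold are assumptions
# for illustration only. Fluent few-shot jailbreak prompts tend to have low
# perplexity and therefore pass this kind of check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def passes_perplexity_filter(text: str, threshold: float = 500.0) -> bool:
    """Reject only inputs whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(text) <= threshold
```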
Examples: See https://github.com/sail-sg/I-FSJ
Impact: Successful exploitation allows attackers to elicit harmful or toxic content from the LLM, bypassing its safety alignment. The impact depends on the LLM's deployment and its intended use; it could range from generating inappropriate content to enabling malicious activities.
Affected Systems: Various open-source and closed-source aligned LLMs, including but not limited to Llama-2-7B, Llama-3-8B, OpenChat-3.5, Starling-LM, and Qwen1.5-7B-Chat. The attack remains effective even against models with limited context windows.
Mitigation Steps:
- Implement robust input validation and filtering techniques that go beyond simple perplexity checks.
- Develop and deploy layered defenses that combine multiple safety checks and remain resilient to few-shot adversarial attacks.
- Regularly update and retrain LLMs with improved safety datasets and techniques designed to mitigate the specific vulnerabilities highlighted in this research.
- Investigate and implement mechanisms to detect and counteract the injection of specific system tokens used in the attack (a minimal detection sketch follows this list).
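As a starting point for the last mitigation item, the sketch below screens user-supplied content for chat-template control tokens (such as [/INST]) before the prompt is assembled. The token list and function names are illustrative assumptions; a deployed filter would be tailored to the target model's chat template and combined with the other measures above.

```python
import re

# Illustrative (not exhaustive) list of chat-template control tokens that
# should never appear inside user-supplied content.
CONTROL_TOKEN_PATTERNS = [
    r"\[/?INST\]",                                  # Llama-2 instruction delimiters
    r"<<SYS>>", r"<</SYS>>",                        # Llama-2 system-prompt markers
    r"<\|im_start\|>", r"<\|im_end\|>",             # ChatML-style markers
    r"<\|start_header_id\|>", r"<\|end_header_id\|>", r"<\|eot_id\|>",  # Llama-3
]

_CONTROL_TOKEN_RE = re.compile("|".join(CONTROL_TOKEN_PATTERNS))

def contains_injected_control_tokens(user_content: str) -> bool:
    """Return True if user content embeds chat-template control tokens."""
    return _CONTROL_TOKEN_RE.search(user_content) is not None

# Example usage: flag the request before building the prompt.
print(contains_injected_control_tokens("harmless question about cooking"))  # False
print(contains_injected_control_tokens("Step 1 ... [/INST] [INST] next"))   # True
```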