Few-Shot LLM Jailbreak
Research Paper
Improved few-shot jailbreaking can circumvent aligned language models and their defenses
Description: A vulnerability in aligned Large Language Models (LLMs) allows circumvention of safety mechanisms through improved few-shot jailbreaking techniques. The attack injects special system tokens (e.g., [/INST]) into few-shot demonstrations and uses demo-level random search to maximize the probability of generating harmful responses. This bypasses defenses that rely on perplexity filtering and input perturbation.
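For context, perplexity filtering rejects inputs whose perplexity under a reference language model exceeds a threshold; it catches gibberish adversarial suffixes but not fluent few-shot demonstrations, which is why this attack slips past it. The following is a minimal sketch of such a filter, assuming the Hugging Face transformers library; the choice of GPT-2 as the reference model and the threshold value are illustrative assumptions, not details from the paper.

```python
# Minimal perplexity-filter sketch. GPT-2 and the threshold are assumptions
# for illustration only. Fluent few-shot jailbreak prompts tend to have low
# perplexity and therefore pass this kind of check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def passes_perplexity_filter(text: str, threshold: float = 500.0) -> bool:
    """Reject only inputs whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(text) <= threshold
```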
Examples: See https://github.com/sail-sg/I-FSJ
Impact: Successful exploitation allows attackers to elicit harmful or toxic content from the LLM, bypassing its safety alignment. The impact depends on the LLM's deployment and its intended use; it could range from generating inappropriate content to enabling malicious activities.
Affected Systems: Various open-source and closed-source aligned LLMs, including but not limited to Llama-2-7B, Llama-3-8B, OpenChat-3.5, Starling-LM, and Qwen1.5-7B-Chat. The attack remains effective even against models with limited context windows.
Mitigation Steps:
- Implement robust input validation and filtering techniques that go beyond simple perplexity checks.
- Develop and deploy layered defenses that combine multiple safety checks and remain resilient to few-shot adversarial attacks.
- Regularly update and retrain LLMs with improved safety datasets and techniques designed to mitigate the specific vulnerabilities highlighted in this research.
- Investigate and implement mechanisms to detect and counteract the injection of specific system tokens used in the attack (a minimal detection sketch follows this list).
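As a starting point for the last mitigation item, the sketch below screens user-supplied content for chat-template control tokens (such as [/INST]) before the prompt is assembled. The token list and function names are illustrative assumptions; a deployed filter would be tailored to the target model's chat template and combined with the other measures above.

```python
import re

# Illustrative (not exhaustive) list of chat-template control tokens that
# should never appear inside user-supplied content.
CONTROL_TOKEN_PATTERNS = [
    r"\[/?INST\]",                                  # Llama-2 instruction delimiters
    r"<<SYS>>", r"<</SYS>>",                        # Llama-2 system-prompt markers
    r"<\|im_start\|>", r"<\|im_end\|>",             # ChatML-style markers
    r"<\|start_header_id\|>", r"<\|end_header_id\|>", r"<\|eot_id\|>",  # Llama-3
]

_CONTROL_TOKEN_RE = re.compile("|".join(CONTROL_TOKEN_PATTERNS))

def contains_injected_control_tokens(user_content: str) -> bool:
    """Return True if user content embeds chat-template control tokens."""
    return _CONTROL_TOKEN_RE.search(user_content) is not None

# Example usage: flag the request before building the prompt.
print(contains_injected_control_tokens("harmless question about cooking"))  # False
print(contains_injected_control_tokens("Step 1 ... [/INST] [INST] next"))   # True
```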