LMVD-ID: 8d5ca3fa
Published June 1, 2024

Few-Shot LLM Jailbreak

Affected Models:llama-2-7b, llama-3-8b, mistral-7b, qwen1.5-7b-chat, openchat-3.5, starling-lm-7b, gpt-4

Research Paper

Improved few-shot jailbreaking can circumvent aligned language models and their defenses

View Paper

Description: A vulnerability in aligned Large Language Models (LLMs) allows circumvention of safety mechanisms through improved few-shot jailbreaking techniques. The attack leverages injection of special system tokens (e.g., [/INST]) into few-shot demonstrations and demo-level random search to optimize the probability of generating harmful responses. This bypasses defenses that rely on perplexity filtering and input perturbation.

Examples: See https://github.com/sail-sg/I-FSJ

Impact: Successful exploitation allows attackers to elicit harmful or toxic content from the LLM, bypassing its safety alignment. The impact depends on the LLM's deployment and its intended use; it could range from generating inappropriate content to enabling malicious activities.

Affected Systems: Various open-source and closed-source aligned LLMs, including but not limited to Llama-2-7B, Llama-3-8B, OpenChat-3.5, Starling-LM, and Qwen1.5-7B-Chat. The vulnerability is particularly effective against models with limited context windows.

Mitigation Steps:

  • Implement robust input validation and filtering techniques that go beyond simple perplexity checks.
  • Develop and deploy more sophisticated defenses such as those utilizing multiple layers of safety checks and incorporating techniques resilient to few-shot adversarial attacks.
  • Regularly update and retrain LLMs with improved safety datasets and techniques designed to mitigate the specific vulnerabilities highlighted in this research.
  • Investigate and implement mechanisms to detect and counteract the injection of specific system tokens used in the attack.

© 2025 Promptfoo. All rights reserved.