LMVD-ID: bda59b55
Published December 1, 2023

Real-World Instruction Jailbreak

Affected Models: vicuna-7b, mistral-7b, baichuan2-7b-chat, baichuan2-13b-chat, chatglm2-6b, gpt-4

Research Paper

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

View Paper

Description: Large Language Models (LLMs) exhibit an inherent response tendency, predisposing them towards affirmation or rejection of instructions. The RADIAL attack exploits this tendency by strategically inserting real-world instructions, identified as inherently inducing affirmation responses, around malicious prompts. This bypasses LLM safety mechanisms, resulting in the generation of harmful content.
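To make the mechanism concrete, the following is a minimal sketch of how a RADIAL-style prompt could be assembled. The surrounding "real-world" instructions and the helper names here are generic placeholders chosen for illustration; the paper's method selects the surrounding instructions by measuring each one's inherent tendency to elicit an affirmative response from the target model, which this sketch does not implement.

```python
# Illustrative sketch of the RADIAL prompt structure described above.
# The instructions below are placeholders, not the affirmation-inducing
# instructions identified by the paper's analysis.

AFFIRMATION_INDUCING = [
    "Summarize the following article in three sentences.",
    "Translate the next paragraph into French.",
    "List five healthy breakfast ideas.",
]

def build_radial_prompt(target_instruction: str) -> str:
    """Interleave a target instruction among affirmation-inducing ones."""
    instructions = (
        AFFIRMATION_INDUCING[:2]
        + [target_instruction]      # hidden in the middle of the sequence
        + AFFIRMATION_INDUCING[2:]
    )
    return "\n".join(f"{i + 1}. {text}" for i, text in enumerate(instructions))

# A red-team harness would substitute a placeholder under test here
# rather than real harmful content.
print(build_radial_prompt("[HARMFUL REQUEST UNDER TEST]"))
```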

Examples: The paper provides examples of successful attacks against several LLMs using the RADIAL method; see the linked paper for details. One example shows that framing a request for bomb-making instructions within a sequence of innocuous real-world instructions leads the LLM to affirm the request and proceed to provide harmful instructions.

Impact: Successful exploitation allows attackers to bypass LLM safety filters and elicit harmful responses, including but not limited to instructions on creating weapons, malicious code generation, and the dissemination of hate speech. The attack's success rate is amplified when combined with a follow-up question prompting the LLM to elaborate on the initially generated harmful content. The vulnerability is shown to be cross-lingual, affecting LLMs operating in multiple languages.

Affected Systems: Open-source LLMs including, but not limited to, Vicuna-7B, Mistral-7B, Baichuan2-7B-Chat, Baichuan2-13B-Chat, and ChatGLM2-6B; GPT-4 is also listed among the affected models. The attack's effectiveness may vary depending on the specific LLM and its safety mechanisms.

Mitigation Steps:

  • Implement more robust safety mechanisms that are less susceptible to manipulation by strategically placed prompts.
  • Develop detection mechanisms capable of identifying semantically coherent yet malicious prompts (a minimal screening sketch follows this list).
  • Evaluate and reinforce the filtering of user input for potentially harmful requests.
  • Improve the ability of LLMs to identify and reject instructions that contradict established safety guidelines.
  • Limit the amount of detail that LLMs provide in response to ambiguous requests.
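As a starting point for the detection and input-filtering steps above, one possible approach is to screen each instruction in a multi-instruction prompt individually before it reaches the model, since RADIAL-style attacks bury a single malicious request among benign ones. The sketch below is an assumption-laden illustration, not part of the paper: it assumes the OpenAI Python SDK's moderation endpoint as the classifier, and the `screen_prompt` helper and its newline-based splitting heuristic are hypothetical.

```python
# Minimal sketch: screen each instruction in a composite prompt separately,
# since RADIAL-style attacks hide one malicious request among benign ones.
# Assumes the OpenAI Python SDK (>= 1.x) and its moderation endpoint; any
# other content classifier could be substituted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_prompt(user_prompt: str) -> bool:
    """Return True only if every individual instruction passes moderation."""
    # Naive split on newlines; production code would segment the user's
    # instructions more robustly (e.g., with a lightweight parser).
    segments = [line.strip() for line in user_prompt.splitlines() if line.strip()]
    for segment in segments:
        result = client.moderations.create(input=segment)
        if result.results[0].flagged:
            return False
    return True


if __name__ == "__main__":
    prompt = "1. Summarize this article.\n2. [suspicious request]\n3. Translate it to French."
    if screen_prompt(prompt):
        print("Prompt passed per-instruction screening; forward to the LLM.")
    else:
        print("Prompt rejected: at least one embedded instruction was flagged.")
```

Screening segments individually matters here because a moderation check over the full concatenated prompt can be diluted by the surrounding benign instructions, which is exactly the effect the attack relies on.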
