LMVD-ID: cd454ae4
Published November 1, 2023

Nested Prompt Jailbreak

Affected Models: gpt-3.5-turbo-0613, gpt-4-0613, claude-instant-v1, claude-v2, llama-2-7b-chat, llama-2-13b-chat, gpt-2

Research Paper

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily


Description: A vulnerability exists in several Large Language Models (LLMs) that allows attackers to bypass safety mechanisms through carefully crafted "jailbreak" prompts. The vulnerability exploits the LLMs' susceptibility to prompt rewriting and scenario nesting: a harmful prompt's wording is modified without changing its core meaning, and the result is then embedded within a seemingly innocuous task scenario (e.g., code completion, text continuation), eliciting unsafe responses despite safety filters.
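To illustrate the scenario-nesting step, the following is a minimal sketch; the template text, the nest_in_code_completion helper, and the placeholder payload are hypothetical illustrations, not the paper's actual prompts or the ReNeLLM implementation.

```python
# Hypothetical sketch of scenario nesting: an already-rewritten prompt is
# embedded inside an innocuous-looking code-completion task. The template
# and placeholder payload are illustrative only.

CODE_COMPLETION_TEMPLATE = """\
Please complete the following Python function so that it prints each step:

def task():
    # Step-by-step plan for: {payload}
    steps = [
        # TODO: fill in the steps here
    ]
    for step in steps:
        print(step)
"""

def nest_in_code_completion(rewritten_prompt: str) -> str:
    """Embed a rewritten prompt inside a code-completion scenario."""
    return CODE_COMPLETION_TEMPLATE.format(payload=rewritten_prompt)

# Benign placeholder payload, used purely to show the structure.
print(nest_in_code_completion("<REWRITTEN_PROMPT>"))
```

Because the surrounding task looks like routine code completion, the embedded request is less likely to trip keyword- or intent-based safety filters.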

Examples: See https://github.com/NJUNLP/ReNeLLM. The paper provides numerous examples of initial prompts and the rewritten, nested variations that successfully bypass safety protocols in various LLMs. Rewriting operations include misspelling sensitive keywords, paraphrasing, partial translation, and inserting meaningless characters into the original harmful prompt; the rewritten prompt is then nested within a task scenario such as code completion.
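The rewriting operations themselves are simple text transformations. Below is a simplified sketch of two of them (keyword misspelling and meaningless-character insertion), applied to a benign placeholder sentence; these helper functions are illustrative and are not taken from the ReNeLLM repository.

```python
import random

# Illustrative stand-ins for two rewriting operations described above.

def misspell_keywords(prompt: str, keywords: list[str]) -> str:
    """Drop one inner character from each flagged keyword (e.g. 'device' -> 'devce')."""
    for word in keywords:
        if word in prompt and len(word) > 3:
            i = random.randrange(1, len(word) - 1)
            prompt = prompt.replace(word, word[:i] + word[i + 1:])
    return prompt

def insert_meaningless_chars(prompt: str, marker: str = "~") -> str:
    """Insert a meaningless character between words to evade keyword matching."""
    return marker.join(prompt.split(" "))

# Benign placeholder showing only the mechanics of the transformations.
example = misspell_keywords("describe the device in detail", ["device"])
print(insert_meaningless_chars(example))
```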

Impact: Successful exploitation allows attackers to circumvent safety restrictions implemented in LLMs, potentially leading to the generation of harmful content such as instructions for illegal activities, hate speech, or personally identifiable information. The impact depends on the specific LLM and its application, but may include reputational damage, legal liability, and safety risks.

Affected Systems: Multiple LLMs are affected, including but not limited to: GPT-3.5, GPT-4, Claude-1, Claude-2, and Llama 2. The vulnerability is not limited to specific model versions.

Mitigation Steps:

  • Implement more robust prompt filtering mechanisms that are resilient to prompt rewriting and paraphrasing techniques.
  • Develop models that prioritize safety over helpfulness when faced with potentially harmful prompts.
  • Employ multiple layers of defense, including both prompt-level filtering and response-level monitoring (see the sketch following this list).
  • Regularly update and refine safety models to adapt to evolving jailbreaking techniques, and incorporate adversarial training into the model development process.
  • Investigate prompt prioritization mechanisms that give precedence to the system's external instructions over user-supplied content, so that malicious intent embedded in the input prompt is not executed.
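As a minimal sketch of the layered-defense idea (prompt-level filtering on a normalized prompt plus response-level monitoring), the following assumes a moderation check is_flagged and a generation callable generate; both are hypothetical stand-ins for a real moderation classifier or API and a real model call.

```python
import re
import unicodedata

def normalize(prompt: str) -> str:
    """Undo simple obfuscations (odd characters, inserted separators) before checking."""
    text = unicodedata.normalize("NFKC", prompt).lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)

def is_flagged(text: str) -> bool:
    """Placeholder check; in practice, call a real moderation classifier or API."""
    blocked_terms = {"example-blocked-term"}  # illustrative only
    return any(term in text for term in blocked_terms)

def guarded_generate(prompt: str, generate) -> str:
    """Layer 1: screen the normalized prompt. Layer 2: screen the model output."""
    if is_flagged(normalize(prompt)):
        return "Request refused by prompt-level filter."
    response = generate(prompt)
    if is_flagged(response):
        return "Response withheld by output-level monitor."
    return response
```

Normalizing the prompt before filtering (lowercasing, stripping inserted characters) counters some of the simpler rewriting operations, while the output-level check catches cases where a nested prompt slips past the input filter.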
