LMVD-ID: ef97e09b
Published June 1, 2024

Knowledge-Based LLM Jailbreak

Affected Models:llama2-7b-chat, llama2-13b-chat, vicuna7b-v1.5, mistral-7b-instruct, gpt3.5-turbo-1106, gpt4-1106, lawchat-7b, financechat-7b, llama2-7b

Research Paper

Knowledge-to-jailbreak: One knowledge point worth one attack

View Paper

Description: Large Language Models (LLMs) are vulnerable to knowledge-based jailbreaks, where an attacker provides domain-specific knowledge to elicit harmful or unintended outputs. The vulnerability stems from the LLM's ability to process and respond to knowledge inputs in a way that circumvents safety mechanisms, even if the input knowledge itself isn't inherently malicious. Attackers leverage this by constructing prompts that combine seemingly innocuous knowledge with subtly manipulative phrasing to bypass safety filters.

Examples: See https://github.com/THU-KEG/Knowledge-to-Jailbreak/ for dataset examples. The paper provides examples demonstrating how seemingly harmless knowledge about Toutiao (a Chinese social media platform) can be combined with suggestive phrasing to prompt an LLM to generate harmful responses.

Impact: Successful exploitation allows attackers to circumvent LLMs' built-in safety measures, leading to the generation of harmful content, including but not limited to: hate speech, instructions for illegal activities, harmful misinformation, and personally identifying information. This compromises the trustworthiness and safety of the LLM, potentially causing reputational damage to the vendor and harm to individuals.

Affected Systems: This vulnerability affects a wide range of LLMs, including both open-source and commercially available models. The paper demonstrates the vulnerability in several models, including Llama2, Vicuna, and GPT-3.5/GPT-4. The exact level of susceptibility may vary between different models and their safety training.

Mitigation Steps:

  • Improved safety training: Enhance safety training data to include a wider range of knowledge-based attack scenarios.
  • Prompt engineering defenses: Develop more robust prompt engineering techniques to detect and mitigate knowledge-based attacks.
  • Knowledge filtering: Implement mechanisms to filter or identify potentially dangerous knowledge inputs before they reach the LLM.
  • Output filtering: Enhance output filters to better detect and block harmful or unintended responses generated due to these exploits.
  • Adversarial training: Train LLMs on adversarial examples to increase their resilience to knowledge-based attacks.

© 2025 Promptfoo. All rights reserved.