LMVD-ID: 6a4c699a
Published February 1, 2024

RAG Poisoning Jailbreak

Affected Models: gpt-3.5, gpt-4, mistral-7b

Research Paper

Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning


Description: Large Language Models (LLMs) that use Retrieval Augmented Generation (RAG) are vulnerable to a novel attack, termed "RAG Poisoning," in which an attacker injects malicious content into the external knowledge base the LLM consults and then crafts prompts that cause that content to be retrieved. Because the LLM treats retrieved passages as authoritative context during response generation, the poisoned material can steer it into producing undesirable or malicious outputs that bypass its safety filters.
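To make the trust boundary concrete, below is a minimal, illustrative sketch (not taken from the paper) of how a typical RAG pipeline splices retrieved passages into the model's prompt. The names `knowledge_base`, `retrieve`, and `build_prompt`, and the toy keyword retriever, are assumptions for illustration only; the point is that any document an attacker manages to place in the knowledge base flows through this same path with nothing marking it as less trustworthy than legitimate content.

```python
# Minimal sketch of RAG prompt assembly (illustrative, not the paper's code).
# Retrieved passages are concatenated verbatim into the prompt, which is the
# trust boundary that RAG Poisoning exploits.

knowledge_base = [
    "Acme's refund policy allows returns within 30 days.",
    "Shipping to EU countries takes 5-7 business days.",
    # A poisoned upload would sit here alongside legitimate documents and be
    # retrieved whenever the attacker's crafted prompt matches it.
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy keyword retriever standing in for an embedding-based vector search."""
    scored = sorted(docs, key=lambda d: -sum(w in d.lower() for w in query.lower().split()))
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Retrieved text is spliced in verbatim; nothing distinguishes trusted
    # documentation from attacker-supplied content at this point.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_prompt("What is the refund policy?", knowledge_base))
```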

Examples: See https://sites.google.com/view/pandora-llm-jailbreak for details and video proof-of-concept; the paper describes specific examples of malicious content generation, document creation (in PDF format to evade filtering), and prompt crafting to trigger the malicious responses. The attack targets OpenAI's GPT-3.5 and GPT-4 models.

Impact: Successful RAG Poisoning attacks can lead to the generation of harmful, abusive, illegal, or privacy-violating content by the LLM. This compromises the safety and reliability of the LLM and any applications using it. The attack achieves higher success rates than direct prompt injection attacks (64.3% for GPT-3.5 and 34.8% for GPT-4 in the paper's experiments).

Affected Systems: LLMs (specifically OpenAI's GPT-3.5 and GPT-4) that use Retrieval Augmented Generation (RAG) and allow user-uploaded documents to be included in the knowledge base.

Mitigation Steps:

  • Input Sanitization: Implement robust sanitization and validation of all uploaded documents before they are added to the RAG knowledge base, focusing on detecting and removing malicious content regardless of file type (a minimal ingestion-gate sketch follows this list).
  • Content Filtering: Strengthen content filtering mechanisms within the RAG system to detect and block malicious content before it can be retrieved. Consider using AI-driven content filtering methods.
  • Access Control: Restrict access to the knowledge base and uploading capabilities to trusted users and sources.
  • Prompt Monitoring: Monitor and analyze incoming prompts for patterns indicative of malicious intent, such as prompts that explicitly direct the model to a particular uploaded document or instruct it to disregard its guidelines.
  • Regular Updates: Keep the LLM and its associated components up to date with the latest security patches and updated safety models.
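The first four mitigations can be combined into an ingestion gate that screens uploads before they ever reach the retriever. The following is a minimal sketch under assumed names (`extract_text`, `is_safe_document`, `ingest`, `flag_prompt`); the keyword blocklist is a stand-in for the AI-driven content filtering and prompt classification a production system would use, and PDF/HTML parsing is deliberately omitted.

```python
# Illustrative defensive sketch, not a production implementation.
import re

# Stand-in pattern list; replace with a moderation model or trained classifier.
BLOCKLIST = re.compile(
    r"(ignore (all|previous) instructions|disregard your guidelines)", re.I
)

def extract_text(upload: bytes) -> str:
    """Placeholder: decode the upload to plain text (PDF/HTML parsing omitted)."""
    return upload.decode("utf-8", errors="ignore")

def is_safe_document(text: str) -> bool:
    # Content filtering: reject documents containing instruction-injection
    # patterns before they can ever be retrieved.
    return not BLOCKLIST.search(text)

def ingest(upload: bytes, knowledge_base: list[str], trusted_uploader: bool) -> bool:
    # Access control: only trusted sources may add to the knowledge base.
    if not trusted_uploader:
        return False
    text = extract_text(upload)
    # Input sanitization / content filtering before inclusion.
    if not is_safe_document(text):
        return False
    knowledge_base.append(text)
    return True

def flag_prompt(prompt: str) -> bool:
    """Prompt monitoring: flag prompts that try to steer the model off its guidelines."""
    return bool(BLOCKLIST.search(prompt))
```

In practice, the same classifier used at ingestion time can also score incoming prompts, so that a flagged prompt paired with a recently uploaded document is surfaced for human review rather than silently answered.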

© 2025 Promptfoo. All rights reserved.