RAG Worm Jailbreak
Research Paper
Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
Description: Jailbreaking vulnerabilities in Large Language Models (LLMs) used in Retrieval-Augmented Generation (RAG) systems allow attacks to escalate from entity extraction to full document extraction, and enable self-replicating malicious prompts ("worms") to propagate across interconnected RAG applications. Exploitation relies on prompt injection that coerces the LLM into returning its retrieved documents verbatim or executing malicious actions specified in the injected prompt.
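For orientation, here is a minimal sketch of the generic RAG inference loop that both attack classes exploit. `embed`, `vector_db`, and `llm` are hypothetical placeholders for an embedding model, vector store, and LLM client, not APIs from the referenced paper.

```python
# Minimal sketch of the RAG attack surface, not the paper's implementation.
# `embed`, `vector_db`, and `llm` are hypothetical stand-ins for whatever
# embedding model, vector store, and LLM client a real RAG app uses.

def rag_answer(user_query: str, embed, vector_db, llm, k: int = 5) -> str:
    """Answer a query by stuffing retrieved documents into the LLM prompt."""
    query_vec = embed(user_query)
    retrieved_docs = vector_db.similarity_search(query_vec, top_k=k)

    # Injection point: retrieved text is concatenated into the prompt with
    # the same authority as the system instructions. A jailbreak hidden in
    # any retrieved document (or in the query itself) can ask the model to
    # echo the documents verbatim or take other malicious actions.
    context = "\n---\n".join(doc.text for doc in retrieved_docs)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )
    return llm.generate(prompt)
```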
Examples:
- Document extraction: A jailbreaking prompt combined with a crafted suffix that targets specific document embeddings causes the RAG system to return the full contents of the targeted documents from its database (a toy illustration of the embedding-steering idea appears after these examples). Specific examples are detailed in Section 3.4 of the referenced paper. See arXiv:2405.18540.
- Worm propagation: An adversarial self-replicating prompt embedded in an email causes the LLM in a connected email client to generate replies containing both the prompt and confidential user data, propagating the attack to other connected clients. Specific examples are provided in Section 4.4 of the referenced paper. See arXiv:2405.18540.
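The document-extraction attack hinges on a suffix optimized so that the query's embedding lands near the target documents in retrieval space. The toy NumPy snippet below illustrates only that geometric effect, using random vectors as stand-ins for real embeddings and a simple convex blend in place of the paper's actual suffix optimization over tokens.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 64
target_doc = rng.normal(size=dim)    # embedding of the document to extract
benign_query = rng.normal(size=dim)  # embedding of an innocuous query

# An attacker-optimized suffix shifts the query embedding toward the target
# document; here a convex blend stands in for the real discrete optimization
# over suffix tokens described in the paper.
alpha = 0.9
adversarial_query = (1 - alpha) * benign_query + alpha * target_doc

print(f"benign query vs target:      {cosine(benign_query, target_doc):+.3f}")
print(f"adversarial query vs target: {cosine(adversarial_query, target_doc):+.3f}")
```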
Impact:
- Data Breach: Full document extraction exposes confidential data and intellectual property at scale.
- Ecosystem Compromise: Worm propagation can compromise multiple interconnected RAG applications, causing widespread data exfiltration and malicious actions (e.g., disinformation campaigns).
- Service Disruption: The attacks can disrupt service availability through data poisoning or resource exhaustion.
Affected Systems: RAG-based applications built on LLMs, particularly those that actively update their databases and communicate with other applications via RAG-based inference. Examples include GenAI-powered email assistants and personal assistants. The vulnerability is amplified when applications accept direct or indirect prompt injection.
Mitigation Steps:
- Implement robust access control to the RAG database, restricting insertion to trusted sources.
- Utilize API rate limiting and thresholding on retrieval similarity scores to mitigate brute-force extraction attempts (a combined sketch of these controls and the sanitization check appears after this list).
- Employ data sanitization techniques to detect and block malicious prompts or outputs.
- Implement mechanisms to detect and prevent prompt injection attacks.
- Consider human-in-the-loop review for high-risk operations, though this is not a complete solution.
- Regularly update and patch LLMs against known jailbreaking techniques.
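Below is a minimal sketch combining three of the controls above: per-client rate limiting, similarity-score thresholding, and a verbatim-leak check on model output. The class name, thresholds, window sizes, and the sliding-window heuristic are illustrative assumptions, not defenses specified in the referenced paper.

```python
import time
from collections import defaultdict, deque

class RetrievalGuard:
    """Illustrative guard for a RAG pipeline; all thresholds are assumptions."""

    def __init__(self, min_similarity: float = 0.75,
                 max_queries: int = 30, window_s: float = 60.0):
        self.min_similarity = min_similarity
        self.max_queries = max_queries
        self.window_s = window_s
        self._history = defaultdict(deque)  # client_id -> query timestamps

    def allow_query(self, client_id: str) -> bool:
        """Per-client rate limit to slow brute-force document extraction."""
        now = time.monotonic()
        q = self._history[client_id]
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.max_queries:
            return False
        q.append(now)
        return True

    def filter_hits(self, hits):
        """Drop retrieved documents scoring below the similarity threshold,
        starving suffix-crafted queries of marginal matches."""
        return [(doc, score) for doc, score in hits if score >= self.min_similarity]

    @staticmethod
    def leaks_context(output: str, retrieved_texts, span: int = 200) -> bool:
        """Flag responses echoing long verbatim spans of retrieved documents,
        a crude signal of document extraction or of a self-replicating
        payload being copied into a reply."""
        return any(
            text[i:i + span] in output
            for text in retrieved_texts
            for i in range(0, max(1, len(text) - span + 1), span)
        )
```

In a deployment, allow_query would gate the retrieval endpoint, filter_hits would run between the vector store and the prompt builder, and leaks_context would run on every response before it is sent or written back into the database.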