LLM Memory Poisoning Attack

Description: A vulnerability in Retrieval-Augmented Generation (RAG)-based Large Language Model (LLM) agents allows attackers to inject malicious demonstrations into the agent's memory or knowledge base. By crafting a carefully optimized trigger, an attacker can manipulate the agent's retrieval mechanism to preferentially retrieve these poisoned demonstrations, causing the agent to produce adversarial outputs or take malicious actions even when seemingly benign prompts are used. The attack, termed AgentPoison, does not require model retraining or fine-tuning.

Examples: See https://github.com/BillChan226/AgentPoison. The repository contains code and data demonstrating the attack on various LLM agents, including an autonomous driving agent, a knowledge-intensive QA agent, and a healthcare agent. Specific examples of optimized triggers and poisoned demonstrations are provided for each agent.

Impact: The successful exploitation of this vulnerability can lead to a variety of harmful consequences depending on the application. Examples include:

Autonomous Driving: Causing a vehicle to execute unsafe maneuvers (e.g., sudden stops).
Knowledge-Intensive QA: Providing incorrect or misleading information.
Healthcare: Manipulating medical records or generating harmful treatment recommendations.

Affected Systems: LLM agents utilizing RAG mechanisms with vulnerable knowledge bases or memory modules. The vulnerability affects several types of RAG systems, including those trained with end-to-end and contrastive learning methods.

Mitigation Steps:

Knowledge Base Verification: Implement rigorous verification processes for all knowledge ingested by the LLM agent. This includes provenance tracking and validation using multiple sources.
Input Sanitization: Sanitize user inputs to detect and mitigate attempts to inject triggers. Advanced techniques like fuzzing or adversarial training of the agent could help.
Retrieval Robustness: Explore robust retrieval techniques that are less susceptible to adversarial manipulations of the embedding space. Strategies focusing on diversification or anomaly detection in retrieved results should be considered.
Output Monitoring: Implement monitoring systems that detect unusual outputs from the LLM agent and trigger further investigation. This could potentially include using anomaly detection techniques or comparisons against baseline performance.
Regular Audits: Conduct regular security audits of the LLM agent and its knowledge base to identify and remediate potential vulnerabilities. Assess the potential points of entry and employ a threat model to understand the attacker's capabilities.

LLM Memory Poisoning Attack

Research Paper