Poisoned RAG Steering
Research Paper
Defending against knowledge poisoning attacks during retrieval-augmented generation
Description:
Retrieval-Augmented Generation (RAG) systems are vulnerable to knowledge poisoning attacks (specifically the "PoisonedRAG" method) in which an attacker injects adversarial texts into the retrieval knowledge database. These adversarial texts are optimized to achieve two simultaneous goals: 1) rank highly (top-k) during the retrieval phase for specific target queries, and 2) semantically steer the Large Language Model (LLM) toward generating a pre-defined, attacker-chosen response instead of the ground truth. The manipulation exploits the LLM's reliance on retrieved context, allowing the attacker to override the model's internal knowledge and force the generation of false information. The attack can be mounted without access to the model weights or retriever parameters (black-box setting) or by leveraging gradient-based optimization such as HotFlip (white-box setting).
Examples:
- Target Query: "Which disease is normally caused by the human immunodeficiency virus?"
- Attack Vector: The attacker injects 5 adversarial text passages into the knowledge database. These passages are crafted to maximize similarity with the query while containing the target false answer.
- Result: When the RAG system processes the query, it retrieves the injected texts. The Generation LLM, conditioned on this poisoned context, outputs "Syphilis" instead of the correct answer "AIDS".
- See PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models (cited as [5] in the paper) for attack construction details.
Impact:
- Data Integrity Compromise: The system provides deterministically incorrect answers for specific queries, facilitating the spread of disinformation.
- Contextual Hijacking: Valid context is displaced from the top-k retrieval results by the adversarial inputs.
- High-Risk Domain Failure: In fact-sensitive applications (healthcare, finance, legal), this can lead to dangerous outcomes (e.g., misdiagnosing medical conditions as demonstrated in the examples).
Affected Systems:
- Retrieval-Augmented Generation (RAG) pipelines.
- Systems utilizing dense retrievers (e.g., Contriever, WhereIsAI/UAE-Large-V1).
- Generative models relying on external corpora (e.g., GPT-3.5, GPT-4, LLaMA-2, LLaMA-3).
Mitigation Steps:
- Implement FilterRAG (Threshold-based Filtering):
- Calculate the "Freq-Density" of retrieved texts. This metric quantifies the concentration of words relevant to the query-answer pair within the text.
- Adversarial texts typically exhibit higher Freq-Density than clean texts.
- Retrieve the top-s items (where s > k), filter out items whose Freq-Density exceeds a determined threshold $\epsilon$, and pass the remaining top-k to the generator (see the FilterRAG sketch after this list).
- Implement ML-FilterRAG (Machine Learning Filtering):
- Train a lightweight classifier (e.g., XGBoost, Random Forest) to detect adversarial texts.
- Extract statistical features from retrieved texts, including Freq-Density, Perplexity, joint log probability of a Small Language Model (SLM) output, and semantic similarity scores.
- Use the classifier to predict and discard adversarial samples from the top-s retrieved results before the generation phase (see the ML-FilterRAG sketch after this list).
- Visual Analysis: Visualize Freq-Density against Perplexity to identify distinct clusters of adversarial data points for threshold calibration (see the plotting sketch after this list).
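A minimal sketch of the FilterRAG threshold-filtering step is shown below. The paper's exact Freq-Density formula is not reproduced here; `freq_density` uses query-term concentration in the passage as a stand-in, and `retrieve_top_s`, the default `s`, `k`, and `epsilon` values are illustrative assumptions rather than the published implementation.

```python
# Sketch of FilterRAG-style threshold filtering. The Freq-Density proxy below
# (query-term concentration in the passage) and the retriever interface are
# illustrative assumptions, not the paper's exact definitions.
import re
from typing import Callable, List, Tuple


def _tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def freq_density(query: str, passage: str) -> float:
    """Fraction of passage tokens that also occur in the query (Freq-Density proxy)."""
    query_terms = set(_tokenize(query))
    passage_tokens = _tokenize(passage)
    if not passage_tokens:
        return 0.0
    return sum(tok in query_terms for tok in passage_tokens) / len(passage_tokens)


def filter_rag_context(
    query: str,
    retrieve_top_s: Callable[[str, int], List[Tuple[str, float]]],  # returns (text, retrieval_score) pairs
    s: int = 20,
    k: int = 5,
    epsilon: float = 0.35,  # threshold; calibrate on held-out clean vs. adversarial texts
) -> List[str]:
    """Retrieve top-s candidates, drop texts whose Freq-Density exceeds epsilon,
    and pass the remaining top-k (by retrieval score) to the generator."""
    candidates = retrieve_top_s(query, s)
    kept = [(text, score) for text, score in candidates
            if freq_density(query, text) <= epsilon]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in kept[:k]]
```

Because PoisonedRAG passages are optimized to pack query terms alongside the target answer, they tend to score higher on this concentration measure than clean corpus texts; the threshold $\epsilon$ should be calibrated on held-out clean and adversarial samples.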
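For ML-FilterRAG, the sketch below assumes the per-text features named above (Freq-Density, perplexity, SLM joint log probability, and query-passage similarity) have already been extracted into a feature matrix; the file names, labels, and Random Forest hyperparameters are hypothetical.

```python
# Sketch of ML-FilterRAG-style classification over pre-computed per-text features.
# Feature extraction, file names, labels, and hyperparameters are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Each row: [freq_density, perplexity, slm_joint_log_prob, query_similarity]
# Labels: 1 = adversarial (injected) text, 0 = clean corpus text.
X = np.load("retrieved_text_features.npy")  # hypothetical pre-computed feature matrix
y = np.load("retrieved_text_labels.npy")    # hypothetical labels from a poisoned/clean split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))


def drop_adversarial(passages, features, classifier=clf, k=5):
    """Discard passages the classifier flags as adversarial; keep up to top-k of the rest."""
    preds = classifier.predict(np.asarray(features))
    clean = [p for p, label in zip(passages, preds) if label == 0]
    return clean[:k]
```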
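Finally, a plotting sketch for the visual-analysis step, under the same hypothetical feature files: adversarial texts typically form a separate high-Freq-Density cluster, which helps in choosing the threshold $\epsilon$.

```python
# Sketch of the visual calibration step: scatter Freq-Density against perplexity
# to eyeball the adversarial cluster. Input files and column order are assumptions.
import matplotlib.pyplot as plt
import numpy as np

features = np.load("retrieved_text_features.npy")  # hypothetical: [freq_density, perplexity, ...]
labels = np.load("retrieved_text_labels.npy")      # hypothetical: 1 = adversarial, 0 = clean

for label, name, marker in [(0, "clean", "o"), (1, "adversarial", "x")]:
    mask = labels == label
    plt.scatter(features[mask, 0], features[mask, 1], marker=marker, label=name, alpha=0.6)

plt.xlabel("Freq-Density")
plt.ylabel("Perplexity")
plt.legend()
plt.title("Freq-Density vs. Perplexity of retrieved texts")
plt.show()
```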