LMVD-ID: 52286265
Published February 1, 2025

Topic-Flip RAG Poisoning

Affected Models: GPT-4o, Llama 3.1 8B, Qwen 2.5 7B, o4-mini

Research Paper

Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models

Description: Retrieval-Augmented Generation (RAG) systems utilizing dense retrieval mechanisms are vulnerable to topic-oriented adversarial corpus poisoning, specifically via the "Topic-FlipRAG" attack method. This vulnerability allows an attacker to manipulate the opinion or stance of the LLM's output across a broad cluster of related queries, rather than a single specific prompt. The attack leverages a two-stage pipeline: (1) Knowledge-Guided Attack, where an LLM is used to edit a target document to include key topic-related information nodes while enforcing a specific stance polarity (e.g., Pro or Con) and minimizing semantic edit distance; and (2) Adversarial Trigger Generation, which utilizes gradient-based optimization against a surrogate open-source Neural Ranking Model (NRM) to generate an adversarial token suffix. When injected into the retrieval corpus, this modified document achieves high relevance scores for the targeted topic cluster, displacing legitimate context. Consequently, the RAG system retrieves the poisoned context and generates responses aligned with the attacker's desired polarity, effectively facilitating systematic disinformation and opinion manipulation.
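
The second stage can be illustrated with a short gradient-based sketch. This is a minimal approximation, not the paper's exact optimization: it assumes a Contriever surrogate with mean pooling, and the trigger length, iteration count, and random-position greedy update are illustrative choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Surrogate open-source dense retriever (assumption: Contriever, mean-pooled).
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")
model.requires_grad_(False)  # gradients flow only to the trigger embeddings
embedding_matrix = model.get_input_embeddings().weight   # (vocab_size, dim)

def mean_pool(hidden, mask):
    return (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])

# Optimize against the whole topic cluster, not a single query.
topic_queries = [
    "Is binge-watching good for you?",
    "Binge-watching: addictive or relaxing?",
    "Opinions on binge-watching trends?",
]
query_emb = encode(topic_queries).detach()

# Stage 1 output: the LLM-edited, stance-flipped document (abbreviated here).
doc_ids = tokenizer("Binge-watching serves as a significant method for relaxation...",
                    return_tensors="pt")["input_ids"][0]
trigger_ids = torch.full((16,), tokenizer.mask_token_id, dtype=torch.long)

for _ in range(200):  # greedy coordinate ascent over trigger positions
    input_ids = torch.cat([doc_ids, trigger_ids]).unsqueeze(0)
    inputs_embeds = embedding_matrix[input_ids].detach().clone().requires_grad_(True)
    mask = torch.ones_like(input_ids)
    out = model(inputs_embeds=inputs_embeds, attention_mask=mask)
    # Maximize the document's mean relevance to every query in the cluster.
    score = (query_emb @ mean_pool(out.last_hidden_state, mask).T).mean()
    score.backward()
    grad = inputs_embeds.grad[0, len(doc_ids):]    # gradients at trigger slots
    gains = grad @ embedding_matrix.T              # first-order token scores
    pos = int(torch.randint(len(trigger_ids), (1,)))
    trigger_ids[pos] = int(gains[pos].argmax())    # flip one trigger token

doc_adv = tokenizer.decode(torch.cat([doc_ids, trigger_ids]), skip_special_tokens=True)
```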

Examples: The following example demonstrates a successful opinion manipulation attack on the topic "Is Binge-Watching Good for You?", targeting a shift toward a "Pro" (supportive) stance; a minimal retrieval check follows the walkthrough.

  1. Target Selection & Corpus Poisoning: The attacker selects a low-relevance document regarding the topic and applies the Topic-FlipRAG pipeline.
  • Stage 1 (Editing): The document is rewritten to emphasize "relaxation" and "stress relief" benefits of binge-watching while maintaining coherence.
  • Stage 2 (Triggering): A gradient-optimized token sequence (trigger) is appended to the document to maximize its retrieval rank against the topic cluster.
  • Injection: The resulting document doc_adv is inserted into the RAG knowledge base.
  2. Reproduction Query 1:
  • User Input: "Binge-watching: addictive or relaxing?"
  • Standard Response (Clean Corpus): "Binge-watching can be addictive and may lead to sleep deprivation..."
  • Compromised Response (Poisoned Corpus): "Binge-watching serves as a significant method for relaxation, helping individuals decompress after a long day..."
  3. Reproduction Query 2 (Demonstrating Topic-Level Generalization):
  • User Input: "Opinions on binge-watching trends?"
  • Compromised Response: "The trend is largely positive as it fosters shared cultural experiences and provides a necessary escape from daily stressors."
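
The retrieval-side effect above can be checked in a few lines: after injection, doc_adv should outrank legitimate context for every query in the topic cluster, not just one. This is a minimal sketch; the corpus contents, the embedding model, and the placeholder trigger suffix are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Stage 1 + Stage 2 output; "<optimized trigger suffix>" stands in for the
# real gradient-optimized token sequence.
doc_adv = ("Binge-watching serves as a significant method for relaxation, "
           "helping individuals decompress... <optimized trigger suffix>")

corpus = [
    "Binge-watching can be addictive and may lead to sleep deprivation.",
    "Streaming services have reshaped how audiences consume television.",
    doc_adv,
]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

for query in ["Binge-watching: addictive or relaxing?",
              "Opinions on binge-watching trends?"]:
    scores = util.cos_sim(retriever.encode(query, convert_to_tensor=True), corpus_emb)[0]
    print(query, "->", corpus[int(scores.argmax())][:60])
    # A successful attack surfaces doc_adv for both queries, so the generator
    # only ever sees the attacker's "Pro" framing.
```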

Reference: See Appendix Figure 10 and Section 6 of the paper for detailed case study data. The code and datasets are available at https://github.com/LauJames/Topic-FlipRAG.

Impact:

  • Opinion Manipulation: Attackers can systematically alter the model's stance on controversial topics (e.g., political, social, or health issues) across multiple user queries.
  • Disinformation Propagation: The system becomes a vector for spreading biased narratives or polarized viewpoints while maintaining the facade of impartial retrieval.
  • User Perception Shift: Experiments indicate a user polarity shift exceeding 16% after interaction with a compromised system.

Affected Systems:

  • RAG architectures utilizing dense retrieval models (e.g., Contriever, DPR, ANCE).
  • RAG implementations using LLMs for generation (e.g., Llama-3, Qwen-2.5) where the generator relies on top-k retrieved contexts without strict utility verification.

Mitigation Steps: Standard defenses (perplexity-based detection, random masking, paraphrasing, and reranking) have been shown to be ineffective against Topic-FlipRAG because the adversarial edits remain semantically coherent. The following detection strategies are recommended; an illustrative sketch of each follows the list:

  • Intra-Top-k Similarity Analysis: Implement detection mechanisms that analyze the semantic similarity within the top-k retrieved documents. Poisoned documents often exhibit low semantic coherence with other legitimate top-ranked documents despite having high relevance scores.
  • Usefulness-Based Filtering: Decouple relevance from usefulness. Deploy a secondary utility model or LLM-judge to evaluate if the retrieved document actually contains the information required to answer the query, rather than relying solely on embedding similarity.
  • TF-IDF Anomaly Detection: Monitor for elevated TF-IDF scores combined with distributional shifts. Poisoned documents frequently inject high-impact keywords to manipulate relevance, creating detectable statistical anomalies compared to natural high-ranking documents.
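
A minimal sketch of the intra-top-k similarity check, assuming a sentence-transformers encoder; the 0.35 threshold is an illustrative value that should be calibrated on known-clean retrievals.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def flag_low_coherence(top_k_docs, threshold=0.35):
    """Flag retrieved documents whose mean similarity to the rest of the
    top-k set is anomalously low despite their high retrieval rank."""
    emb = encoder.encode(top_k_docs, normalize_embeddings=True)  # unit vectors
    sim = emb @ emb.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, np.nan)        # ignore self-similarity
    coherence = np.nanmean(sim, axis=1)  # agreement with the other docs
    return [doc for doc, c in zip(top_k_docs, coherence) if c < threshold]
```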
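For usefulness-based filtering, a sketch of an LLM-judge gate using the OpenAI client; the judge model and prompt wording are illustrative assumptions, not a prescribed implementation.

```python
from openai import OpenAI

client = OpenAI()

def is_useful(query: str, document: str) -> bool:
    """Ask a judge model whether the document actually contains information
    that answers the query, independent of embedding similarity."""
    prompt = (
        "Does the following document contain information that directly "
        f"answers the question?\n\nQuestion: {query}\n\nDocument: {document}\n\n"
        "Answer strictly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Keep only retrieved contexts that pass the utility check before generation.
```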
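And a sketch of the TF-IDF anomaly check using scikit-learn; the z-score cutoff and the reference corpus of natural high-ranking documents are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def flag_tfidf_anomalies(top_k_docs, reference_docs, z_cutoff=3.0):
    """Flag documents whose peak TF-IDF weight deviates sharply from the
    distribution observed in natural high-ranking documents."""
    vec = TfidfVectorizer().fit(reference_docs + top_k_docs)
    ref_peaks = vec.transform(reference_docs).max(axis=1).toarray().ravel()
    mu, sigma = ref_peaks.mean(), ref_peaks.std() + 1e-9
    doc_peaks = vec.transform(top_k_docs).max(axis=1).toarray().ravel()
    z_scores = (doc_peaks - mu) / sigma
    return [doc for doc, z in zip(top_k_docs, z_scores) if z > z_cutoff]
```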

© 2026 Promptfoo. All rights reserved.