Adversarial In-Context Hijacking
Research Paper
Hijacking large language models via adversarial in-context learning
Description:
A vulnerability exists in large language models (LLMs) that use in-context learning (ICL). Malicious actors can inject hard-to-detect adversarial suffixes into in-context demonstrations, causing the LLM to generate targeted, unintended outputs even when the user query is benign. The attack manipulates the LLM's attention mechanism, diverting it toward the adversarial tokens.
Examples:
The attack appends adversarial suffixes (e.g., "NULL", "Remove", "For", "Location") to individual in-context demonstrations. The suffixes are semantically incongruous with the demonstrations yet hard to flag as outright errors, and the specific suffixes and their effectiveness vary with the target LLM and task. See the paper for detailed examples and experimental results. A minimal sketch of how a poisoned prompt might be assembled is shown below.
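As an illustration only (not the paper's attack code), the sketch below assembles a few-shot sentiment prompt in which one demonstration carries an appended adversarial suffix while the user query stays benign. The demonstration texts, the query, the prompt format, and the hand-written suffix string are all hypothetical; in the actual attack the suffix tokens are chosen adversarially against the target model rather than written by hand.

```python
# Illustrative sketch only: shows where an adversarial suffix would sit inside
# an in-context demonstration. All texts and the suffix are hypothetical.

ADVERSARIAL_SUFFIX = "NULL Remove For Location"  # placeholder stand-in tokens

clean_demos = [
    ("The film was a delight from start to finish.", "positive"),
    ("The plot dragged and the acting felt flat.", "negative"),
]

def build_icl_prompt(demos, query, poison_index=None, suffix=""):
    """Assemble a few-shot prompt, optionally appending a suffix to one demo."""
    blocks = []
    for i, (text, label) in enumerate(demos):
        if i == poison_index:
            text = f"{text} {suffix}"  # suffix buried inside a demonstration
        blocks.append(f"Review: {text}\nSentiment: {label}")
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

# The user query itself is benign; only one demonstration is tampered with.
poisoned_prompt = build_icl_prompt(
    clean_demos,
    query="A solid, well-paced documentary.",
    poison_index=1,
    suffix=ADVERSARIAL_SUFFIX,
)
print(poisoned_prompt)
```

Because the query is untouched, the tampering is visible only to someone who inspects the demonstrations themselves, which is what makes the injected suffixes hard to notice in practice.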
Impact:
Successful exploitation can lead to:
- Misinformation generation: LLMs can produce inaccurate or misleading information.
- Jailbreaking/bypass of safety mechanisms: LLMs can generate harmful or offensive content, circumventing built-in safeguards.
- Data poisoning: manipulated demonstrations can steer the LLM's output on a specific task, leading to incorrect downstream conclusions drawn from its responses.
Affected Systems:
Large language models employing in-context learning, including but not limited to:
- GPT-2-XL
- LLaMA-7b/13b
- OPT-2.7b/6.7b
- Vicuna-7b
Mitigation Steps:
- Enhanced input sanitization: Implement more robust methods to detect and filter potentially adversarial suffixes in demonstrations. Simple perplexity-based filters may be insufficient on their own (a toy perplexity check is sketched after this list).
- Defensive in-context demonstrations: Append or insert additional clean demonstrations at test time to counteract adversarial ones. The placement of the clean demonstrations (before or after the adversarial ones) affects effectiveness, and methods beyond simply adding clean examples may be necessary (see the second sketch after this list).
- Improved attention mechanisms: Investigate and enhance model architecture to reduce susceptibility to attention manipulation by adversarial tokens.
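For the first mitigation, the sketch below is a toy perplexity filter, not the paper's defense: each demonstration is scored with a small causal language model and unusually high-perplexity ones are dropped. It assumes the Hugging Face `transformers` and `torch` packages are available, uses GPT-2 purely as an example scorer, and the threshold is arbitrary.

```python
# Toy perplexity filter (illustrative only): score each demonstration with a
# small causal LM and drop unusually high-perplexity ones. The scoring model
# choice and threshold are assumptions made for this sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Mean per-token perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return float(torch.exp(loss))

def filter_demonstrations(demos, threshold=200.0):
    """Keep only demonstrations whose perplexity stays under the threshold."""
    kept = []
    for text, label in demos:
        if perplexity(f"Review: {text}\nSentiment: {label}") <= threshold:
            kept.append((text, label))
    return kept
```

Because suffixes can be optimized to keep perplexity low, a check like this is best treated as one signal among several, which is why the advisory flags simple perplexity filters as potentially insufficient.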
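For the second mitigation, this is a minimal sketch of test-time defensive demonstrations under the same hypothetical sentiment-task format: trusted clean examples are placed before or after the untrusted (possibly poisoned) demonstrations. The demonstration texts, the `position` parameter, and the formatting convention are assumptions made for illustration.

```python
# Minimal sketch of defensive in-context demonstrations (illustrative only):
# trusted clean examples are added around untrusted, possibly poisoned ones.

trusted_demos = [
    ("An engaging story with memorable characters.", "positive"),
    ("Tedious pacing and a forgettable script.", "negative"),
]

def format_demo(text, label):
    return f"Review: {text}\nSentiment: {label}"

def build_defended_prompt(untrusted_demos, query, position="after"):
    """Place trusted clean demonstrations before or after the untrusted block.

    The advisory notes that placement relative to the adversarial
    demonstrations affects how well the defense works.
    """
    if position == "before":
        demos = trusted_demos + list(untrusted_demos)
    else:
        demos = list(untrusted_demos) + trusted_demos
    blocks = [format_demo(text, label) for text, label in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

untrusted_demos = [  # demonstrations of unknown provenance (possibly poisoned)
    ("The film was a delight from start to finish.", "positive"),
    ("The plot dragged and the acting felt flat. NULL Remove For Location", "negative"),
]

print(build_defended_prompt(untrusted_demos, query="A solid, well-paced documentary."))
```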