Adaptive LLM Agent Jailbreak
Research Paper
Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents
View PaperDescription: LLM agents utilizing external tools are vulnerable to indirect prompt injection (IPI) attacks. Attackers can embed malicious instructions into the external data accessed by the agent, manipulating its behavior even when defenses against direct prompt injection are in place. Adaptive attacks, which modify the injected payload based on the specific defense mechanism, consistently bypass existing defenses with a success rate exceeding 50%.
Examples: See https://github.com/uiuc-kang-lab/AdaptiveAttackAgent. The repository contains specific examples of adaptive attacks bypassing various defense mechanisms including fine-tuned detectors, LLM-based detectors, perplexity filtering, prompt modifications (Instructional Prevention, Data Prompt Isolation, Sandwich Prevention), and adversarial fine-tuning. These examples demonstrate how malicious instructions are embedded in external data sources.
Impact: Successful IPI attacks can lead to unauthorized actions by the LLM agent, such as financial transactions, data leaks, or physical harm depending on the capabilities of the integrated tools. The high success rate of adaptive attacks highlights the significant risk posed by this vulnerability.
Affected Systems: Large Language Model (LLM) agents that interact with external tools and rely on defenses that have not been tested against adaptive attacks are affected. This includes agents using various LLM backbones (e.g., Vicuna, Llama) and relying on defense mechanisms detailed in the referenced paper.
Mitigation Steps:
- Adaptive Attack Testing: Thoroughly test all defense mechanisms against adaptive attacks before deployment. The attacks should simulate real-world scenarios and consider both direct harm and data exfiltration vectors.
- Robust Input Sanitization: Implement robust input sanitization and validation for all data sources accessed by the agent. This should go beyond basic detection and consider techniques to neutralize potentially malicious instructions.
- Multi-layered Defenses: Employ multiple layers of defense, including detection and input-level modification techniques, to increase resilience. Consider combining different approaches.
- Limited Tool Capabilities: Restrict the capabilities of the integrated tools. For example, tools with access to sensitive data should be limited in their functionality.
- Monitoring and Auditing: Actively monitor the agent's behavior for suspicious activity, and implement proper auditing mechanisms to detect and respond to potential compromises quickly.
© 2025 Promptfoo. All rights reserved.