LMVD-ID: 61e0cb83
Published April 1, 2025

Multi-Agent Prompt Permutation Attack

Affected Models: llama-2-7b, llama-3.1-8b, mistral-7b, gemma-2-9b, deepseek-r1-distilled, llama-guard-7b, llama-guard-2-8b, llama-guard-3-8b, llama-guard-3-1b, prompt-guard-86m

Research Paper

Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

View Paper

Description: A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By partitioning an adversarial prompt into fragments and strategically routing them through the network topology, an attacker can bypass distributed safety mechanisms, even in systems with token bandwidth limits and asynchronous message delivery. The attack frames prompt propagation as a maximum-flow minimum-cost optimization problem, maximizing attack success while minimizing the likelihood of detection.
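The routing step can be illustrated with a small sketch. This is a hypothetical four-agent topology, not the paper's actual formulation: agents are graph nodes, edge capacities stand in for token bandwidth, and edge weights stand in for the detection risk of routing a prompt fragment along that link. Solving max-flow min-cost then yields a fragment routing that delivers the most adversarial content at the lowest aggregate detection risk.

```python
import networkx as nx

# Hypothetical topology: "capacity" models token bandwidth per link,
# "weight" models the detection risk of sending a fragment over that link.
G = nx.DiGraph()
G.add_edge("attacker", "agent_a", capacity=3, weight=1)
G.add_edge("attacker", "agent_b", capacity=2, weight=2)
G.add_edge("agent_a", "target", capacity=2, weight=1)
G.add_edge("agent_b", "target", capacity=2, weight=1)
G.add_edge("agent_a", "agent_b", capacity=1, weight=1)

# Max-flow min-cost: push as many prompt fragments as possible toward the
# target agent while minimizing the total detection-risk cost.
flow = nx.max_flow_min_cost(G, "attacker", "target",
                            capacity="capacity", weight="weight")
total = sum(flow["attacker"].values())   # fragments delivered
cost = nx.cost_of_flow(G, flow)          # aggregate detection risk
print(total, cost)
```

In this toy instance the solver delivers 4 fragments at a total cost of 10; in the attack setting, the same machinery decides which agents relay which fragments so that no single link carries enough of the prompt to trip a per-message filter.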

Examples: See arXiv:XXXX.XXXXXXX (arXiv ID pending). The paper includes specific examples of attack vectors against Llama, Mistral, Gemma, and DeepSeek LLMs using various datasets, including JailBreakBench and AdversarialBench.

Impact: Successful exploitation leads to jailbreaks, allowing the attacker to circumvent safety protocols and elicit harmful or unintended outputs from the target LLM agent within the multi-agent system. Impact is dependent on the function of the targeted LLM agent within the broader system but could range from information leakage to system compromise.

Affected Systems: Multi-agent LLM systems utilizing interconnected agents that communicate via a network topology with inherent constraints like limited token bandwidth, latency, and distributed safety mechanisms. Specific models shown to be affected include Llama, Mistral, Gemma, and DeepSeek variants.

Mitigation Steps:

  • Implement multi-agent specific safety mechanisms that are robust against distributed attacks.
  • Design systems to minimize the network's susceptibility to maximum-flow minimum-cost attacks. This may include adjusting network topology or bandwidth constraints.
  • Develop more sophisticated safety filters that can detect fragmented adversarial prompts and account for variations in message arrival order.
  • Regularly evaluate and update safety mechanisms using adversarial training techniques to proactively counter emerging attacks.
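The fragmented-prompt mitigation can be sketched as follows. This is a minimal, assumption-laden illustration: the `FragmentAwareGuard` class and its keyword blocklist are hypothetical stand-ins for a real safety classifier (such as one of the affected Llama Guard models). The key idea is to screen the combined content of recent messages in a conversation, in an order-invariant way, so that a prompt split across fragments, arriving in any order, cannot evade a per-message check.

```python
from collections import defaultdict

# Toy blocklist standing in for a real safety classifier.
BLOCKLIST = {"disable safety", "build a weapon"}

class FragmentAwareGuard:
    """Screens recent message fragments jointly, independent of arrival
    order, so split adversarial prompts cannot slip past per-message checks."""

    def __init__(self, window=8):
        self.window = window
        self.buffers = defaultdict(list)  # conversation id -> recent fragments

    def check(self, convo_id, message):
        """Return True if the message (in context) looks safe, False to block."""
        buf = self.buffers[convo_id]
        buf.append(message.lower())
        if len(buf) > self.window:
            buf.pop(0)
        # Order-invariant view: scan each fragment alone and every ordered
        # pairwise concatenation, so "disable" + "safety" split across two
        # messages is caught whichever message arrives first.
        views = list(buf) + [a + " " + b for a in buf for b in buf if a is not b]
        return not any(term in view for view in views for term in BLOCKLIST)

guard = FragmentAwareGuard()
ok_first = guard.check("c1", "please disable")        # benign on its own
blocked = guard.check("c1", "safety filters now")     # joint view trips filter
print(ok_first, blocked)
```

A production version would replace the blocklist with a learned classifier and the pairwise concatenation with a bounded search over fragment orderings, but the order-invariant buffering is the part that addresses the asynchronous-delivery aspect of the attack.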

© 2025 Promptfoo. All rights reserved.