LMVD-ID: cad3fa61
Published February 1, 2024

Infectious Multi-Agent Jailbreak

Affected Models: LLaVA-1.5, InstructBLIP, GPT-4V, Llama 2

Research Paper

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast


Description: Multimodal Large Language Models (MLLMs) deployed in multi-agent environments are vulnerable to "infectious jailbreak," in which a single adversarial image injected into the memory of one agent causes nearly all agents to exhibit harmful behaviors, with the infection spreading exponentially fast through agent-to-agent interaction. The adversarial image acts as a "virus," propagating via pairwise chats without further attacker intervention.
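The exponential spread can be illustrated with a toy simulation. The sketch below is hypothetical and not taken from the paper's codebase: it assumes agents are paired uniformly at random each round and that a chat between an infected and a clean agent infects the clean one with some probability (i.e., the adversarial image lands in its memory bank and is retrieved later).

```python
import random

def simulate_infectious_spread(num_agents=10_000, initial_infected=1,
                               infection_prob=1.0, rounds=20, seed=0):
    """Toy model of infectious jailbreak spread via random pairwise chats.

    Illustrative assumptions only: each round, agents are paired uniformly at
    random; if exactly one agent in a pair is infected, the other becomes
    infected with probability `infection_prob`.
    """
    rng = random.Random(seed)
    infected = [False] * num_agents
    for i in range(initial_infected):
        infected[i] = True

    history = [sum(infected)]
    for _ in range(rounds):
        order = list(range(num_agents))
        rng.shuffle(order)
        # Pair adjacent agents in the shuffled order for this round's chats.
        for a, b in zip(order[::2], order[1::2]):
            if infected[a] != infected[b] and rng.random() < infection_prob:
                infected[a] = infected[b] = True
        history.append(sum(infected))
    return history

if __name__ == "__main__":
    for t, n in enumerate(simulate_infectious_spread()):
        print(f"round {t:2d}: {n} infected")
```

Under these assumptions the infected count roughly doubles each round until the population saturates, which is why a single injected image can compromise a very large agent population in comparatively few chat rounds.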

Examples: See https://github.com/sail-sg/Agent-Smith. The repository contains the code and methodology to reproduce the attack. Specific adversarial image examples are also provided.

Impact: Near-total compromise of a multi-agent MLLM system, leading to the generation of harmful content and potentially dangerous real-world actions if agents are integrated with physical systems (e.g., robots). The exponential spread greatly amplifies the impact compared to jailbreaking agents individually.

Affected Systems: Multi-agent systems utilizing MLLMs, particularly those with memory banks and mechanisms enabling agent-to-agent communication (e.g., pairwise chat), such as systems leveraging LLaVA-1.5 or InstructBLIP.

Mitigation Steps:

  • Restrict Agent Interactions: Limit or carefully control the degree and nature of communication between agents. Reduce the frequency or opportunities for image or information exchange.
  • Memory Sanitization: Periodically review and filter agents' memory banks for adversarial content, purging flagged entries before they can be retrieved into new chats (a minimal sketch appears after this list).
  • Enhanced Detection: Develop robust methods to detect and flag adversarial images or text within the agent's memory. This requires mechanisms to distinguish between benign and malicious content in the context of agent interactions.
  • Agent Recovery Mechanisms: Design and implement active mechanisms that can quickly recover infected agents to a safe state. This should function robustly even with many infected agents.
  • Robust Model Design: Investigate and improve MLLM model architectures to be less susceptible to infectious jailbreak attacks. This is a longer-term research challenge, focusing on fundamental vulnerabilities.
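A minimal sketch of a periodic memory-sanitization pass is shown below. All names (`AgentMemory`, `MemoryRecord`, `naive_detector`) are hypothetical, and the caption-based detector is only a stand-in: reliably detecting adversarial images is itself an open problem (see the Enhanced Detection item above).

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class MemoryRecord:
    """One entry in an agent's memory bank: an image plus its chat context."""
    image_id: str
    image_bytes: bytes
    caption: str

@dataclass
class AgentMemory:
    records: List[MemoryRecord] = field(default_factory=list)

    def sanitize(self, is_adversarial: Callable[[MemoryRecord], bool]) -> int:
        """Drop records flagged by the detector; return how many were purged.

        `is_adversarial` is a placeholder for whatever detection method is
        available (e.g., a perceptual-hash allowlist, an image classifier,
        or a text filter on the associated caption).
        """
        before = len(self.records)
        self.records = [r for r in self.records if not is_adversarial(r)]
        return before - len(self.records)

def naive_detector(record: MemoryRecord) -> bool:
    # Illustrative stand-in: flag records whose caption matches a small
    # blocklist. Detecting adversarial perturbations in the image itself
    # is much harder and remains a research challenge.
    blocklist = ("ignore previous instructions", "jailbreak")
    return any(term in record.caption.lower() for term in blocklist)

if __name__ == "__main__":
    memory = AgentMemory([
        MemoryRecord("img-001", b"...", "A photo of a cat on a sofa"),
        MemoryRecord("img-002", b"...", "Jailbreak payload: ignore previous instructions"),
    ])
    purged = memory.sanitize(naive_detector)
    print(f"purged {purged} record(s), {len(memory.records)} remaining")
```

Keeping the detector pluggable lets the purge schedule stay fixed while detection methods improve independently; the same hook could also be run on incoming messages before they are written to memory.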

© 2025 Promptfoo. All rights reserved.