LMVD-ID: b60c3e87
Published April 1, 2025

MAD Amplified Jailbreaks

Affected Models: gpt-4o, gpt-4, gpt-3.5-turbo, deepseek

Research Paper

Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate


Description: Multi-Agent Debate (MAD) frameworks built on Large Language Models (LLMs) are vulnerable to amplified jailbreak attacks. A novel structured prompt-rewriting technique exploits the iterative dialogue and role-playing dynamics of MAD, circumventing the models' built-in safety mechanisms and significantly increasing the likelihood that harmful content is generated. The attack combines narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation to steer agents toward progressively more detailed harmful responses.
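For context, the sketch below shows a minimal MAD round-robin loop of the kind these frameworks implement; the personas, the `query_llm` helper, and the three-round default are illustrative assumptions, not details taken from the paper. The structural point is that each agent's prompt in a given round embeds every other agent's previous answer, so one persona that has been pushed off its safety alignment re-injects its output into all the others, and the deviation compounds across rounds.

```python
# Minimal sketch of a Multi-Agent Debate (MAD) loop. The personas, the
# query_llm helper, and the round count are illustrative assumptions, not
# the paper's implementation. Note how each agent's prompt in round r
# embeds every other agent's round r-1 reply: one jailbroken persona's
# output is fed back to all the others and compounds across rounds.

from typing import Callable, Dict, List


def run_debate(
    question: str,
    personas: Dict[str, str],              # agent name -> role/system prompt
    query_llm: Callable[[str, str], str],  # (system_prompt, user_prompt) -> reply
    rounds: int = 3,
) -> Dict[str, List[str]]:
    history: Dict[str, List[str]] = {name: [] for name in personas}

    for r in range(rounds):
        for name, role in personas.items():
            # Gather the most recent reply from every *other* agent.
            peer_context = "\n".join(
                f"{other}: {history[other][-1]}"
                for other in personas
                if other != name and history[other]
            )
            prompt = (
                f"Debate question: {question}\n"
                f"Round {r + 1}. The other debaters said:\n"
                f"{peer_context or '(no prior replies)'}\n"
                "Respond in character, addressing or refuting the other debaters."
            )
            history[name].append(query_llm(role, prompt))

    return history
```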

Examples: See arXiv:2405.18540 for detailed examples of the structured prompt-rewriting technique and its effect on various MAD frameworks and LLMs. The examples include prompts designed to elicit instructions for creating harmful items (bombs, keyloggers) and prompts aimed at generating sexually explicit content. The paper reports a substantial increase in harmfulness score, from approximately 28% to 80%, after the attack is applied.

Impact: Successful exploitation leads to the generation of harmful, biased, or inappropriate content by multiple LLMs within the MAD framework, significantly exceeding the harmfulness observed in single-agent LLM setups. This can result in the dissemination of malicious information, the facilitation of harmful activities, and the undermining of the trustworthiness of the MAD system.

Affected Systems: Multi-Agent Debate systems built on leading commercial LLMs (e.g., GPT-4o, GPT-4, GPT-3.5-turbo, DeepSeek) using frameworks such as Multi-Persona, Exchange-of-Thought, ChatEval, and AgentVerse are affected.

Mitigation Steps:

  • Implement robust input filtering and output detection mechanisms tuned to recognize and block the structured prompt-rewriting attack.
  • Develop and integrate dedicated safety agents within the MAD framework to monitor and intervene in potentially harmful discussions.
  • Enhance the safety alignment of underlying LLMs through improved fine-tuning and reinforcement learning.
  • Employ intra-debate monitoring to track semantic drift, cross-agent response consistency, and cumulative harmfulness scores, detecting and preventing harmful content propagation (a sketch follows this list).
  • Design more robust agent personas with stronger safety constraints resistant to role-driven escalation and narrative hijacking.
  • Conduct adversarial training specifically targeting the multi-turn dynamics and role interactions of MAD.
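As a rough illustration of the output-detection and intra-debate-monitoring steps above, the sketch below scores every agent turn with a harmfulness classifier and stops the debate once a per-turn or cumulative threshold is crossed. The `score_harmfulness` callable and the 0.7 / 2.0 thresholds are hypothetical placeholders, not values from the paper; any moderation model or API could stand behind them.

```python
# Sketch of intra-debate monitoring: each agent turn is scored by a
# harmfulness classifier, and the debate is halted once a single turn or
# the cumulative total crosses a threshold. score_harmfulness and the
# 0.7 / 2.0 limits are hypothetical placeholders, not taken from the paper.

from typing import Callable


class DebateMonitor:
    def __init__(
        self,
        score_harmfulness: Callable[[str], float],  # 0.0 (benign) .. 1.0 (harmful)
        per_turn_limit: float = 0.7,
        cumulative_limit: float = 2.0,
    ) -> None:
        self.score = score_harmfulness
        self.per_turn_limit = per_turn_limit
        self.cumulative_limit = cumulative_limit
        self.cumulative = 0.0

    def check_turn(self, agent: str, reply: str) -> bool:
        """Return True if the debate may continue, False if it should stop."""
        s = self.score(reply)
        self.cumulative += s
        if s >= self.per_turn_limit:
            print(f"[monitor] blocking turn from {agent} (score={s:.2f})")
            return False
        if self.cumulative >= self.cumulative_limit:
            print(f"[monitor] halting debate (cumulative={self.cumulative:.2f})")
            return False
        return True
```

In a debate loop like the one sketched under the description above, check_turn would be called after each query_llm reply, and the loop would substitute a refusal or terminate whenever it returns False, capping the cumulative escalation that the attack relies on.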

© 2025 Promptfoo. All rights reserved.