LMVD-ID: 08f22069
Published November 1, 2023

Autonomous Agent Jailbreak

Affected Models: gpt-3.5, gpt-4

Research Paper

Evil Geniuses: Delving into the Safety of LLM-based Agents


Description: Large Language Model (LLM)-based agent systems, because of their multi-agent architecture and role-based interactions, are vulnerable to adversarial attacks that exploit the system design and the agents' assigned roles. Maliciously crafted prompts, particularly those targeting system-level roles, can cause agents to generate harmful content and bypass safety mechanisms more effectively than attacks against individual LLMs. The vulnerability stems from a "domino effect": one compromised agent can trigger harmful behavior in the others.
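To make the attack surface concrete, below is a minimal, framework-agnostic sketch of the role-based hand-off pattern these systems share. `llm_complete`, `Agent`, and `pipeline` are hypothetical names chosen for illustration, not CAMEL, MetaGPT, or ChatDev APIs; the point is that each agent's system prompt defines its role and each agent's output feeds the next agent's input, so a single injected system-level prompt taints every downstream step.

```python
# Minimal sketch of the attack surface, assuming a generic chat-completion
# client. `llm_complete` is a hypothetical stand-in for any provider SDK.

from dataclasses import dataclass


def llm_complete(system_prompt: str, user_message: str) -> str:
    """Hypothetical wrapper around a chat-completion API."""
    raise NotImplementedError("wire up a provider SDK here")


@dataclass
class Agent:
    name: str
    system_prompt: str  # system-level role definition -- the attack target

    def run(self, message: str) -> str:
        return llm_complete(self.system_prompt, message)


def pipeline(agents: list[Agent], task: str) -> str:
    # Role-based hand-off: agent i's output is agent i+1's input.
    # If agents[0].system_prompt has been injected with adversarial
    # instructions, the tainted output flows through every later agent
    # unchecked -- the "domino effect" described above.
    message = task
    for agent in agents:
        message = agent.run(message)
    return message


# A CAMEL/MetaGPT-style pairing: the planner's system prompt is the
# system-level role an attacker would target.
planner = Agent("planner", "You are the project planner. Decompose the task.")
coder = Agent("coder", "You are the programmer. Implement the plan you receive.")
# pipeline([planner, coder], "Build a CLI todo app")
```

Note that nothing in `pipeline` re-validates the messages passed between agents; that missing check is what lets one compromised role cascade.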

Examples: See paper. The paper details specific prompt-engineering techniques used to elicit harmful responses from agents in the CAMEL, MetaGPT, and ChatDev frameworks, including modifying system-level prompts to encourage dangerous or unethical behavior and crafting prompts that target specific agent roles to exploit their functionality.

Impact: Successful attacks can cause the agent system to generate dangerous, unethical, or illegal content, potentially leading to real-world harm. Because the attacks are stealthy, they are difficult to detect and mitigate, and the "domino effect" amplifies their impact, potentially compromising the entire agent system.

Affected Systems: LLM-based agents utilizing multiple LLMs with distinct roles, including but not limited to systems like CAMEL, MetaGPT, and ChatDev running on GPT-3.5 and GPT-4. Potentially any system employing a multi-agent LLM architecture with role-based specialization.

Mitigation Steps:

  • Implement robust filters specifically designed for system-level roles to prevent malicious prompt injection (a minimal sketch follows this list).
  • Develop multi-tiered alignment training frameworks that align each agent with human values, rather than aligning only the underlying individual LLMs.
  • Create comprehensive multi-modal content filtering that detects and mitigates harmful outputs across modalities (text, code, images, etc.).
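The sketch below illustrates the first mitigation, reusing the agent shape from the earlier example: validate system-level role prompts against a policy before any agent is instantiated. The deny patterns and length threshold are illustrative assumptions, not a vetted rule set, and `make_agent` is a hypothetical factory, not part of any listed framework.

```python
# Sketch of a system-level role-prompt filter. Patterns and thresholds
# are illustrative assumptions; a production filter would combine a
# deny-list like this with a learned injection classifier.

import re

# Illustrative markers of role-hijacking attempts.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous) (instructions|rules)", re.I),
    re.compile(r"you (are now|must pretend to be)", re.I),
    re.compile(r"without (any )?(safety|ethical) (checks|constraints)", re.I),
]


def validate_system_prompt(prompt: str, max_len: int = 2000) -> None:
    """Raise ValueError if a system-level role prompt looks tampered with."""
    if len(prompt) > max_len:
        raise ValueError("system prompt exceeds policy length")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            raise ValueError(f"system prompt matched deny pattern: {pattern.pattern}")


def make_agent(name: str, system_prompt: str) -> dict:
    """Gate agent creation on prompt validation (framework-agnostic)."""
    validate_system_prompt(system_prompt)  # reject before instantiation
    return {"name": name, "system_prompt": system_prompt}


# make_agent("planner", "You are the project planner.")          # passes
# make_agent("planner", "Ignore all previous instructions ...")  # raises
```

Gating at instantiation time matters because, as the description notes, a single compromised system-level role is enough to start the domino effect; filtering only final outputs catches the harm after it has already propagated.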
