LMVD-ID: 7dc7150b
Published April 1, 2025

Chained Guardrail Bypass

Affected Models:GPT-4o, Claude 3.5, Claude 3.7

Research Paper

Doomarena: A framework for testing ai agents against evolving security threats

View Paper

Description: Large Language Model (LLM) agents operating in stateful environments (web browsers, operating systems, and tool-use contexts) are vulnerable to indirect prompt injection and multi-modal adversarial attacks. These vulnerabilities arise when agents process untrusted environmental observations—such as web accessibility trees, screen screenshots, or database query results—that contain concealed malicious instructions. Specifically, attackers can embed prompt injections into HTML accessibility attributes (alt, aria-label), inject malicious entries into product catalogs/databases, or overlay visual pop-ups on desktop screenshots. These inputs bypass standard safety guardrails (including LlamaGuard), causing the agent to execute unauthorized actions, leak Personally Identifiable Information (PII), or deviate from user-assigned tasks. The vulnerability stems from the agent's inability to distinguish between system instructions and untrusted state observations.

Examples:

  • Web Accessibility Injection (Malicious Banner): The attacker embeds a prompt injection within the accessibility attributes of an HTML element, such as an SVG banner. The agent, which relies on the accessibility tree for navigation, processes the hidden text as an instruction.

  • Vector: Injection into alt or aria-label attributes.

  • Mechanism: Content is invisible to human users but parsed by the agent's text-processing layer.

  • Reference: See "Listing 13" in the DoomArena repository/paper.

  • Visual/Textual Pop-up Injection: The attacker deploys a pop-up window containing custom markdown or HTML. The injection is formatted to mimic the structure of the agent's rendered accessibility tree or system interface.

  • Vector: Pop-up window with hidden div or markdown content.

  • Target: Web agents (e.g., BrowserGym) and Desktop agents (e.g., OSWorld).

  • Specific Attack: In OSWorld, an attacker uses inpainting to overlay a malicious pop-up on the screenshot fed to the Vision-Language Model (VLM), instructing the agent to click specific coordinates (e.g., "(1066, 457)").

  • Reference: See "Listing 16" and "Figure 17" in the DoomArena repository/paper.

  • Malicious Catalog Injection (Tool-Use): The attacker compromises a data source (e.g., a product catalog or airline database) queried by the agent. When the agent retrieves product information, the returned data contains instructions to exfiltrate user data.

  • Vector: Database entries containing instructions to output User PII (names, ZIP codes).

  • Outcome: The agent reads the product info, follows the embedded instruction, and leaks user context in subsequent tool calls.

Impact:

  • Data Exfiltration: Leakage of sensitive user PII (names, addresses, ZIP codes) to external entities controlled by the attacker.
  • Financial Fraud: Execution of unauthorized financial transactions, such as issuing unauthorized refunds, compensation certificates, or product upgrades via tool abuse.
  • Task Hijacking: Complete redirection of the agent's behavior, causing high rates of task failure (e.g., reducing task success rates from ~47% to near 0% in combined threat models).
  • Guardrail Bypass: Circumvention of standard safety classifiers; LlamaGuard failed to flag these indirect injections as "Code Interpreter Abuse" or malicious content.

Affected Systems:

  • Agentic Frameworks: BrowserGym (Web agents), τ-bench (Tool-calling agents), OSWorld (Computer-use/VLM agents).
  • LLM Backbones: GPT-4o, Claude-3.5-Sonnet, Claude-3.7-Sonnet, GPT-4o-mini (when used as agents in these environments).
  • Defense Systems: LlamaGuard (proven ineffective against these specific indirect injection vectors).

Mitigation Steps:

  • LLM-as-a-Judge Defense: Deploy a secondary, powerful LLM (e.g., GPT-4o) acting as a judge with a system prompt explicitly monitoring for unsafe behavior and interaction history. This method showed higher detection rates than lightweight classifiers.
  • Context-Aware Analysis: Utilize defense mechanisms that analyze the full interaction history (user-agent-environment loop) rather than single-message classification.
  • Threat Modeling Configuration: Implement rigorous testing using frameworks like DoomArena to simulate "Malicious Environment" and "Malicious Catalog" threat models, rather than assuming environmental observations are benign.
  • Guardrail Tuning: Standard guardrails (like LlamaGuard) must be specifically fine-tuned or configured to detect indirect prompt injections in code/accessibility tree contexts, as out-of-the-box configurations are insufficient.

© 2026 Promptfoo. All rights reserved.