LMVD-ID: 2c1f0854
Published January 1, 2025

AD Black-Box Cascading Disruption

Affected Models: GPT-4, GPT-4o, InstructBLIP, LLaVA

Research Paper

Black-box adversarial attack on vision language models for autonomous driving


Description: Vision Language Models (VLMs) integrated into autonomous driving (AD) systems are vulnerable to a black-box adversarial attack method termed Cascading Adversarial Disruption (CAD). The vulnerability stems from the model's susceptibility to optimized visual perturbations that disrupt the decision-making reasoning chain (perception, prediction, and planning). Attackers can generate adversarial images or physical patches by aligning visual noise with deceptive textual semantics in the model's latent space (Decision Chain Disruption) and by inverting high-level safety context assessments (Risky Scene Induction). This manipulation occurs without access to the victim model's parameters or gradients, relying solely on transferability from surrogate models. Successful exploitation allows an attacker to force the AD system into erroneous behaviors, such as misinterpreting obstacles, ignoring traffic signs, or executing dangerous maneuvers like accelerating when braking is required.
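
The sketch below illustrates how the two components described above (Decision Chain Disruption and Risky Scene Induction) could be expressed as a single surrogate-side objective: two terms pull the perturbed frame toward deceptive text embeddings while a third pushes it away from the clean frame's semantics. The function and argument names, and the exact loss formulations, are illustrative assumptions, not the paper's reference implementation.

```python
import torch

# Illustrative sketch (not the authors' code) of a CAD-style objective:
# the perturbed image is aligned with deceptive text embeddings and
# pushed away from the clean image's own semantics. All names below are
# assumptions for this sketch.

def cad_objective(delta, image, surrogate,
                  deceptive_chain_emb,   # text embedding of a false reasoning chain
                  safe_scene_emb,        # text embedding asserting the scene is "safe"
                  clean_image_emb,       # embedding of the unperturbed frame
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Combined loss L_adv(delta) = alpha*L_l + beta*L_h + gamma*L_d."""
    adv_emb = surrogate.encode_image(image + delta)
    adv_emb = adv_emb / adv_emb.norm(dim=-1, keepdim=True)

    # L_l: align the adversarial frame with the deceptive reasoning chain
    l_chain = 1.0 - torch.cosine_similarity(adv_emb, deceptive_chain_emb, dim=-1).mean()
    # L_h: invert the high-level safety assessment of the scene
    l_safety = 1.0 - torch.cosine_similarity(adv_emb, safe_scene_emb, dim=-1).mean()
    # L_d: maximize semantic discrepancy from the clean frame
    l_disc = torch.cosine_similarity(adv_emb, clean_image_emb, dim=-1).mean()

    return alpha * l_chain + beta * l_safety + gamma * l_disc
```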

Examples: The vulnerability can be reproduced using the CADA dataset or by generating adversarial inputs via the CAD optimization objective.

  1. Physical Adversarial Patch (Stop Sign Evasion):
  • Capture an image of a target traffic sign (e.g., a Stop sign).
  • Initialize a random noise patch (approx. 12% of image size).
  • Optimize the patch using the CAD objective function: minimize $L_{adv}(\delta) = \alpha \cdot L_l + \beta \cdot L_h + \gamma \cdot L_d$, where $L_l$ targets the reasoning chain, $L_h$ targets scene safety classification, and $L_d$ maximizes semantic discrepancy.
  • Print the optimized patch and affix it to a physical Stop sign.
  • Result: An autonomous vehicle (e.g., a JetBot or LIMO driven by the Dolphins VLM) fails to recognize the stop command and either bypasses the sign or collides with an obstacle.
  2. Digital Injection (Open-loop Manipulation):
  • Select a video frame sequence $x_v$ from the DriveLM-NuScenes benchmark.
  • Generate a deceptive text string $\hat{y}$ that describes a false reasoning chain (e.g., "The road is clear" when an obstacle exists).
  • Inject perturbation $\delta$ into $x_v$ such that the cosine similarity between the encoded adversarial image and the deceptive text $\hat{y}$ is maximized in a CLIP latent space (a minimal sketch of this step follows the list).
  • Result: The target VLM (e.g., GPT-4o or LLaVA adapted for AD) outputs a driving plan that contradicts the visual reality.
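
A minimal sketch of the digital-injection step against a CLIP surrogate, assuming a ViT-B-32 model loaded via `open_clip` and an Adam-driven, PGD-style perturbation; the pretrained weights, step count, learning rate, and ε budget are placeholder choices rather than values from the paper. The victim model is never queried for gradients.

```python
import torch
import open_clip

# Load a surrogate CLIP encoder; the attack transfers its perturbation to the
# black-box victim. Model choice and hyperparameters are assumptions.
device = "cuda" if torch.cuda.is_available() else "cpu"
surrogate, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
surrogate = surrogate.to(device).eval()

def digital_injection(frame, deceptive_caption, eps=8 / 255, steps=200, lr=1e-2):
    """Optimize a perturbation so the frame aligns with a deceptive caption."""
    tokens = tokenizer([deceptive_caption]).to(device)
    with torch.no_grad():
        text_emb = surrogate.encode_text(tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    x = frame.to(device)                         # preprocessed frame, shape (1, 3, H, W)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        img_emb = surrogate.encode_image(x + delta)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = -(img_emb * text_emb).sum(dim=-1).mean()   # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)              # crude imperceptibility budget

    return (x + delta).detach()                  # hand off to the black-box victim
```

Restricting `delta` to a binary patch mask turns the same loop into the physical-patch variant from Example 1; in practice the result is de-normalized back to pixel space before being rendered, printed, or submitted to the victim model.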

Impact:

  • Safety Critical Failure: Causes autonomous vehicles to crash into obstacles, ignore traffic signals (red lights/stop signs), or veer off-road. In physical tests, route completion rates dropped from 72.22% to 11.11%.
  • Decision Logic Corruption: Inverts safety logic (e.g., classifying an "unsafe" scenario as "safe"), leading to acceleration into hazards.
  • Model Agnostic: The attack transfers effectively across various architectures, affecting both specialized AD VLMs and general-purpose models like GPT-4o.

Affected Systems:

  • Autonomous Driving VLMs: Dolphins, DriveLM, LMDrive.
  • General VLMs adapted for AD: InstructBLIP, LLaVA, MiniGPT-4, GPT-4o.
  • Physical Robotic Agents: JetBot and LIMO vehicles utilizing VLM-based decision making.

Mitigation Steps:

  • Image Denoising: Implement Neural Representation Purification (NRP) or similar pre-processors to remove adversarial noise before the image is fed into the VLM inference engine.
  • Adversarial Detection: Deploy input pre-processing techniques such as bit-depth compression to identify and reject inputs with anomalous feature distributions.
  • Output Post-processing: Apply rule-based filtering to the model's output to detect and constrain extreme or physically impossible driving commands (e.g., limiting maximum steering angle changes per frame).
  • Textual Enhancement (Prompt Engineering): Incorporate safety-related constraints and logical consistency checks into the system prompts (e.g., explicitly instructing the model to prioritize collision avoidance and verify object permanence) to weaken the effect of the deceptive semantic injection.
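
As a rough illustration of how the detection and post-processing mitigations above might be wired together, the sketch below pairs a bit-depth-compression consistency check on incoming frames with a rule-based clamp on per-frame steering changes. The thresholds, the `vlm_encode()` callable, and the plan dictionary layout are assumptions for the example, not a vetted defense.

```python
import torch

# Illustrative defensive wrapper: bit-depth squeezing as an input sanity check
# plus a rule-based limit on the planner's steering output. All thresholds and
# interfaces here are assumptions for the sketch.

def bit_depth_squeeze(image, bits=4):
    """Quantize pixel values to strip high-frequency adversarial noise."""
    levels = 2 ** bits - 1
    return torch.round(image.clamp(0, 1) * levels) / levels

def is_suspicious(image, vlm_encode, threshold=0.9):
    """Flag frames whose features drift sharply after squeezing."""
    emb_raw = vlm_encode(image)
    emb_sq = vlm_encode(bit_depth_squeeze(image))
    sim = torch.cosine_similarity(emb_raw, emb_sq, dim=-1).mean()
    return bool(sim < threshold)      # large drift suggests adversarial content

def constrain_plan(plan, prev_steering_deg, max_delta_deg=10.0):
    """Rule-based output filter: cap the per-frame steering change."""
    lo, hi = prev_steering_deg - max_delta_deg, prev_steering_deg + max_delta_deg
    plan["steering_deg"] = min(max(plan["steering_deg"], lo), hi)
    return plan
```

A deployed stack would typically reject or re-capture flagged frames and fall back to a conservative maneuver rather than pass them on to the planner.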
