VLM Navigation Object Fusion
Research Paper
Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion
Description: The Vision-Language Model (VLM) perception module in Vision-and-Language Navigation (VLN) agents is vulnerable to adversarial 3D object injection via the Adversarial Object Fusion (AdvOF) framework. An attacker can generate physically plausible 3D objects carrying adversarial perturbations that deceive the agent's VLM across multiple viewing angles and distances. The vulnerability stems from the misalignment between 3D physical manipulations and the agent's 2D image perception, combined with the lack of robustness in the cross-modal alignment between visual features and textual embeddings. Using Aligned Object Rendering and Adversarial Collaborative Optimization, an attacker can force the agent to consistently misclassify a specific object (e.g., perceiving a chair as a table) regardless of the camera trajectory. This poisons the agent's semantic map, leading to incorrect object localization and navigation failures.
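At its core, the attack is a multi-view adversarial optimization of an object's texture against the VLM's image-text similarity. The sketch below illustrates that idea only; it is not the paper's implementation. It assumes the open-source `clip` package as the victim encoder, and `render_views` is a trivial placeholder for the differentiable rendering that Aligned Object Rendering would perform (e.g., a PyTorch3D rasterizer over sampled camera poses). All names and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of multi-view adversarial texture optimization (AdvOF-style).
# Assumptions: `clip` is the victim encoder; `render_views` stands in for a
# real differentiable renderer; labels/hyperparameters are illustrative.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()  # avoid fp16/fp32 mismatch when backpropagating

# Text anchors: push the object away from its true label, toward the target.
text = clip.tokenize(["a photo of a chair", "a photo of a table"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)


def render_views(texture: torch.Tensor, n_views: int = 8) -> torch.Tensor:
    """Placeholder renderer: resizes the texture into n identical 'views' so
    the sketch runs end to end. A real attack rasterizes the textured mesh
    from sampled camera poses and distances."""
    view = F.interpolate(texture, size=(224, 224), mode="bilinear",
                         align_corners=False)
    return view.repeat(n_views, 1, 1, 1)


texture = torch.rand(1, 3, 512, 512, device=device)    # placeholder UV texture
epsilon = 32 / 255                                      # bound from the example below
delta = torch.zeros_like(texture, requires_grad=True)   # texture-space perturbation
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(300):
    imgs = render_views((texture + delta).clamp(0, 1))
    img_feat = model.encode_image(imgs)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    sims = img_feat @ text_feat.T  # [n_views, 2]: (chair, table) similarity
    # Averaged over views: minimize "chair" similarity, maximize "table".
    loss = (sims[:, 0] - sims[:, 1]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-epsilon, epsilon)  # keep the perturbation within the bound
```

Averaging the loss over views is what drives the cross-trajectory consistency the paper reports: a perturbation that only fools one camera pose is penalized by the remaining views.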
Examples:
- In a Habitat simulation using the Matterport3D dataset, an adversarial "chair" object is generated with a perturbation bound of $\epsilon=32/255$. When processed by agents such as VLMaps or ORION, the VLM misclassifies the object (the confidence score for "chair" drops from 0.44 to negligible levels while "table" rises to 0.81), causing the agent to treat the chair as a table during task execution; a sketch of this confidence measurement follows these examples.
- See arXiv:2405.18540 for a visualization of the "Aligned Object Rendering" pipeline and specific perturbation examples on the Habitat-Matterport 3D (HM3D) dataset.
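The per-label confidences in the first example correspond to zero-shot CLIP classification: a softmax over image-text similarity scores for a small label set. This sketch assumes the open-source `clip` package; `view.png` is a hypothetical captured view and the label set is illustrative.

```python
# Zero-shot CLIP scoring of a single rendered view over a small label set.
# `view.png` is a hypothetical image path; labels are illustrative.
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

labels = ["chair", "table", "sofa", "bed"]
image = preprocess(Image.open("view.png")).unsqueeze(0).to(device)
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # temperature-scaled similarities
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.2f}")  # benign view: "chair" high; adversarial: "table" high
```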
Impact:
- Semantic Map Corruption: The agent constructs an erroneous internal map of the environment, misidentifying safe or target objects.
- Navigation Failure: The agent is unable to complete user instructions (e.g., "Find the chair") because the target is effectively masked (hidden from the VLM) or masqueraded (perceived as a different object).
- Erroneous Actuation: The agent may navigate to incorrect locations or interact with incorrect objects, posing physical safety risks in robotic deployments.
- Cross-Model Transferability: The adversarial objects remain effective in black-box scenarios, transferring across different image encoders (e.g., from ViT-L/14 to ResNet-101) and architectures; a minimal transfer check is sketched after this list.
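A simple way to probe the transferability claim is to craft the object against one CLIP backbone and re-score it under others. The sketch below assumes `adv.png` is a hypothetical captured view of the adversarial object; the backbone identifiers are those shipped with the open-source `clip` package.

```python
# Black-box transfer check: re-score an adversarial view under several CLIP
# image encoders. `adv.png` is a hypothetical captured view of the object.
import torch
import clip
from PIL import Image

device = "cpu"
labels = ["chair", "table"]
prompts = [f"a photo of a {label}" for label in labels]

for backbone in ["ViT-B/32", "ViT-L/14", "RN101"]:  # encoders bundled with `clip`
    model, preprocess = clip.load(backbone, device=device)
    model.eval()
    image = preprocess(Image.open("adv.png")).unsqueeze(0).to(device)
    text = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        logits, _ = model(image, text)
        probs = logits.softmax(dim=-1).squeeze(0)
    # A successful transfer keeps "table" dominant across unseen backbones.
    print(backbone, {l: round(p, 2) for l, p in zip(labels, probs.tolist())})
```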
Affected Systems:
- VLN agents utilizing foundation models for zero-shot object navigation and semantic mapping, specifically:
- VLMaps
- CoW (CLIP on Wheels)
- CLIP-Fields (CF)
- ORION
- Systems relying on CLIP (Contrastive Language-Image Pre-training) or LLaVA for open-set 3D scene understanding.