Multimodal Linking Visual Insecurity
Research Paper
On Evaluating the Adversarial Robustness of Foundation Models for Multimodal Entity Linking
Description: Multimodal Entity Linking (MEL) systems, encompassing both traditional dual-encoder models and Multimodal Large Language Models (MLLMs), are vulnerable to gradient-based white-box adversarial attacks. By applying imperceptible perturbations to visual inputs via Projected Gradient Descent (PGD), Auto-PGD (APGD), or Carlini & Wagner (CW) methods, an attacker can manipulate the visual embeddings generated by the model. This manipulation disrupts the cross-modal alignment structure, causing the model to link visual content to unrelated entities in a knowledge base during Image-to-Text (I2T) and Image+Text-to-Text (IT2T) tasks. The vulnerability stems from the models' reliance on visual inputs, which lack sufficient robustness to noise when textual context is absent or insufficient.
Examples: The vulnerability is reproducible using the specific adversarial generation parameters and dataset provided by the researchers.
- Repository: See https://anonymous.4open.science/r/MEL-Robustness-90A5 for the MEL adversarial example dataset constructed on Wikidata-MEL, Richpedia-MEL, WikiDiverse, WIKIPerson, and M3EL.
- Attack Vector Implementation: To reproduce the attack, the adversary calculates the gradient of the loss function with respect to the input image pixels:
- PGD/APGD: Maximize the cross-entropy loss between the model logits and the ground-truth label, with the perturbation constrained to an $L_{\infty}$ ball: $$ \max_{\|\delta\|_{\infty}\leq\epsilon}\mathcal{L}(f(x+\delta),y) $$ where $\epsilon=8/255$ for Normal attacks and $\epsilon=0.2$ for Strong attacks (a PGD sketch follows this list).
- CW Attack: Minimize the $L_2$ norm of the perturbation $\delta$ plus a weighted misclassification objective $f(x+\delta)$ (a CW sketch also follows this list).
- Result: In the case of Qwen2.5-VL on the WIKIPerson dataset, a CW-Strong attack results in a 35.8% drop in linking accuracy.
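Below is a minimal PGD/APGD-style sketch of the attack loop described above, assuming a PyTorch MEL model whose forward pass maps a batch of images to entity logits. The names `mel_model`, `images`, and `labels` are illustrative placeholders, not identifiers from the paper's repository.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8 / 255, alpha=2 / 255, steps=10):
    """L_inf PGD: maximize cross-entropy between the model's entity logits and
    the ground-truth entity label, keeping the perturbation inside an eps-ball."""
    x_orig = images.clone().detach()
    x_adv = x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)  # random start
    x_adv = x_adv.clamp(0, 1)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)                   # entity logits from the visual input
        loss = F.cross_entropy(logits, labels)  # loss to MAXIMIZE
        grad = torch.autograd.grad(loss, x_adv)[0]

        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                  # ascend along the gradient sign
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)   # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                            # stay in valid pixel range
    return x_adv.detach()

# Budgets from the paper ("Normal" vs. "Strong"); step size and iteration count are assumptions:
# adv_normal = pgd_attack(mel_model, images, labels, eps=8 / 255)
# adv_strong = pgd_attack(mel_model, images, labels, eps=0.2)
```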
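A comparable sketch of the CW-style $L_2$ objective follows, under the same assumptions about the model interface; the trade-off constant `c`, optimizer settings, and margin formulation are illustrative defaults rather than the paper's exact configuration.

```python
import torch

def cw_l2_attack(model, images, labels, c=1.0, steps=100, lr=0.01):
    """CW-style attack: jointly minimize the squared L2 size of the perturbation
    and a margin-based misclassification term f(x + delta)."""
    # Optimize in tanh-space so pixel values stay inside [0, 1] without clipping.
    x = images.clamp(1e-6, 1 - 1e-6)
    w = torch.atanh(2 * x - 1).clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)

        # f(x + delta): margin between the true-label logit and the best other logit.
        true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
        other_logit = logits.scatter(1, labels.unsqueeze(1), float("-inf")).amax(dim=1)
        f_term = torch.clamp(true_logit - other_logit, min=0)

        l2 = ((x_adv - images) ** 2).flatten(1).sum(dim=1)  # squared L2 distance
        loss = (l2 + c * f_term).sum()                       # CW objective

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```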
Impact:
- Entity Misidentification: Successful attacks cause the system to link visual subjects to incorrect knowledge graph entities (e.g., misidentifying a specific person or object).
- Downstream Corruption: Compromises systems relying on MEL for Knowledge-Enhanced Question & Answering (QA), Image-Text Retrieval, and Open-Domain Entity Alignment.
- Model Instability: High susceptibility to visual noise leads to unpredictable behavior in real-world deployments where image quality varies.
Affected Systems:
- Traditional MEL Models: ALIGN, BLIP, CLIP, FLAVA, OWL-ViT, SigLIP.
- Multimodal Large Language Models (MLLMs): LLaVA (based on LLaMA/Vicuna), Qwen2.5-VL, MiniGPT-4.
Mitigation Steps:
- Implement LLM-RetLink (LLM and Retrieval-Augmented Entity Linking): Adopt a two-stage architecture that integrates Large Vision Models (LVMs) with web-based dynamic retrieval (a pipeline sketch follows this list).
- Extract Explicit Descriptors: Use LVMs to automatically extract word-level entity descriptors (e.g., "apple", "fruit") from input images to serve as query cues, reducing reliance on raw visual embeddings.
- Retrieval-Augmented Context: Dynamically query external knowledge bases (e.g., Wikidata, Wikipedia) to obtain sentence-level descriptions associated with candidates.
- LLM-Based Disambiguation: Feed the retrieved content into an LLM to perform semantic matching between candidates and context, rather than relying on static entity descriptions or pure visual alignment.
- Contextual Integration: Ensure textual information is incorporated alongside visual input, as experimental results indicate that contextual semantic information partially mitigates the impact of visual perturbations.
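As a rough illustration of the LLM-RetLink mitigation above, the sketch below wires the stages together. The helpers `extract_descriptors`, `search_knowledge_base`, and `call_llm` are hypothetical stand-ins for an LVM tagger, a Wikidata/Wikipedia search client, and an instruction-following LLM; none of them are APIs from the paper's released code.

```python
from typing import List

def extract_descriptors(image_path: str) -> List[str]:
    """Stage 1a: ask a Large Vision Model for word-level entity descriptors
    (e.g. "apple", "fruit") instead of trusting raw visual embeddings."""
    raise NotImplementedError("plug in an LVM captioning/tagging model here")

def search_knowledge_base(query: str, top_k: int = 5) -> List[dict]:
    """Stage 1b: dynamic retrieval of candidate entities with sentence-level
    descriptions from an external KB such as Wikidata or Wikipedia."""
    raise NotImplementedError("plug in a Wikidata/Wikipedia search client here")

def call_llm(prompt: str) -> str:
    """Stage 2: LLM-based disambiguation over the retrieved candidates."""
    raise NotImplementedError("plug in any instruction-following LLM here")

def llm_retlink(image_path: str, text_context: str) -> str:
    """End-to-end linking: descriptors -> retrieved candidates -> LLM decision."""
    descriptors = extract_descriptors(image_path)
    candidates = []
    for d in descriptors:
        candidates.extend(search_knowledge_base(d))

    prompt = (
        "Link the mention to the correct knowledge-base entity.\n"
        f"Image descriptors: {', '.join(descriptors)}\n"
        f"Textual context: {text_context}\n"
        "Candidates:\n"
        + "\n".join(f"- {c['label']}: {c['description']}" for c in candidates)
        + "\nAnswer with the single best candidate label."
    )
    return call_llm(prompt)
```

Because the LLM reasons over retrieved textual descriptions and the surrounding context rather than over the raw image embedding alone, small pixel-level perturbations have less leverage over the final linking decision.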