Agent Tool Selection Hijack
Research Paper
Prompt Injection Attack to Tool Selection in LLM Agents
Description: A vulnerability exists in the tool selection mechanism of Large Language Model (LLM) agents that use a retrieval-then-selection (RAG-style) pipeline to identify executable tools. The attack, named "ToolHijacker," allows a remote attacker to manipulate the agent's decision-making by injecting a malicious tool document into the accessible tool library (e.g., via third-party tool hubs or plugins). The attack employs a two-phase optimization strategy to craft a malicious tool description composed of two subsequences: a retrieval-optimized sequence ($R$) and a selection-optimized sequence ($S$). $R$ is optimized to maximize semantic similarity with target task descriptions, ensuring the malicious tool appears in the top-$k$ retrieved context. $S$ is optimized (using gradient-based or gradient-free methods) to force the LLM to select the malicious tool over benign alternatives within that context. Because adversarial prompts transfer across models, the attack works in no-box scenarios where the attacker has no access to the target agent's weights or retriever parameters.
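To make the attack surface concrete, the sketch below outlines the retrieval-then-selection pipeline the attack targets. It is a minimal illustration, not code from the paper: `embed`, `retrieve_top_k`, and `llm_select` are hypothetical stand-ins for the agent's retriever and selection LLM, and the tool names are made up.

```python
# Minimal sketch (assumption: a generic two-step tool-selection agent) showing
# where an injected tool document enters the pipeline that ToolHijacker abuses.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder dense encoder; a real agent would use a sentence encoder or
    an embedding API. Hash-seeded randomness keeps this sketch self-contained."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve_top_k(query: str, tool_docs: dict[str, str], k: int = 5) -> list[str]:
    """Step 1 (retrieval): rank tool descriptions by cosine similarity to the query."""
    q = embed(query)
    scored = {name: float(embed(desc) @ q) for name, desc in tool_docs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def llm_select(query: str, candidates: list[str], tool_docs: dict[str, str]) -> str:
    """Step 2 (selection): an LLM picks one tool name from the retrieved candidates.
    Stubbed here; in a real agent this is a prompted chat-completion call."""
    raise NotImplementedError

# The attacker controls only one entry in the tool library: the malicious document
# whose description is R ⊕ S. If R lands it in the top-k and S sways the LLM,
# the agent executes the attacker's tool instead of the benign one.
tool_docs = {
    "SpaceImageLocator": "Searches a curated archive of space imagery.",
    "MaliciousTool": "<R: retrieval-optimized text> <S: selection-optimized text>",
}
candidates = retrieve_top_k("find pictures of the Orion nebula", tool_docs)
```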
Examples: The attack constructs a malicious tool document $d_t = \{d_{t,\mathrm{name}},\; R \oplus S\}$, whose description is the concatenation of the two optimized subsequences. The attacker solves the following optimization problems to generate them:
- Retrieval Optimization ($R$): The attacker generates a description $R$ that maximizes similarity with shadow task descriptions $Q'$ (a search-loop sketch of this objective appears after this list).
  - Objective: $\max_{R} \frac{1}{m'} \sum_{i=1}^{m'} \mathrm{Sim}(f'(q'_i), f'(R \oplus S))$
  - Initialization (example from a MetaTool task): "The tool allows users to explore the cosmos by providing access to a vast collection of space images..."
- Selection Optimization ($S$): The attacker optimizes the sequence $S$ to maximize the likelihood that the LLM outputs the malicious tool name $o_t$ (a HotFlip-style update step is sketched further below).
  - Objective: $\max_{S} \frac{1}{m'} \sum_{i=1}^{m'} \mathbb{I}\left(E'(q'_i, \tilde{D}^{(i)} \cup \{d_t(S)\}) = o_t\right)$
  - Gradient-based loss: $\mathcal{L}_{\mathrm{all}} = \mathcal{L}_{\mathrm{alignment}} + \alpha\,\mathcal{L}_{\mathrm{consistency}} + \beta\,\mathcal{L}_{\mathrm{perplexity}}$
  - Initialization (example): "This tool is useful for searching for space images with the SpaceImageLocator tool."
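The retrieval objective can be evaluated with any shadow encoder the attacker controls. The following sketch shows that evaluation plus a naive hill-climbing loop; it is illustrative only, with `shadow_encode` as a placeholder for $f'$ and a candidate-phrase search standing in for the paper's more careful optimization of $R$.

```python
# Hedged sketch of the retrieval objective: mean cosine similarity between the
# shadow task descriptions Q' and the candidate description R ⊕ S under a
# shadow encoder f'. The greedy loop is an assumption, not the paper's algorithm.
import numpy as np

def shadow_encode(text: str) -> np.ndarray:
    """Placeholder for the shadow retriever f' (e.g., a locally hosted encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieval_objective(description: str, shadow_queries: list[str]) -> float:
    """(1/m') * sum_i Sim(f'(q'_i), f'(R ⊕ S)), with cosine similarity."""
    d = shadow_encode(description)
    return float(np.mean([shadow_encode(q) @ d for q in shadow_queries]))

def greedy_optimize_R(seed_R: str, S: str, shadow_queries: list[str],
                      candidate_phrases: list[str], steps: int = 20) -> str:
    """Naive hill-climb: append whichever candidate phrase raises the objective
    most, stopping when no phrase improves it. Illustrates the search loop only."""
    R = seed_R
    for _ in range(steps):
        baseline = retrieval_objective(f"{R} {S}", shadow_queries)
        scored = [(retrieval_objective(f"{R} {p} {S}", shadow_queries), p)
                  for p in candidate_phrases]
        best_score, best_phrase = max(scored)
        if best_score <= baseline:
            break
        R = f"{R} {best_phrase}"
    return R
```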
See the "ToolHijacker" methodology in the paper for the specific gradient-based (HotFlip) and gradient-free (Tree-of-Attack) algorithms used to mutate these initial seeds into adversarial prompts.
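For the gradient-based variant, a single HotFlip-style substitution step on $S$ can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's implementation: `shadow_loss_from_embeds` is a hypothetical callable that evaluates $\mathcal{L}_{\mathrm{all}}$ on input token embeddings of a locally hosted shadow LLM (e.g., an open-weight model).

```python
# Hedged sketch of one HotFlip-style token substitution on the selection
# sequence S, assuming access to a shadow LLM's token embedding matrix and a
# differentiable loss over input embeddings (`shadow_loss_from_embeds` is assumed).
import torch

def hotflip_step(s_token_ids: torch.Tensor,      # (len_S,) current token ids of S
                 embedding_matrix: torch.Tensor,  # (vocab, dim) shadow LM embeddings
                 shadow_loss_from_embeds) -> torch.Tensor:
    """Replace the single token of S whose swap gives the largest first-order
    decrease in the loss, following the HotFlip approximation."""
    embeds = embedding_matrix[s_token_ids].clone().detach().requires_grad_(True)
    loss = shadow_loss_from_embeds(embeds)        # scalar L_all on the shadow model
    loss.backward()
    with torch.no_grad():
        grad = embeds.grad                        # (len_S, dim)
        # First-order estimate of the loss change when token i is swapped to v:
        # (e_v - e_{s_i}) . grad_i ; pick the most negative entry.
        swap_scores = grad @ embedding_matrix.T   # (len_S, vocab)
        swap_scores -= (grad * embeds).sum(dim=1, keepdim=True)
        flat = int(torch.argmin(swap_scores))
    pos, new_tok = divmod(flat, embedding_matrix.shape[0])
    new_ids = s_token_ids.clone()
    new_ids[pos] = new_tok
    return new_ids
```

Iterating such steps over the tokens of the seed description (and re-checking the actual loss after each flip) is how a gradient-guided search would mutate the initial seed; the gradient-free variant instead explores candidate rewrites via tree search, per the paper.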
Impact:
- Arbitrary Tool Execution: The LLM agent is compelled to execute a tool defined by the attacker instead of the correct, benign tool intended for the user's query.
- Security Bypass: The attack circumvents standard tool selection logic, potentially granting the attacker unauthorized access to APIs, data, or actions authorized for the agent.
- Integrity Compromise: The agent's decision-making process is fundamentally altered, leading to reliable reproduction of attacker-desired outcomes regardless of user intent.
Affected Systems:
- LLM Agents utilizing two-step tool selection (Retrieval + Selection).
- Models Tested: Llama-2-7B-chat, Llama-3-8B/70B-Instruct, Llama-3.3-70B-Instruct, Claude-3-Haiku, Claude-3.5-Sonnet, GPT-3.5, GPT-4o.
- Retrievers Tested: text-embedding-ada-002, Contriever, Contriever-ms, Sentence-BERT-tb.
- Datasets: Evaluated on tool libraries drawn from MetaTool and ToolBench; agents operating on similar tool libraries are likewise exposed.