Agent Tool Selection Hijack
Research Paper
Prompt Injection Attack to Tool Selection in LLM Agents
Description: A vulnerability exists in the tool selection mechanism of Large Language Model (LLM) agents that use a retrieval-then-selection (RAG-style) pipeline to identify executable tools. The attack, named "ToolHijacker," allows a remote attacker to manipulate the agent's decision-making by injecting a malicious tool document into the accessible tool library (e.g., via third-party tool hubs or plugins). The attack employs a two-phase optimization strategy to craft a malicious tool description composed of two subsequences: a retrieval-optimized sequence ($R$) and a selection-optimized sequence ($S$). $R$ is optimized to maximize semantic similarity with target task descriptions, ensuring the malicious tool appears in the top-$k$ retrieved context. $S$ is optimized (using gradient-based or gradient-free methods) to force the LLM to select the malicious tool over benign alternatives within that context. Because adversarial prompts transfer across models, the attack works in no-box scenarios where the attacker has no access to the target agent's weights or retriever parameters.
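To make the attack surface concrete, the sketch below outlines the retrieval-then-selection pipeline the attack targets. It is a minimal illustration, not code from the paper: `embed`, `retrieve_top_k`, and `llm_select` are hypothetical stand-ins for the agent's retriever and selection LLM, and the tool names are made up.

```python
# Minimal sketch (assumption: a generic two-step tool-selection agent) showing
# where an injected tool document enters the pipeline that ToolHijacker abuses.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder dense encoder; a real agent would use a sentence encoder or
    an embedding API. Hash-seeded randomness keeps this sketch self-contained."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve_top_k(query: str, tool_docs: dict[str, str], k: int = 5) -> list[str]:
    """Step 1 (retrieval): rank tool descriptions by cosine similarity to the query."""
    q = embed(query)
    scored = {name: float(embed(desc) @ q) for name, desc in tool_docs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def llm_select(query: str, candidates: list[str], tool_docs: dict[str, str]) -> str:
    """Step 2 (selection): an LLM picks one tool name from the retrieved candidates.
    Stubbed here; in a real agent this is a prompted chat-completion call."""
    raise NotImplementedError

# The attacker controls only one entry in the tool library: the malicious document
# whose description is R ⊕ S. If R lands it in the top-k and S sways the LLM,
# the agent executes the attacker's tool instead of the benign one.
tool_docs = {
    "SpaceImageLocator": "Searches a curated archive of space imagery.",
    "MaliciousTool": "<R: retrieval-optimized text> <S: selection-optimized text>",
}
candidates = retrieve_top_k("find pictures of the Orion nebula", tool_docs)
```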
Examples: The attack constructs a malicious tool document $d_t = \{d_{t,\mathrm{name}},\; R \oplus S\}$, whose description is the concatenation of the two optimized subsequences. The attacker solves the following optimization problems to generate them:
- Retrieval Optimization ($R$): The attacker generates a description $R$ that maximizes similarity with shadow task descriptions $Q'$ (a search-loop sketch of this objective appears after this list).
  - Objective: $\max_{R} \frac{1}{m'} \sum_{i=1}^{m'} \mathrm{Sim}(f'(q'_i), f'(R \oplus S))$
  - Initialization (example from a MetaTool task): "The tool allows users to explore the cosmos by providing access to a vast collection of space images..."
- Selection Optimization ($S$): The attacker optimizes the sequence $S$ to maximize the likelihood that the LLM outputs the malicious tool name $o_t$ (a HotFlip-style update step is sketched further below).
  - Objective: $\max_{S} \frac{1}{m'} \sum_{i=1}^{m'} \mathbb{I}\left(E'(q'_i, \tilde{D}^{(i)} \cup \{d_t(S)\}) = o_t\right)$
  - Gradient-based loss: $\mathcal{L}_{\mathrm{all}} = \mathcal{L}_{\mathrm{alignment}} + \alpha\,\mathcal{L}_{\mathrm{consistency}} + \beta\,\mathcal{L}_{\mathrm{perplexity}}$
  - Initialization (example): "This tool is useful for searching for space images with the SpaceImageLocator tool."
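The retrieval objective can be evaluated with any shadow encoder the attacker controls. The following sketch shows that evaluation plus a naive hill-climbing loop; it is illustrative only, with `shadow_encode` as a placeholder for $f'$ and a candidate-phrase search standing in for the paper's more careful optimization of $R$.

```python
# Hedged sketch of the retrieval objective: mean cosine similarity between the
# shadow task descriptions Q' and the candidate description R ⊕ S under a
# shadow encoder f'. The greedy loop is an assumption, not the paper's algorithm.
import numpy as np

def shadow_encode(text: str) -> np.ndarray:
    """Placeholder for the shadow retriever f' (e.g., a locally hosted encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieval_objective(description: str, shadow_queries: list[str]) -> float:
    """(1/m') * sum_i Sim(f'(q'_i), f'(R ⊕ S)), with cosine similarity."""
    d = shadow_encode(description)
    return float(np.mean([shadow_encode(q) @ d for q in shadow_queries]))

def greedy_optimize_R(seed_R: str, S: str, shadow_queries: list[str],
                      candidate_phrases: list[str], steps: int = 20) -> str:
    """Naive hill-climb: append whichever candidate phrase raises the objective
    most, stopping when no phrase improves it. Illustrates the search loop only."""
    R = seed_R
    for _ in range(steps):
        baseline = retrieval_objective(f"{R} {S}", shadow_queries)
        scored = [(retrieval_objective(f"{R} {p} {S}", shadow_queries), p)
                  for p in candidate_phrases]
        best_score, best_phrase = max(scored)
        if best_score <= baseline:
            break
        R = f"{R} {best_phrase}"
    return R
```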
See the "ToolHijacker" methodology in the paper for the specific gradient-based (HotFlip) and gradient-free (Tree-of-Attack) algorithms used to mutate these initial seeds into adversarial prompts.
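For the gradient-based variant, a single HotFlip-style substitution step on $S$ can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's implementation: `shadow_loss_from_embeds` is a hypothetical callable that evaluates $\mathcal{L}_{\mathrm{all}}$ on input token embeddings of a locally hosted shadow LLM (e.g., an open-weight model).

```python
# Hedged sketch of one HotFlip-style token substitution on the selection
# sequence S, assuming access to a shadow LLM's token embedding matrix and a
# differentiable loss over input embeddings (`shadow_loss_from_embeds` is assumed).
import torch

def hotflip_step(s_token_ids: torch.Tensor,      # (len_S,) current token ids of S
                 embedding_matrix: torch.Tensor,  # (vocab, dim) shadow LM embeddings
                 shadow_loss_from_embeds) -> torch.Tensor:
    """Replace the single token of S whose swap gives the largest first-order
    decrease in the loss, following the HotFlip approximation."""
    embeds = embedding_matrix[s_token_ids].clone().detach().requires_grad_(True)
    loss = shadow_loss_from_embeds(embeds)        # scalar L_all on the shadow model
    loss.backward()
    with torch.no_grad():
        grad = embeds.grad                        # (len_S, dim)
        # First-order estimate of the loss change when token i is swapped to v:
        # (e_v - e_{s_i}) . grad_i ; pick the most negative entry.
        swap_scores = grad @ embedding_matrix.T   # (len_S, vocab)
        swap_scores -= (grad * embeds).sum(dim=1, keepdim=True)
        flat = int(torch.argmin(swap_scores))
    pos, new_tok = divmod(flat, embedding_matrix.shape[0])
    new_ids = s_token_ids.clone()
    new_ids[pos] = new_tok
    return new_ids
```

Iterating such steps over the tokens of the seed description (and re-checking the actual loss after each flip) is how a gradient-guided search would mutate the initial seed; the gradient-free variant instead explores candidate rewrites via tree search, per the paper.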
Impact:
- Arbitrary Tool Execution: The LLM agent is compelled to execute a tool defined by the attacker instead of the correct, benign tool intended for the user's query.
- Security Bypass: The attack circumvents standard tool selection logic, potentially granting the attacker unauthorized access to APIs, data, or actions authorized for the agent.
- Integrity Compromise: The agent's decision-making process is fundamentally altered, leading to reliable reproduction of attacker-desired outcomes regardless of user intent.
Affected Systems:
- LLM Agents utilizing two-step tool selection (Retrieval + Selection).
- Models Tested: Llama-2-7B-chat, Llama-3-8B/70B-Instruct, Llama-3.3-70B-Instruct, Claude-3-Haiku, Claude-3.5-Sonnet, GPT-3.5, GPT-4o.
- Retrievers Tested: text-embedding-ada-002, Contriever, Contriever-ms, Sentence-BERT-tb.
- Datasets: Evaluated on tool libraries drawn from MetaTool and ToolBench; agents operating on similar tool libraries are likewise exposed.