LMVD-ID: 37989546
Published August 1, 2025

KV-Cache Sharing Timing Side-channel

Affected Models: Llama 2 13B, Llama 3.2 1B, Llama 3.3 70B, DeepSeek-R1, Qwen 2.5 30B, Phi-4

Research Paper

Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference


Description: Multi-tenant Large Language Model (LLM) inference systems utilizing global Key-Value (KV) cache sharing are vulnerable to a timing side-channel attack. By measuring the Time-To-First-Token (TTFT) latency of crafted API requests, an unprivileged remote attacker can determine if specific token sequences have been previously processed and cached by the system for other users. This observable timing difference between cache hits (low TTFT) and cache misses (high TTFT) allows for the token-by-token reconstruction of sensitive user inputs, including Personally Identifiable Information (PII) and private prompt contexts.

Examples: The attack follows a "Probe, Detect, Reconstruct" cycle (as detailed in Section II-C of the reference paper):

  1. Probe: The attacker constructs a candidate prompt containing a prefix they suspect a victim has used (e.g., My email address is ) followed by a candidate character/token (e.g., a, b, c).
  2. Detect: The attacker submits these candidates to the inference API and records the TTFT.
     • Scenario A (Cache Hit): If a victim has recently processed My email address is a..., the KV-cache for this prefix already exists. The inference engine skips recomputation of the cached prefix, resulting in a low TTFT (e.g., $T_{hit}$).
     • Scenario B (Cache Miss): If the sequence has not been processed before, the engine must compute the KV states from scratch, resulting in a higher TTFT (e.g., $T_{miss}$).
  3. Reconstruct: The attacker identifies the candidate token yielding the lowest TTFT as the correct next token in the victim's input and appends it to the prefix. This process is iterated to recover the full sequence.

See "InputSnatch" and "PromptPeek" methodologies referenced in the study for automated implementations.

Impact:

  • Information Disclosure: Attackers can recover high-sensitivity text from other tenants, including credit card numbers, SSNs, medical data, and proprietary business instructions.
  • Privacy Violation: Breaches user isolation guarantees in shared inference environments without requiring privileged system access.

Affected Systems:

  • LLM serving frameworks that enable global/cross-user KV-cache sharing to optimize throughput.
  • Frameworks identified in the study as supporting or implementing the affected caching mechanisms include vLLM and SGLang (see the configuration sketch after this list).
  • Commercial or proprietary LLM APIs that rely on exact-match or semantic-match prefix caching across tenant boundaries.
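The timing gap only arises when prefix caching is enabled and cached blocks are reused across requests from different users. As a point of reference, the sketch below shows how automatic prefix caching is typically switched on in vLLM's offline Python API in recent versions; the model name and prompt are illustrative, and server deployments expose an equivalent flag.

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching: requests whose prompts share a prefix reuse the
# cached KV blocks for that prefix. In a multi-tenant deployment, this reuse
# is what produces the measurable TTFT gap between cache hits and misses.
llm = LLM(model="meta-llama/Llama-3.2-1B", enable_prefix_caching=True)

outputs = llm.generate(
    ["My email address is alice@example.com. Summarize my account status."],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```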

Mitigation Steps:

  • Selective KV-Cache Sharing: Implement a privacy-aware management framework (such as SafeKV) that classifies cache entries as either "safe" or "sensitive" at creation time; a minimal sketch of such a policy follows this list.
  • Multi-Tier Privacy Detection:
     • Deploy Rule-Based Pattern Matching (Tier-1) to catch explicit PII (emails, phone numbers) via regex.
     • Utilize a General Privacy Detector (Tier-2), such as a lightweight transformer model (e.g., Llama-3.2-1B), to identify subtle PII not caught by rules.
     • Perform Context-Aware Validation (Tier-3) using the in-service LLM to evaluate complex, context-dependent privacy risks.
  • Strict Isolation: Enforce cache isolation for any blocks classified as sensitive; these must never be shared across users.
  • Entropy-Based Access Monitoring: Monitor the access entropy of shared blocks at runtime (ratio of unique users to total hits). If a block exhibits low entropy (accessed frequently by a single user or small group, indicative of probing), revoke its public status and downgrade it to private.
  • Path Compression: Utilize a unified radix-tree index with path compression for private entries to minimize the performance overhead of isolation.
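To make the selective-sharing and entropy-monitoring steps concrete, here is a minimal, framework-agnostic Python sketch. It is not SafeKV itself: the Tier-1 patterns are illustrative rather than exhaustive, Tier-2/Tier-3 detection is reduced to a comment, the access-entropy metric follows the simple ratio described above (unique users divided by total hits), and names and thresholds such as `CachePolicy` and `entropy_floor` are assumptions.

```python
import re
from collections import Counter, defaultdict

# Tier-1 rule-based patterns for explicit PII (illustrative, not exhaustive).
PII_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email address
    re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b"),                    # US SSN
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),                           # payment card number
]


def tier1_is_sensitive(text: str) -> bool:
    """Tier-1: regex matching for explicit PII. Tier-2/Tier-3 detectors
    (a small classifier model, then the in-service LLM) would extend this."""
    return any(p.search(text) for p in PII_PATTERNS)


class CachePolicy:
    """Toy selective-sharing policy: classify blocks at insertion time and
    downgrade shared blocks whose access pattern looks like probing."""

    def __init__(self, entropy_floor: float = 0.2, min_hits: int = 20):
        self.entropy_floor = entropy_floor         # below this, revoke public status
        self.min_hits = min_hits                   # require enough hits before judging
        self.hits_by_user = defaultdict(Counter)   # block_id -> Counter(user_id)
        self.private_blocks: set[str] = set()

    def on_insert(self, block_id: str, text: str) -> None:
        """Classify a new cache block as shareable or private."""
        if tier1_is_sensitive(text):
            self.private_blocks.add(block_id)

    def is_shareable(self, block_id: str) -> bool:
        return block_id not in self.private_blocks

    def on_hit(self, block_id: str, user_id: str) -> None:
        """Record a cache hit and downgrade low-entropy shared blocks."""
        if block_id in self.private_blocks:
            return
        counts = self.hits_by_user[block_id]
        counts[user_id] += 1
        total = sum(counts.values())
        # Access entropy as described above: unique users / total hits.
        entropy = len(counts) / total
        if total >= self.min_hits and entropy < self.entropy_floor:
            self.private_blocks.add(block_id)      # revoke public status


# Example: a PII-bearing block is isolated at creation, and a block
# repeatedly probed by a single user is downgraded at runtime.
policy = CachePolicy()
policy.on_insert("blk-1", "My email address is alice@example.com")
policy.on_insert("blk-2", "Summarize the quarterly report")
for _ in range(25):
    policy.on_hit("blk-2", user_id="attacker")
print(policy.is_shareable("blk-1"), policy.is_shareable("blk-2"))  # False False
```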
