Model Layer Vulnerabilities

Vulnerabilities targeting the core model architecture and parameters

Related Vulnerabilities

156 entries

Special Token Jailbreak

10/31/2025

Large Language Models (LLMs) that use special tokens to define conversational structure (e.g., via chat templates) are vulnerable to a jailbreak attack named MetaBreak. An attacker can inject these special tokens, or regular tokens with high semantic similarity in the embedding space, into a user prompt. This manipulation allows the attacker to bypass the model's internal safety alignment and external content moderation systems. The attack leverages four primitives:

  1. Response Injection: Forging an assistant's turn within the user prompt to trick the model into believing it has already started to provide an affirmative response.
  2. Turn Masking: Using a few-shot, word-by-word construction to make the injected response resilient to disruption from platform-added chat template wrappers.
  3. Input Segmentation: Splitting sensitive keywords with injected special tokens to evade detection by content moderators, which may fail to reconstruct the original term, while the more capable target LLM can.
  4. Semantic Mimicry: Bypassing special token sanitization defenses by substituting them with regular tokens that have a minimal L2 norm distance in the embedding space, thereby preserving the token's structural function.
MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation
Affects: llama-3.3-70b-instruct, qwen-2.5-72b-instruct, gemma-2-27b-instruct, phi4-14b, llamaguard, llama-3.1-405b, llama-3.1-8b, gpt-4.1, claude-opus-4, llamaguard3-8b, promptguard-86m, shieldgemma2-27b
Page 1 of 16

© 2025 Promptfoo. All rights reserved.