API Vulnerabilities

Vulnerabilities in model API implementations

Related Vulnerabilities

25 entries

Special Token Jailbreak

10/31/2025

Large Language Models (LLMs) that use special tokens to define conversational structure (e.g., via chat templates) are vulnerable to a jailbreak attack named MetaBreak. An attacker can inject these special tokens, or regular tokens with high semantic similarity in the embedding space, into a user prompt. This manipulation allows the attacker to bypass the model's internal safety alignment and external content moderation systems. The attack leverages four primitives:

  1. Response Injection: Forging an assistant's turn within the user prompt to trick the model into believing it has already started to provide an affirmative response.
  2. Turn Masking: Using a few-shot, word-by-word construction to make the injected response resilient to disruption from platform-added chat template wrappers.
  3. Input Segmentation: Splitting sensitive keywords with injected special tokens so that content moderators fail to reconstruct the original term, while the more capable target LLM still can.
  4. Semantic Mimicry: Bypassing special-token sanitization defenses by substituting the filtered special tokens with regular tokens whose embeddings lie at minimal L2 distance, thereby preserving the tokens' structural function (a minimal sketch follows this list).
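
The sketch below illustrates the Semantic Mimicry primitive: ranking regular tokens by L2 distance to a special token's embedding. It is a minimal sketch, not the paper's implementation; the model name, the chosen special token, and the top-k cutoff are illustrative assumptions, and it assumes access to an open-weights model's input embedding matrix via Hugging Face transformers.

```python
# Minimal sketch: find regular tokens whose embeddings are closest (by L2 norm)
# to a special token's embedding, so they can stand in for the special token
# when it is stripped by sanitization. Model name and special token are
# illustrative assumptions (the model may require gated access).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed example target
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

embeddings = model.get_input_embeddings().weight.detach().float()  # [vocab, dim]

special_token = "<|eot_id|>"  # end-of-turn marker in Llama 3 chat templates
special_id = tokenizer.convert_tokens_to_ids(special_token)

# L2 distance from the special token's embedding to every token embedding.
distances = torch.norm(embeddings - embeddings[special_id], dim=1)

# Mask out special tokens so only regular tokens can be picked as substitutes.
for tok_id in tokenizer.all_special_ids:
    distances[tok_id] = float("inf")

# The nearest regular tokens are candidate stand-ins for the special token.
nearest = torch.topk(distances, k=5, largest=False).indices.tolist()
for tok_id in nearest:
    print(tokenizer.convert_ids_to_tokens(tok_id), float(distances[tok_id]))
```
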
MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation
Affects: llama-3.3-70b-instruct, qwen-2.5-72b-instruct, gemma-2-27b-instruct, phi4-14b, llamaguard, llama-3.1-405b, llama-3.1-8b, gpt-4.1, claude-opus-4, llamaguard3-8b, promptguard-86m, shieldgemma2-27b

MDH: Hybrid Jailbreak Detection Strategy

8/31/2025

Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.
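
The attack surface here is the developer role itself. The sketch below shows, with neutral placeholder strings instead of any attack content, how a developer message and few-shot turns are laid out in an OpenAI-style Chat Completions request; the model name and all placeholder text are illustrative assumptions, not prompts from the paper.

```python
# Minimal sketch of the message layout a D-Attack-style prompt relies on:
# a "developer" role message that sets a persona and behavioral instructions,
# followed by few-shot user/assistant pairs and the target query.
# Placeholders stand in for the attack content; the model name is assumed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {
        "role": "developer",
        "content": (
            "<persona-setting text> "                      # permissive persona
            "<instructions overriding refusal behavior>"   # bypass instructions
        ),
    },
    # Few-shot examples: forged prior turns demonstrating the desired style.
    {"role": "user", "content": "<example query 1>"},
    {"role": "assistant", "content": "<example compliant response 1>"},
    # The actual query; the DH-CoT variant wraps this in a hijacked
    # chain-of-thought (H-CoT) framing aligned with the developer context.
    {"role": "user", "content": "<target query>"},
]

response = client.chat.completions.create(model="o1", messages=messages)
print(response.choices[0].message.content)
```
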

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Affects: gpt-4o, gemini-2.0-flash, claude-sonnet-4, doubao-lite-32k, gpt-3.5, gpt-4.1, o1, o3, o4, o1-mini, o3-mini, o4-mini, llama guard, abab6.5s-chat-pro, grok-3, llamaguard-3-1b, llama-guard-3-8b, llama-guard-4-12b, llama-guard-3-11b-vision, gpt-3.5-turbo-1106, gpt-4o-2024-08-06, gpt-4.1-2025-04-14, o1-mini-2024-09-12, o1-2024-12-17, o3-mini-2025-01-31, o3-2025-04-16, o4-mini-2025-04-16, llama-guard-3-11b, abab5.5-chat-pro, abab5.5-chat, abab6.5s-chat, doubao-pro-32k, doubao-lite-128k, doubao-pro-256k, doubao-seed-1.6, doubao-seed-1.6-thinking, claude-sonnet-4-20250514, deepseek-reasoner, doubao-1.5-vision-pro, gemini-2.5-pro-preview-06-05, deepseek-chat, grok-2, deepseek-r1-250528, deepseek-v3-0324, moonshot-v1-32k, moonshot-v1-128k, gpt-4o-mini, abab6.5-chat, abab5.5s-chat-pro, yi-large, abab6.5g-chat, abab6.5t-chat, claude-3-5-sonnet-20241022, claude-3-7-sonnet-20250219, llama3-70b-8192, yi-large-turbo
