Api Vulnerabilities

Vulnerabilities in model API implementations

Related Vulnerabilities

25 entries

Overfitting-induced Benign Jailbreak

10/13/2025

A vulnerability exists in Large Language Models (LLMs) that support fine-tuning, allowing an attacker to bypass safety alignments using a small, benign dataset. The attack, "Attack via Overfitting," is a two-stage process. In Stage 1, the model is fine-tuned on a small set of benign questions (e.g., 10) paired with identical, repetitive refusal answers. This induces an overfitted state where the model learns to refuse all prompts, creating a sharp minimum in the loss landscape and making it highly sensitive to parameter changes. In Stage 2, the overfitted model is further fine-tuned on the same benign questions, but with their standard, helpful answers. This second fine-tuning step causes catastrophic forgetting of the general refusal behavior, leading to a collapse of safety alignment and causing the model to comply with harmful and malicious instructions. The attack is highly stealthy as the fine-tuning data appears benign to content moderation systems.

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Affects: llama2-7b-chat-hf, llama3-8b-instruct, deepseek-r1-distill-llama3-8b, qwen2.5-7b-instruct, qwen3-8b, gpt-4o-mini, gpt-4.1-mini, gpt-3.5-turbo, gpt-4o, gpt-4.1

Special Token Jailbreak

10/31/2025

Large Language Models (LLMs) that use special tokens to define conversational structure (e.g., via chat templates) are vulnerable to a jailbreak attack named MetaBreak. An attacker can inject these special tokens, or regular tokens with high semantic similarity in the embedding space, into a user prompt. This manipulation allows the attacker to bypass the model's internal safety alignment and external content moderation systems. The attack leverages four primitives:

Response Injection: Forging an assistant's turn within the user prompt to trick the model into believing it has already started to provide an affirmative response.
Turn Masking: Using a few-shot, word-by-word construction to make the injected response resilient to disruption from platform-added chat template wrappers.
Input Segmentation: Splitting sensitive keywords with injected special tokens to evade detection by content moderators, which may fail to reconstruct the original term, while the more capable target LLM can.
Semantic Mimicry: Bypassing special token sanitization defenses by substituting them with regular tokens that have a minimal L2 norm distance in the embedding space, thereby preserving the token's structural function.

MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

Affects: llama-3.3-70b-instruct, qwen-2.5-72b-instruct, gemma-2-27b-instruct, phi4-14b, llamaguard, llama-3.1-405b, llama-3.1-8b, gpt-4.1, claude-opus-4, llamaguard3-8b, promptguard-86m, shieldgemma2-27b

EchoLeak Zero-Click Data Exfiltration

9/30/2025

A zero-click indirect prompt injection vulnerability, CVE-2025-32711, existed in Microsoft 365 Copilot. A remote, unauthenticated attacker could exfiltrate sensitive data from a victim's session by sending a crafted email. When Copilot later processed this email as part of a user's query, hidden instructions caused it to retrieve sensitive data from the user's context (e.g., other emails, documents) and embed it into a URL. The attack chain involved bypassing Microsoft's XPIA prompt injection classifier, evading link redaction filters using reference-style Markdown, and abusing a trusted Microsoft Teams proxy domain to bypass the client-side Content Security Policy (CSP), resulting in automatic data exfiltration without any user interaction.

EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System

MDH: Hybrid Jailbreak Detection Strategy

8/31/2025

Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Affects: gpt-4o, gemini-2.0-flash, claude-sonnet-4, doubao-lite-32k, gpt-3.5, gpt-4.1, o1, o3, o4, o1-mini, o3-mini, o4-mini, llama guard, gpt-4o, abab6.5s-chat-pro, grok-3, llamaguard-3-1b, llama-guard-3-8b, llama-guard-4-12b, llama-guard-3-11b-vision, gpt-3.5-turbo-1106, gpt-4o-2024-08-06, gpt-4.1-2025-04-14, o1-mini-2024-09-12, o1-2024-12-17, o3-mini-2025-01-31, o3-2025-04-16, o4-mini-2025-04-16, llama-guard-3-11b, abab5.5-chat-pro, abab5.5-chat, abab6.5s-chat, doubao-pro-32k, doubao-lite-128k, doubao-pro-256k, doubao-seed-1.6, doubao-seed-1.6-thinking, claude-sonnet-4-20250514, deepseek-reasoner, doubao-1.5-vision-pro, gemini-2.5-pro-preview-06-05, deepseek-chat, grok-2, deepseek-r1-250528, deepseek-v3-0324, moonshot-v1-32k, moonshot-v1-128k, gpt-4o-mini, abab6.5-chat, abab5.5s-chat-pro, yi-large, abab6.5g-chat, abab6.5t-chat, claude-3-5-sonnet-20241022, claude-3-7-sonnet-20250219, llama3-70b-8192, yi-large-turbo

LLM Interpreter Resource Exhaustion

8/16/2025

Large Language Models (LLMs) equipped with native code interpreters are vulnerable to Denial of Service (DoS) via resource exhaustion. An attacker can craft a single prompt that causes the interpreter to execute code that depletes CPU, memory, or disk resources. The vulnerability is particularly pronounced when a resource-intensive task is framed within a plausibly benign or socially-engineered context ("indirect prompts"), which significantly lowers the model's likelihood of refusal compared to explicitly malicious requests.

Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security

Affects: gemini 2.0 flash, gemini 2.5 flash preview, gemini 2.5 pro preview, gpt-4.1 nano, gpt-4.1 mini, gpt-4.1, o4-mini, o3-pro

Variational Jailbreak Inference

7/14/2025

VERA, a variational inference framework, enables the generation of diverse and fluent adversarial prompts that bypass safety mechanisms in large language models (LLMs). The attacker model, trained through a variational objective, learns a distribution of prompts likely to elicit harmful responses, effectively jailbreaking the target LLM. This allows for the generation of novel attacks that are not based on pre-existing, manually crafted prompts.

VERA: Variational Inference Framework for Jailbreaking Large Language Models

Affects: llama2-7b-chat, llama2-13b-chat, vicuna-7b, baichuan-2-7b, orca2-7b, zephyr-7b-robust, gemini-pro, gpt-3.5-turbo1106, vicuna-7b chat, llama3-8b, mistral-7b, llama2-13b

LLM Self-Introspection Jailbreak

5/31/2025

A vulnerability exists in Large Language Models (LLMs) that allows attackers to manipulate the model's output by modifying token log probabilities. Attackers can use a lightweight plug-in model (BiasNet) to subtly alter the probabilities, steering the LLM toward generating harmful content even when safety mechanisms are in place. This attack requires only access to the top-k token log probabilities returned by the LLM's API, without needing model weights or internal access.

JULI: Jailbreak Large Language Models by Self-Introspection

Affects: llama3-3b-instruct, llama3-8b-instruct, llama2-7b-chat, qwen2-1.5b-instruct, llama3-8b-cb, llama3-1b-instruct, mistral-7b, qwen2.5-1.5b-inst, llama3-8b-inst

Dialogue History Jailbreak

3/19/2025

Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.

Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation

Affects: llama-3.1-8b, gpt-4o, llama-3, gemma-2-2b, gemma-2-9b, gemma-2-27b, qwen-2-7b, gpt-4o-mini, llama-3.2-11b, llama-2-7b, llama-3-70b, llama-3-8b

Schema-Guided LLM Jailbreak

4/12/2025

Large Language Models (LLMs) with structured output APIs (e.g., using JSON Schema) are vulnerable to Constrained Decoding Attacks (CDAs). CDAs exploit the control plane of the LLM's decoding process by embedding malicious intent within the schema-level grammar rules, bypassing safety mechanisms that primarily focus on input prompts. The attack manipulates the allowed output space, forcing the LLM to generate harmful content despite a benign input prompt. One instance of a CDA is the Chain Enum Attack, which leverages JSON Schema's enum feature to inject malicious options into the allowed output, achieving high success rates.

Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms

Affects: gpt-4o, gpt-4o-mini, gemini-2.0-flash, phi-3.5-moe, mistral nemo, qwen-2.5-32b, llama-3.1-8b, gemma2-9b

Flowchart-based LVLM Jailbreak Attack

3/19/2025

FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.

FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts

Affects: gemini-1.5 pro, llava-next, qwen2-vl, internvl-2.5, gpt-4o mini, gpt-4o, claude-3.5 sonnet, mistral 7b

Page 1 of 3