Vulnerabilities targeting the core model architecture and parameters
A vulnerability exists where non-autoregressive Diffusion Language Models (DLLMs) can be leveraged to generate highly effective and transferable adversarial prompts against autoregressive LLMs. The technique, named INPAINTING, reframes the resource-intensive search for adversarial prompts into an efficient, amortized inference task. By providing a desired harmful or restricted response to a DLLM, the model can conditionally generate a corresponding low-perplexity prompt that elicits that response from a wide range of target models. The generated prompts often reframe the malicious request into a benign-appearing context (e.g., asking for an example of harmful content for educational purposes), making them difficult to detect via standard perplexity filters.
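A minimal sketch of the conditional-generation step this describes, assuming a generic mask-infilling interface for the diffusion model; `dllm_infill`, the `[MASK]` convention, and the word-level handling are illustrative placeholders rather than the authors' code or any real library API:

```python
# Sketch only: the desired response is held fixed while the prompt slots are
# masked, and the diffusion model denoises the masked span into a prompt.
from typing import Callable, List

def inpaint_prompt(dllm_infill: Callable[[List[str]], List[str]],
                   desired_response: str,
                   prompt_len: int = 32) -> str:
    """Conditionally generate a prompt for a fixed response (hypothetical API)."""
    sequence = ["[MASK]"] * prompt_len + desired_response.split()
    filled = dllm_infill(sequence)        # the model fills only the masked slots
    return " ".join(filled[:prompt_len])  # the recovered prompt
```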
A vulnerability in Large Language Models (LLMs) allows for systematic jailbreaking through a meta-optimization framework called AMIS (Align to MISalign). The attack uses a bi-level optimization process to co-evolve both the jailbreak prompts and the scoring templates used to evaluate them.
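A structural sketch of the bi-level loop this describes, with every helper stubbed out; the mutation, scoring, and template-refinement functions below are placeholders introduced for illustration and are not the AMIS implementation:

```python
# Structural sketch: an inner loop evolves prompts under the current scoring
# template, and an outer loop then refines the scoring template itself.
import random
from typing import List

def mutate(prompts: List[str]) -> List[str]:
    return prompts + [p + " (rephrased)" for p in prompts]   # placeholder mutation

def score(template: str, prompt: str) -> float:
    return random.random()                                    # placeholder judge score

def refine_template(template: str, elite_prompts: List[str]) -> str:
    return template + " [tightened]"                          # placeholder template update

prompts, template = ["seed prompt"], "rate policy violation 1-10"
for _ in range(3):                      # outer loop: co-evolve the scoring template
    for _ in range(5):                  # inner loop: evolve prompts under that template
        candidates = mutate(prompts)
        prompts = sorted(candidates, key=lambda p: score(template, p), reverse=True)[:4]
    template = refine_template(template, prompts)
```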
A jailbreak vulnerability, known as Task Concurrency, exists in multiple Large Language Models (LLMs). The vulnerability arises when two distinct tasks, one harmful and one benign, are interleaved at the word level within a single prompt. The structure of the malicious prompt alternates words from each task, often using separators like {} to encapsulate words from the second task. This "concurrent" instruction format obfuscates the harmful intent from the model's safety guardrails, causing the LLM to process and generate a response to the harmful query, which it would otherwise refuse. The attacker can then extract the harmful content from the model's interleaved output.
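The interleaving itself is plain string manipulation; the toy function below shows the word-level alternation and `{}` wrapping described above, using two benign sentences in place of a harmful/benign pair:

```python
def interleave(task1: str, task2: str) -> str:
    """Alternate the words of two tasks, wrapping task-2 words in {} separators."""
    w1, w2 = task1.split(), task2.split()
    out = []
    for i in range(max(len(w1), len(w2))):
        if i < len(w1):
            out.append(w1[i])
        if i < len(w2):
            out.append("{" + w2[i] + "}")
    return " ".join(out)

print(interleave("Describe the water cycle in detail",
                 "Summarize the plot of Hamlet briefly"))
# Describe {Summarize} the {the} water {plot} cycle {of} in {Hamlet} detail {briefly}
```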
A vulnerability exists in Large Language Models (LLMs) that support fine-tuning, allowing an attacker to bypass safety alignments using a small, benign dataset. The attack, "Attack via Overfitting," is a two-stage process. In Stage 1, the model is fine-tuned on a small set of benign questions (e.g., 10) paired with identical, repetitive refusal answers. This induces an overfitted state where the model learns to refuse all prompts, creating a sharp minimum in the loss landscape and making it highly sensitive to parameter changes. In Stage 2, the overfitted model is further fine-tuned on the same benign questions, but with their standard, helpful answers. This second fine-tuning step causes catastrophic forgetting of the general refusal behavior, leading to a collapse of safety alignment and causing the model to comply with harmful and malicious instructions. The attack is highly stealthy as the fine-tuning data appears benign to content moderation systems.
Appending simple demographic persona details to prompts requesting policy-violating content can bypass the safety mechanisms of Large Language Models. This technique, referred to as persona-targeted prompting, adds details such as country, generation, and political orientation to a request for a harmful narrative (e.g., disinformation). This systematically increases the jailbreak rate across most tested models and languages, in some cases by over 10 percentage points, enabling the generation of harmful content that would otherwise be refused.
Large Language Models (LLMs) are vulnerable to jailbreak attacks that use persuasive techniques grounded in social psychology to bypass safety alignments. Malicious instructions can be reframed using one of Cialdini's seven principles of persuasion (Authority, Reciprocity, Commitment, Social Proof, Liking, Scarcity, and Unity). These rephrased prompts, which remain human-readable and can be generated automatically, manipulate the LLM into complying with harmful requests it would otherwise refuse. The attack's effectiveness varies by principle and by model, revealing distinct "persuasive fingerprints" of susceptibility.
Large Language Models (LLMs) that use special tokens to define conversational structure (e.g., via chat templates) are vulnerable to a jailbreak attack named MetaBreak. An attacker can inject these special tokens, or regular tokens with high semantic similarity to them in the embedding space, into a user prompt. This manipulation allows the attacker to bypass the model's internal safety alignment and external content moderation systems. The attack is built from four primitives.
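The "high semantic similarity in the embedding space" lookup can be illustrated with a standard nearest-neighbor search over the model's input embeddings; the model name and special token below are assumptions for the sketch, which shows only this lookup step, not MetaBreak's full pipeline:

```python
# Find ordinary vocabulary tokens whose embeddings lie closest to a
# chat-template special token (illustrative model and token choices).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"          # assumed example model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()       # [vocab_size, hidden_dim]
special_id = tok.convert_tokens_to_ids("<|eot_id|>")     # assumed special token
target = emb[special_id]

sims = torch.nn.functional.cosine_similarity(emb, target.unsqueeze(0), dim=-1)
sims[special_id] = -1.0                                   # exclude the token itself
nearest = torch.topk(sims, k=10).indices
print(tok.convert_ids_to_tokens(nearest.tolist()))        # closest look-alike tokens
```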
Certain safety-aligned Large Language Models (LLMs) are vulnerable to an untargeted, gradient-based optimization attack called the Untargeted Jailbreak Attack (UJA). Unlike previous targeted attacks (e.g., GCG) that optimize a prompt to elicit a predefined string (e.g., "Sure, here is..."), UJA optimizes a more general objective: maximizing the unsafety probability of the model's response, as quantified by an external judge model.
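The difference between the two objectives can be written compactly; the notation below is illustrative and not taken from the paper:

```latex
% Targeted (e.g., GCG): force a fixed affirmative prefix y*.
% Untargeted (UJA): maximize judge-assessed unsafety of the response itself.
\begin{align*}
  &\min_{p}\; -\log P_{\theta}\!\left(y^{\star}\mid p\right) \\
  &\max_{p}\; \mathbb{E}_{y\sim P_{\theta}(\cdot\mid p)}\!\left[\,J_{\mathrm{unsafe}}(y)\,\right]
\end{align*}
```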
Large Language Models from multiple vendors are vulnerable to a "Camouflaged Jailbreak" attack. Malicious instructions are embedded within seemingly benign, technically complex prompts, often framed as system design or engineering problems. The models fail to recognize the harmful intent implied by the context and technical specifications, bypassing safety filters that rely on detecting explicit keywords. This leads to the generation of detailed, technically plausible instructions for creating dangerous devices or systems. The attack has a high success rate, with models demonstrating full obedience in over 94% of tested cases, treating the harmful requests as legitimate.