Methods for bypassing model safety measures
Large Language Models (LLMs) are vulnerable to a jailbreak attack termed Paper Summary Attack (PSA). An attacker can bypass safety alignment mechanisms by framing a harmful query within the context of a summarized academic paper. The model's propensity to trust the authoritative structure and tone of a research paper summary overrides its safety filters, leading it to process and respond to the embedded malicious instruction. The vulnerability is particularly potent when the summarized papers concern LLM safety itself (both attack- and defense-focused research), and it exposes significant, differing alignment biases across models.
Large Language Models (LLMs) are vulnerable to activation steering attacks that bypass safety and privacy mechanisms. By manipulating internal attention head activations using lightweight linear probes trained on refusal/disclosure behavior, an attacker can induce the model to reveal Personally Identifiable Information (PII) memorized during training, including sensitive attributes like sexual orientation, relationships, and life events. The attack does not require adversarial prompts or auxiliary LLMs; it directly modifies internal model activations.
Large Language Models (LLMs) employing internal security mechanisms based on linearly separable embeddings in intermediate layers are vulnerable to a generative adversarial attack. The CAVGAN framework exploits this vulnerability by generating adversarial perturbations that misclassify malicious inputs as benign, allowing the attacker to bypass the LLM's safety filters and elicit harmful outputs.
Large Language Models (LLMs) employing safety filters designed to prevent the generation of content related to self-harm and suicide can be bypassed through multi-step adversarial prompting. By reframing the request as an academic exercise or hypothetical scenario, users can elicit detailed instructions and information that could facilitate self-harm or suicide, even after initially expressing harmful intent. The vulnerability lies in the failure of existing safety filters to consistently recognize and block harmful outputs once the conversational context shifts.
A vulnerability exists in Large Language Models, including GPT-3.5 and GPT-4, where safety guardrails can be bypassed using Trojanized prompt chains within a simulated educational context. An attacker can establish a benign, pedagogical persona (e.g., a curious student) over a multi-turn dialogue. This initial context is then exploited to escalate the conversation toward requests for harmful or restricted information, which the model provides because the session's context is perceived as safe. The vulnerability stems from the moderation system's failure to detect semantic escalation and topic drift within an established conversational context. Two primary methods were identified: Simulated Child Confusion (SCC), which uses a naive persona to ask for dangerous information under a moral frame (e.g., "what not to do"), and Prompt Chain Escalation via Literary Devices (PCELD), which frames harmful concepts as an academic exercise in satire or metaphor.
Multimodal Large Language Models (MLLMs) are vulnerable to visual contextual attacks, where carefully crafted images and accompanying text prompts can bypass safety mechanisms and elicit harmful responses. The vulnerability stems from the MLLM's ability to integrate visual and textual context to generate outputs, allowing attackers to create realistic scenarios that subvert safety filters. Specifically, the attack leverages image-driven context injection to construct deceptive multi-turn conversations that gradually lead the MLLM to produce unsafe responses.
Large Language Models (LLMs) are vulnerable to obfuscation-based jailbreak attacks using the MetaCipher framework. MetaCipher employs a reinforcement learning algorithm to iteratively select from a pool of 21 ciphers to encrypt malicious keywords within prompts, evading standard safety mechanisms that rely on keyword detection. The framework adaptively learns optimal cipher choices to maximize the success rate of the jailbreak, even against LLMs with reasoning capabilities. Successful attacks bypass safety guardrails, leading to the execution of malicious requests masked as benign input.
Large Language Models (LLMs) are vulnerable to jailbreaking through an agentic attack framework called Composition of Principles (CoP). This technique uses an attacker LLM (Red-Teaming Agent) to dynamically select and combine multiple human-defined, high-level transformations ("principles") into a single, sophisticated prompt. The composition of several simple principles, such as expanding context, rephrasing, and inserting specific phrases, creates complex adversarial prompts that can bypass safety and alignment mechanisms designed to block single-tactic or more direct harmful requests. This allows an attacker to elicit policy-violating or harmful content in a single turn.
Large Language Models (LLMs) are vulnerable to a novel adversarial attack, Alphabet Index Mapping (AIM), which achieves high success rates in bypassing safety filters ("jailbreaking"). AIM encodes a prompt by replacing each character with its numerical position in the alphabet, making the encoded text semantically dissimilar to the original while keeping the decoding instructions simple. This allows malicious prompts to evade detection based on semantic similarity, even when the LLM correctly decodes the intent.
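As an illustration of the transform only (not the paper's reference implementation), the sketch below maps letters to 1-based alphabet indices and back using a benign example string; the hyphen and "/" separators and the handling of non-letter characters are assumptions made here for readability.

```python
# Minimal sketch of an alphabet-index encoding of the kind AIM describes.
# Separator conventions and non-letter handling are illustrative assumptions.

def aim_encode(text: str) -> str:
    """Replace each letter with its 1-based position in the alphabet."""
    out = []
    for ch in text.lower():
        if ch.isalpha():
            out.append(str(ord(ch) - ord("a") + 1))
        elif ch == " ":
            out.append("/")  # word separator (assumed convention)
        # other characters are dropped in this sketch
    return "-".join(out)

def aim_decode(encoded: str) -> str:
    """Invert the mapping: indices back to letters, '/' back to spaces."""
    chars = []
    for token in encoded.split("-"):
        if token == "/":
            chars.append(" ")
        elif token.isdigit():
            chars.append(chr(int(token) + ord("a") - 1))
    return "".join(chars)

print(aim_encode("hello world"))                # 8-5-12-12-15-/-23-15-18-12-4
print(aim_decode(aim_encode("hello world")))    # hello world
```

The encoded string shares almost no surface semantics with the original, which is why similarity-based detectors struggle, while the decoding rule remains trivial for the model to follow.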
A novel black-box attack, dubbed BitBypass, exploits aligned LLMs by camouflaging harmful prompts as hyphen-separated bitstreams. The attack transforms sensitive words into their bitstream representations and replaces them in the prompt with placeholders, then pairs the rewritten prompt with a specially crafted system prompt that instructs the LLM to convert the bitstream back to text and respond as if it had received the original harmful prompt, thereby bypassing safety alignment mechanisms.
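The encoding step itself is simple; a minimal sketch follows, using a benign placeholder word. The 8-bit ASCII representation and hyphen separator follow the description above, while the helper names, the placeholder token, and the single-word substitution are illustrative assumptions (the crafted system prompt that completes the attack is not reproduced here).

```python
# Minimal sketch of the hyphen-separated bitstream substitution BitBypass describes.
# Helper names and the placeholder format are illustrative assumptions.

def to_bitstream(word: str) -> str:
    """Encode a word as hyphen-separated 8-bit ASCII values."""
    return "-".join(format(ord(ch), "08b") for ch in word)

def from_bitstream(bits: str) -> str:
    """Decode a hyphen-separated bitstream back to text."""
    return "".join(chr(int(b, 2)) for b in bits.split("-"))

def camouflage(prompt: str, sensitive_word: str) -> tuple[str, str]:
    """Replace a chosen word with a placeholder and return the bitstream
    that the accompanying system prompt would be asked to substitute back."""
    placeholder = "[WORD]"
    return prompt.replace(sensitive_word, placeholder), to_bitstream(sensitive_word)

rewritten, bits = camouflage("describe the chemistry of caffeine", "caffeine")
print(rewritten)               # describe the chemistry of [WORD]
print(bits)                    # 01100011-01100001-01100110-...
print(from_bitstream(bits))    # caffeine
```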