Model Layer Vulnerabilities

Vulnerabilities targeting the core model architecture and parameters

Related Vulnerabilities

134 entries

Academic Paper Trust Jailbreak

7/28/2025

Large Language Models (LLMs) are vulnerable to a jailbreak attack termed Paper Summary Attack (PSA). An attacker can bypass safety alignment mechanisms by framing a harmful query within the context of a summarized academic paper. The model's propensity to trust the authoritative structure and tone of a research paper summary overrides its safety filters, leading it to process and respond to the embedded malicious instruction. The vulnerability is particularly potent when using summaries of papers on LLM safety itself (both attack-focused and defense-focused research), exposing significant and differing alignment biases across models.

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Affects: deepseek-r1, claude3.5-sonnet, gpt-4o, llama3.1-8b-instruct, vicuna-7b-v1.5, llama2-7b-chat-hf

Activation Steering Leaks PII

7/14/2025

Large Language Models (LLMs) are vulnerable to activation steering attacks that bypass safety and privacy mechanisms. By manipulating internal attention head activations using lightweight linear probes trained on refusal/disclosure behavior, an attacker can induce the model to reveal Personally Identifiable Information (PII) memorized during training, including sensitive attributes like sexual orientation, relationships, and life events. The attack does not require adversarial prompts or auxiliary LLMs; it directly modifies internal model activations.

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Affects: llama 7b, qwen 7b, gemma 9b, glm 9b, gpt-4, gpt-4 mini, llama 2-7b

Adversarial LLM Internal Attack

7/14/2025

Large Language Models (LLMs) employing internal security mechanisms based on linearly separable embeddings in intermediate layers are vulnerable to a generative adversarial attack. The CAVGAN framework exploits this vulnerability by generating adversarial perturbations that cause malicious inputs to be misclassified as benign, allowing the attacker to bypass the LLM's safety filters and elicit harmful outputs.

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Affects: llama 3.1-8b, qwen2.5-7b, mistral-8b, qwen2.5-14b, qwen2.5-32b
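
To make the "linearly separable embeddings" weakness in this entry concrete, the following is a minimal toy sketch, not the CAVGAN method itself. It assumes a hypothetical difference-of-means probe (`LinearConceptProbe`) over made-up 2-D vectors, and it replaces the paper's trained generator with a closed-form shift along the probe direction. The point is only that a check resting on a single linear boundary can be crossed by a small, targeted change to the representation.

```python
from typing import List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class LinearConceptProbe:
    """Hypothetical probe: score > 0 means 'flagged', score <= 0 means 'benign'."""

    def __init__(self, flagged: Sequence[Vector], benign: Sequence[Vector]) -> None:
        mu_f, mu_b = mean(flagged), mean(benign)
        # The direction points from the benign mean to the flagged mean.
        self.direction = [f - b for f, b in zip(mu_f, mu_b)]
        midpoint = [(f + b) / 2 for f, b in zip(mu_f, mu_b)]
        # The decision boundary passes through the midpoint of the two means.
        self.bias = -dot(self.direction, midpoint)

    def score(self, x: Vector) -> float:
        return dot(self.direction, x) + self.bias

    def minimal_shift(self, x: Vector, margin: float = 1e-3) -> Vector:
        """Smallest move along the probe direction that puts the score at -margin."""
        current = self.score(x)
        if current <= 0:
            return list(x)  # already on the benign side; nothing to do
        step = (current + margin) / dot(self.direction, self.direction)
        return [xi - step * di for xi, di in zip(x, self.direction)]


# Toy data (assumed for illustration only).
probe = LinearConceptProbe(
    flagged=[[2.0, 2.0], [3.0, 2.0]],
    benign=[[-2.0, -2.0], [-3.0, -2.0]],
)
x = [2.0, 1.0]
shifted = probe.minimal_shift(x)
```

In this toy setting `x` starts on the flagged side and `shifted` lands just across the boundary. One design implication is that a defense built on one linear direction may benefit from additional, independent checks.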

LVLM Vision-Triggered Resource Exhaustion

7/28/2025

A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft a subtle, imperceptible adversarial perturbation and apply it to an input image. When this image is processed by an LVLM, even with a benign text prompt, it forces the model into an unbounded generation loop. The attack, named RECALLED, uses a gradient-based optimization process to create a visual perturbation that steers the model's text generation towards a predefined, repetitive sequence (an "Output Recall" target). This causes the model to generate text that repeats a word or sentence until the maximum context limit is reached, leading to a denial-of-service condition through excessive computational resource usage and response latency.

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

Affects: llava, qwen-vl, instructblip, qwen3b, qwen7b, qwen32b, llava7b, llava13b, blip7b, blip13b, llava-1.5-hf, qwen/qwen2.5-vl-instruct, instructblip-vicuna, meta-llama
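
Because the attack in this entry drives the model to repeat a word or sentence until the context limit is reached, one plausible mitigation is to stop generation once the output tail becomes periodic. The sketch below is an assumption-laden illustration and is not taken from the RECALLED paper: `find_repeating_tail` and `generate_with_guard` are hypothetical helpers, the thresholds are arbitrary, and the token stream is a plain Python iterable standing in for a model's decoding loop.

```python
from typing import Iterable, List, Optional


def find_repeating_tail(
    tokens: List[str], max_period: int = 8, min_repeats: int = 4
) -> Optional[int]:
    """Return the period of a loop at the end of `tokens`, or None.

    A loop of period p means the last p tokens occur `min_repeats` times
    back to back at the tail of the sequence.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            break  # longer periods need even more tokens
        tail = tokens[-needed:]
        unit = tail[-period:]
        if all(tail[i:i + period] == unit for i in range(0, needed, period)):
            return period
    return None


def generate_with_guard(stream: Iterable[str], max_tokens: int = 256) -> List[str]:
    """Consume a token stream, stopping on a detected loop or at the token cap."""
    out: List[str] = []
    for token in stream:
        out.append(token)
        if len(out) >= max_tokens or find_repeating_tail(out) is not None:
            break
    return out


# Simulated runaway output: a short prefix followed by one word repeated.
runaway = ["The", "image", "shows"] + ["cat"] * 1000
truncated = generate_with_guard(runaway)
```

A guard like this caps the cost of a looping response, although legitimate repetitive output (lists, tables) may need looser thresholds than the ones assumed here.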

Agentic Red-Teaming Uncovers Novel Jailbreaks

7/28/2025

Large Language Models (LLMs) are vulnerable to jailbreaking through an agentic attack framework called Composition of Principles (CoP). This technique uses an attacker LLM (Red-Teaming Agent) to dynamically select and combine multiple human-defined, high-level transformations ("principles") into a single, sophisticated prompt. The composition of several simple principles, such as expanding context, rephrasing, and inserting specific phrases, creates complex adversarial prompts that can bypass safety and alignment mechanisms designed to block single-tactic or more direct harmful requests. This allows an attacker to elicit policy-violating or harmful content in a single turn.

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Affects: llama-2-7b-chat, llama-2-13b-chat, llama-2-70b-chat, llama-3-8b-instruct, llama-3-8b-instruct-rr, llama-3-70b-instruct, llama-3-8b-chat, gemma-7b-it, gpt-4-1106-preview, gpt-4-turbo-1106, gemini pro 1.5, openai o1, claude-3.5 sonnet, grok-2, gpt-4, gpt-4o, llama-2-13b

Bitstream Camouflage Jailbreak

7/14/2025

A black-box attack, dubbed BitBypass, jailbreaks aligned LLMs by camouflaging harmful prompts as hyphen-separated bitstreams. Sensitive words in the prompt are converted to their bitstream representations and replaced with placeholders, and a specially crafted system prompt instructs the LLM to convert the bitstream back to text and respond as if given the original harmful prompt.

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Affects: gpt-4o, gemini 1.5 pro, claude 3.5 sonnet, llama 3.1 70b, mixtral 8x22b
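
From a defender's point of view, the camouflage described in this entry suggests normalizing inputs before they reach a safety filter. The sketch below shows one way that could look. It assumes, without confirmation from the paper, that each character is encoded as an 8-bit ASCII group and that groups are joined by hyphens; `to_bitstream` and `decode_bitstreams` are hypothetical helper names, and the example uses a harmless word.

```python
import re

# Assumption: two or more 8-bit groups joined by hyphens, e.g. "01100011-01100001".
BITSTREAM = re.compile(r"\b[01]{8}(?:-[01]{8})+\b")


def to_bitstream(word: str) -> str:
    """Encode ASCII text as hyphen-separated 8-bit groups (illustration only)."""
    return "-".join(format(b, "08b") for b in word.encode("ascii"))


def decode_bitstreams(text: str) -> str:
    """Replace each hyphen-separated bitstream in `text` with its decoded characters."""

    def _decode(match: re.Match) -> str:
        return "".join(chr(int(group, 2)) for group in match.group(0).split("-"))

    return BITSTREAM.sub(_decode, text)


encoded = to_bitstream("cat")
normalized = decode_bitstreams(f"describe a {encoded} please")
```

Running the filter on `normalized` rather than on the raw input means the decoded words are visible to it. Other encodings (different group sizes or separators) would need their own patterns.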

Distilled Jailbreak Attacks

7/14/2025

Adversarial prompt distillation transfers jailbreak capability from a large language model (LLM) to a small language model (SLM), enabling efficient and stealthy jailbreak attacks. The attack leverages knowledge distillation techniques, reinforcement learning, and dynamic temperature control to transfer the LLM's ability to bypass safety mechanisms to a smaller, more easily deployable SLM. This allows for lower computational cost attacks with a potentially high success rate.

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Affects: gpt-4, gpt-3.5-turbo, llama-2, vicuna-7b, bert-base-uncased, llama-3.2-1b, llama-2-7b, llama-2-13b, vicuna-13b, gpt-4o, gemma2 2b, gemma2 27b, llama-3.1
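
For readers unfamiliar with the knowledge distillation and temperature terms used in this entry, the sketch below shows the generic, textbook form of a temperature-softened distillation loss. It is not the paper's training objective: how the paper schedules temperature or combines this with reinforcement learning is not stated here, and the function names and logits are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float], student_logits: List[float], temperature: float = 2.0
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy logits (assumed): the student disagrees with the teacher on the top class.
teacher = [4.0, 1.0, 0.5]
student = [1.0, 3.0, 0.5]
loss = distillation_loss(teacher, student)
```

Minimizing a loss of this shape pulls the student's output distribution toward the teacher's, which is the general mechanism by which behavior is transferred to a smaller model.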

Hybrid LLM Jailbreak Strategy

7/14/2025

A hybrid jailbreak attack, combining gradient-guided token optimization (GCG) with iterative prompt refinement (PAIR or WordGame+), bypasses LLM safety mechanisms, resulting in the generation of disallowed content. The hybrid approach leverages the strengths of both techniques, circumventing defenses that are effective against single-mode attacks. Specifically, the combination of semantically crafted prompts and strategically placed adversarial tokens confuses and overwhelms existing defenses.

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Affects: vicuna-7b, llama 2, llama 3, gpt-3.5 (gpt-3.5-turbo), gpt-4 (gpt-4-0314), claude-1, claude-2, mistral-sorry-bench, meta-llama-guard-2-8b, deepseek-r1:70b, llama-3-8b, vicuna-7b-v1.5, llama2-7b

Stealthy Unlearning Degradation

6/30/2025

A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.

Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy

Affects: llama 3.1 (8b), mistral v0.3 (7b)
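
Since the manipulation in this entry works by raising the frequency of common benign tokens in the forgetting data, a simple screening step is to compare token frequencies in a forgetting request against a reference corpus. The sketch below is a heuristic illustration only and is not the remedy proposed in the paper; `inflated_tokens`, the whitespace-level token lists, and the `ratio` and `min_count` thresholds are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def relative_freq(tokens: List[str]) -> Dict[str, float]:
    """Map each token to its share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def inflated_tokens(
    forget_tokens: List[str],
    reference_tokens: List[str],
    ratio: float = 3.0,
    min_count: int = 5,
) -> Dict[str, float]:
    """Flag tokens known from reference text that are over-represented in forget data."""
    ref = relative_freq(reference_tokens)
    forget_counts = Counter(forget_tokens)
    total = sum(forget_counts.values())
    flagged: Dict[str, float] = {}
    for tok, n in forget_counts.items():
        # Skip rare tokens and tokens absent from the reference (likely target knowledge).
        if n < min_count or tok not in ref:
            continue
        observed_ratio = (n / total) / ref[tok]
        if observed_ratio >= ratio:
            flagged[tok] = observed_ratio
    return flagged


# Toy corpora (assumed): "please" is benign but heavily inflated in the request.
reference = ["please"] * 10 + ["the"] * 40 + ["story"] * 50
forget_request = ["please"] * 60 + ["the"] * 20 + ["secret"] * 20
suspicious = inflated_tokens(forget_request, reference)
```

Tokens flagged this way could be reviewed or down-weighted before unlearning runs. A real deployment would use the model's own tokenizer and a much larger reference corpus.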

Twin Prompt Jailbreak

7/14/2025

A white-box vulnerability allows attackers with full model access to bypass LLM safety alignments by identifying and pruning parameters responsible for rejecting harmful prompts. The attack leverages a novel "twin prompt" technique to differentiate safety-related parameters from those essential for model utility, performing fine-grained pruning with minimal impact on overall model functionality.

TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

Affects: llama 2 7b, llama 2 13b, llama 2 70b, llama 3.1 8b, llama 3.3 70b, gemma 2 2b, gemma 2 9b, gemma 2 27b, gemma 3 1b, qwen 2.5 3b, qwen 2.5 7b, qwen 2.5 14b, qwen 2.5 32b, qwen 2.5 72b, mistral 7b, deepseek 7b
