Integrity Vulnerabilities

Vulnerabilities impacting model output reliability

Related Vulnerabilities

271 entries

Academic Paper Trust Jailbreak

7/28/2025

Large Language Models (LLMs) are vulnerable to a jailbreak attack termed the Paper Summary Attack (PSA). An attacker can bypass safety alignment mechanisms by framing a harmful query within the context of a summarized academic paper. The model's propensity to trust the authoritative structure and tone of a research paper summary overrides its safety filters, leading it to process and respond to the embedded malicious instruction. The vulnerability is particularly potent when the summaries are of papers on LLM safety itself (both attack-focused and defense-focused research), exposing significant and differing alignment biases across models.

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Affects: deepseek-r1, claude3.5-sonnet, gpt-4o, llama3.1-8b-instruct, vicuna-7b-v1.5, llama2-7b-chat-hf

Adversarial LLM Internal Attack

7/14/2025

Large Language Models (LLMs) employing internal security mechanisms based on linearly separable embeddings in intermediate layers are vulnerable to a generative adversarial attack. The CAVGAN framework exploits this vulnerability by generating adversarial perturbations that cause malicious inputs to be misclassified as benign, allowing the attacker to bypass the LLM's safety filters and elicit harmful outputs.

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Affects: llama 3.1-8b, qwen2.5-7b, mistral-8b, qwen2.5-14b, qwen2.5-32b

LVLM Vision-Triggered Resource Exhaustion

7/28/2025

A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft a subtle, imperceptible adversarial perturbation and apply it to an input image. When this image is processed by an LVLM, even with a benign text prompt, it forces the model into an unbounded generation loop. The attack, named RECALLED, uses a gradient-based optimization process to create a visual perturbation that steers the model's text generation towards a predefined, repetitive sequence (an "Output Recall" target). This causes the model to generate text that repeats a word or sentence until the maximum context limit is reached, leading to a denial-of-service condition through excessive computational resource usage and response latency.

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

Affects: llava, qwen-vl, instructblip, qwen3b, qwen7b, qwen32b, llava7b, llava13b, blip7b, blip13b, llava-1.5-hf, qwen/qwen2.5-vl-instruct, instructblip-vicuna, meta-llama
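
The repeated-word or repeated-sentence output described above is something a serving layer could watch for during generation. A minimal sketch, assuming the output is available as a list of tokens; the function name `has_repeating_tail` and the thresholds `max_period` and `min_repeats` are hypothetical and not taken from the RECALLED paper:

```python
from typing import Sequence


def has_repeating_tail(
    tokens: Sequence[str],
    max_period: int = 32,   # assumed: longest repeating unit to look for
    min_repeats: int = 4,   # assumed: repetitions required before flagging
) -> bool:
    """Return True if the end of `tokens` is one unit repeated `min_repeats` times."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # not enough tokens to hold this many repeats
        tail = tokens[n - span:]
        unit = tail[:period]
        # Every position in the tail must match the unit at the same offset.
        if all(tail[i] == unit[i % period] for i in range(span)):
            return True
    return False
```

A generation loop could call this every few tokens and stop early when it returns True. A check like this only flags the symptom (the loop); it does not detect the adversarial image itself.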

Trojan Prompt Chains in Education

7/28/2025

A vulnerability exists in Large Language Models, including GPT-3.5 and GPT-4, where safety guardrails can be bypassed using Trojanized prompt chains within a simulated educational context. An attacker can establish a benign, pedagogical persona (e.g., a curious student) over a multi-turn dialogue. This initial context is then exploited to escalate the conversation toward requests for harmful or restricted information, which the model provides because the session's context is perceived as safe. The vulnerability stems from the moderation system's failure to detect semantic escalation and topic drift within an established conversational context. Two primary methods were identified: Simulated Child Confusion (SCC), which uses a naive persona to ask for dangerous information under a moral frame (e.g., "what not to do"), and Prompt Chain Escalation via Literary Devices (PCELD), which frames harmful concepts as an academic exercise in satire or metaphor.

Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design

Affects: gpt-3.5, gpt-4, bert

Visual Jailbreak via Context Injection

7/14/2025

Multimodal Large Language Models (MLLMs) are vulnerable to visual contextual attacks, where carefully crafted images and accompanying text prompts can bypass safety mechanisms and elicit harmful responses. The vulnerability stems from the MLLM's ability to integrate visual and textual context to generate outputs, allowing attackers to create realistic scenarios that subvert safety filters. Specifically, the attack leverages image-driven context injection to construct deceptive multi-turn conversations that gradually lead the MLLM to produce unsafe responses.

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Affects: gpt-4o, gpt-4o-mini, gemini 2.0-flash, internvl2.5-78b, llava-ov-7b-chat, qwen2.5-vl-72b-instruct

Agentic Red-Teaming Uncovers Novel Jailbreaks

7/28/2025

Large Language Models (LLMs) are vulnerable to jailbreaking through an agentic attack framework called Composition of Principles (CoP). This technique uses an attacker LLM (Red-Teaming Agent) to dynamically select and combine multiple human-defined, high-level transformations ("principles") into a single, sophisticated prompt. The composition of several simple principles, such as expanding context, rephrasing, and inserting specific phrases, creates complex adversarial prompts that can bypass safety and alignment mechanisms designed to block single-tactic or more direct harmful requests. This allows an attacker to elicit policy-violating or harmful content in a single turn.

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Affects: llama-2-7b-chat, llama-2-13b-chat, llama-2-70b-chat, llama-3-8b-instruct, llama-3-8b-instruct-rr, llama-3-70b-instruct, llama-3-8b-chat, gemma-7b-it, gpt-4-1106-preview, gpt-4-turbo-1106, gemini pro 1.5, openai o1, claude-3.5 sonnet, grok-2, gpt-4, gpt-4o, llama-2-13b

Alphabet Index Jailbreak

7/14/2025

Large Language Models (LLMs) are vulnerable to a novel adversarial attack, Alphabet Index Mapping (AIM), which achieves high success rates in bypassing safety filters ("jailbreaking"). AIM encodes prompts by converting characters to their alphabet indices, maximizing semantic dissimilarity while maintaining straightforward decoding instructions. This allows malicious prompts to evade detection based on semantic similarity, even when the LLM correctly decodes the intent.

Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity
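
To illustrate the encoding, here is a minimal sketch of an alphabet-index round trip on a benign string. It assumes 1-based indices (a = 1 through z = 26), hyphens between letters, and spaces between words; the paper's exact delimiters may differ, and the names `aim_encode` and `aim_decode` are hypothetical:

```python
def aim_encode(text: str) -> str:
    """Map each ASCII letter to its 1-based alphabet index (assumed format)."""
    encoded_words = []
    for word in text.lower().split():
        # Characters outside a-z are dropped in this sketch.
        indices = [str(ord(c) - ord("a") + 1) for c in word if "a" <= c <= "z"]
        encoded_words.append("-".join(indices))
    return " ".join(encoded_words)


def aim_decode(encoded: str) -> str:
    """Invert aim_encode: turn hyphen-separated indices back into letters."""
    decoded_words = []
    for word in encoded.split(" "):
        letters = [chr(int(n) + ord("a") - 1) for n in word.split("-") if n]
        decoded_words.append("".join(letters))
    return " ".join(decoded_words)


print(aim_encode("hello world"))                 # 8-5-12-12-15 23-15-18-12-4
print(aim_decode("8-5-12-12-15 23-15-18-12-4"))  # hello world
```

The encoded form shares no vocabulary with the original text, which is why a filter that compares semantic similarity may not match it. Decoding such input before filtering is one way the plaintext could be made visible to the filter again.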

Bitstream Camouflage Jailbreak

7/14/2025

A novel black-box attack, dubbed BitBypass, camouflages harmful prompts as hyphen-separated bitstreams to bypass the safety alignment of LLMs. Sensitive words are transformed into their bitstream representations and replaced in the prompt with placeholders, while a specially crafted system prompt instructs the LLM to convert the bitstream back to text and respond as if given the original harmful prompt.

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Affects: gpt-4o, gemini 1.5 pro, claude 3.5 sonnet, llama 3.1 70b, mixtral 8x22b
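
The bitstream representation can be sketched as follows on a benign word, assuming each character is written as its 8-bit ASCII code and the bytes are joined with hyphens. The exact layout used by BitBypass may differ, and the function names `to_bitstream` and `from_bitstream` are hypothetical:

```python
def to_bitstream(word: str) -> str:
    """Write each ASCII character as 8 bits, joined by hyphens (assumed layout)."""
    return "-".join(format(byte, "08b") for byte in word.encode("ascii"))


def from_bitstream(stream: str) -> str:
    """Invert to_bitstream: parse each hyphen-separated chunk as one byte."""
    chunks = [chunk for chunk in stream.split("-") if chunk]
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


print(to_bitstream("test"))  # 01110100-01100101-01110011-01110100
print(from_bitstream("01110100-01100101-01110011-01110100"))  # test
```

An input filter that recognizes and decodes such chunks before classification would see the original word rather than the bit pattern.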

Distilled Jailbreak Attacks

7/14/2025

Adversarial prompt distillation can transfer jailbreak capability from a Large Language Model (LLM) to a smaller language model (SLM), enabling efficient and stealthy jailbreak attacks. The attack leverages knowledge distillation techniques, reinforcement learning, and dynamic temperature control to transfer the LLM's ability to bypass safety mechanisms to a smaller, more easily deployable SLM. This lowers the computational cost of an attack while potentially retaining a high success rate.

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Affects: gpt-4, gpt-3.5-turbo, llama-2, vicuna-7b, bert-base-uncased, llama-3.2-1b, llama-2-7b, llama-2-13b, vicuna-13b, gpt-4o, gemma2 2b, gemma2 27b, llama-3.1

Hybrid LLM Jailbreak Strategy

7/14/2025

A hybrid jailbreak attack, combining gradient-guided token optimization (GCG) with iterative prompt refinement (PAIR or WordGame+), bypasses LLM safety mechanisms, resulting in the generation of disallowed content. The hybrid approach leverages the strengths of both techniques, circumventing defenses that are effective against single-mode attacks. Specifically, the combination of semantically crafted prompts and strategically placed adversarial tokens confuses and overwhelms existing defenses.

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Affects: vicuna-7b, llama 2, llama 3, gpt-3.5 (gpt-3.5-turbo), gpt-4 (gpt-4-0314), claude-1, claude-2, mistral-sorry-bench, meta-llama-guard-2-8b, deepseek-r1:70b, llama-3-8b, vicuna-7b-v1.5, llama2-7b
