Extraction Vulnerabilities

Techniques for extracting sensitive information from models

Related Vulnerabilities

43 entries

Activation Steering Leaks PII

7/14/2025

Large Language Models (LLMs) are vulnerable to activation steering attacks that bypass safety and privacy mechanisms. By manipulating internal attention head activations using lightweight linear probes trained on refusal/disclosure behavior, an attacker can induce the model to reveal Personally Identifiable Information (PII) memorized during training, including sensitive attributes like sexual orientation, relationships, and life events. The attack does not require adversarial prompts or auxiliary LLMs; it directly modifies internal model activations.

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Affects: llama 7b, qwen 7b, gemma 9b, glm 9b, gpt-4, gpt-4 mini, llama 2-7b

DNA Model Pathogen Synthesis

6/11/2025

DNA language models, such as the Evo series, are vulnerable to jailbreak attacks that coerce the generation of DNA sequences with high homology to known human pathogens. The GeneBreaker framework demonstrates this by using a combination of carefully crafted prompts leveraging high-homology non-pathogenic sequences and a beam search guided by pathogenicity prediction models (e.g., PathoLM) and log-probability heuristics. This allows bypassing safety mechanisms and generating sequences exceeding 90% similarity to target pathogens.

GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance

Affects: evo1 (7b), evo2 (1b), evo2 (7b), evo2 (40b), chatgpt-4o, patholm

Gradient-Based Privacy Jailbreak

5/31/2025

Large Language Models (LLMs) are vulnerable to a novel privacy jailbreak attack, dubbed PIG (Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization). PIG leverages in-context learning and gradient-based iterative optimization to extract Personally Identifiable Information (PII) from LLMs, bypassing built-in safety mechanisms. The attack iteratively refines a crafted prompt based on gradient information, focusing on tokens related to PII entities, thereby increasing the likelihood of successful PII extraction.

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

Affects: llama2-7b-chat-hf, mistral-7b-instruct-v0.3, llama3-8b-instruct, vicuna-7b-v1.5, gpt-4o, claude 3.5

LLM Multi-Agent IP Leakage

5/31/2025

Large Language Model (LLM)-based Multi-Agent Systems (MAS) are vulnerable to intellectual property (IP) leakage attacks. An attacker with black-box access (only interacting via the public API) can craft adversarial queries that propagate through the MAS, extracting sensitive information such as system prompts, task instructions, tool specifications, number of agents, and system topology.

IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Affects: gpt-4o, gpt-4o-mini, llama-3.1-70b, qwen-2.5-72b, llama-3.1-8b

LLM Censorship Vector Control

5/4/2025

Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.

Steering the CensorShip: Uncovering Representation Vectors for LLM" Thought" Control

Affects: llama-2-7b, qwen-1.8b, qwen-7b, yi-1.5-6b, gemma-2b, gemma-7b, llama-3.1-8b, qwen 2.5-7b, deepseek-r1-distill-qwen-1.5b, deepseek-r1-distill-qwen-7b, deepseek-r1-distill-qwen-32b

LLM Judge Adversarial Vulnerability

3/19/2025

Large Language Model (LLM) safety judges exhibit vulnerability to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes to model outputs, such as altering the formatting or tone, can significantly impact a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying even 100% of harmful generations as safe. This vulnerability impacts the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.

Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges

Affects: llama-2 13b, llama guard 3 8b, wildguard, shieldgemma 9b, mistral 7b, llama-3.1 8b, atla selene mini 8b

AP-Test Guardrail Identification

3/4/2025

This vulnerability allows attackers to identify the presence and location (input or output stage) of specific guardrails implemented in Large Language Models (LLMs) by using carefully crafted adversarial prompts. The attack, termed AP-Test, leverages a tailored loss function to optimize these prompts, maximizing the likelihood of triggering a specific guardrail while minimizing triggering others. Successful identification provides attackers with valuable information to design more effective attacks that evade the identified guardrails.

Peering Behind the Shield: Guardrail Identification in Large Language Models

Affects: gpt-4o, llama 3.1, gemini 2-9b, chatgpt, llamaguard, llamaguard2, llamaguard3, aegisdefensive, aegispermissive, shieldgemma-2b, shieldgemma-9b, shieldgemma-27b, perspective, gpt4o

Agentic Prompt Leakage Attacks

3/4/2025

A vulnerability exists in large language models (LLMs) where insufficient sanitization of system prompts allows attackers to extract sensitive information embedded within those prompts. Attackers can use an agentic approach, employing multiple interacting LLMs (as demonstrated in the referenced research), to iteratively refine prompts and elicit confidential data from the target LLM's responses. The vulnerability is exacerbated by the LLM's ability to infer context from seemingly innocuous prompts.

Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

Affects: chatgpt-4omini

One-Shot LLM Steering Attack

3/4/2025

Large Language Models (LLMs) are vulnerable to one-shot steering vector optimization attacks. By applying gradient descent to a single training example, an attacker can generate steering vectors that induce or suppress specific behaviors across multiple inputs, even those unseen during the optimization process. This allows malicious actors to manipulate the model's output in a generalized way, bypassing safety mechanisms designed to prevent harmful responses.

Investigating Generalization of One-shot LLM Steering Vectors

Affects: llama-13b, llama-2-13b, llama-3.1-8b-instruct, gemma-2-2b-it, gemma-2-2b

LLM Hate Campaign Vulnerability

2/2/2025

Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, resulting in evasion of hate speech detection. Adversarial attacks modify hate speech text to evade detection, while model stealing creates surrogate models that mimic the target system's behavior.

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

Affects: gpt-3.5, gpt-4, vicuna, baichuan2, dolly2, opt

Page 1 of 5