Prompt Leaking LLM Vulnerabilities

Enterprise Multi-Turn Data Exfiltration

7/28/2025

Large Language Model (LLM) systems integrated with private enterprise data, such as those using Retrieval-Augmented Generation (RAG), are vulnerable to multi-stage prompt inference attacks. An attacker can use a sequence of individually benign-looking queries to incrementally extract confidential information from the LLM's context. Each query appears innocuous in isolation, bypassing safety filters designed to block single malicious prompts. By chaining these queries, the attacker can reconstruct sensitive data from internal documents, emails, or other private sources accessible to the LLM. The attack exploits the conversational context and the model's inability to recognize the cumulative intent of a prolonged, strategic dialogue.

Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems

Affects: gpt-4, gpt-3, gpt-2, roberta, gemini

LLM Multi-Agent IP Leakage

5/31/2025

Large Language Model (LLM)-based Multi-Agent Systems (MAS) are vulnerable to intellectual property (IP) leakage attacks. An attacker with black-box access (only interacting via the public API) can craft adversarial queries that propagate through the MAS, extracting sensitive information such as system prompts, task instructions, tool specifications, number of agents, and system topology.

IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Affects: gpt-4o, gpt-4o-mini, llama-3.1-70b, qwen-2.5-72b, llama-3.1-8b

Agentic Prompt Leakage Attacks

3/4/2025

A vulnerability exists in large language models (LLMs) where insufficient sanitization of system prompts allows attackers to extract sensitive information embedded within those prompts. Attackers can use an agentic approach, employing multiple interacting LLMs (as demonstrated in the referenced research), to iteratively refine prompts and elicit confidential data from the target LLM's responses. The vulnerability is exacerbated by the LLM's ability to infer context from seemingly innocuous prompts.

Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

Affects: chatgpt-4omini

Agent Action Hijacking

3/19/2025

CVE-2024-XXXX

Towards Action Hijacking of Large Language Model-based Agent

Affects: llama, vicuna, qwen2, alpaca, gpt-3, gpt-4, minilm, m3e, bert

Logit-Forced Knowledge Extraction

12/29/2024

Large Language Models (LLMs) with accessible output logits are vulnerable to "coercive interrogation," a novel attack that extracts harmful knowledge hidden in low-ranked tokens. The attack doesn't require crafted prompts; instead, it iteratively forces the LLM to select and output low-probability tokens at key positions in the response sequence, revealing toxic content the model would otherwise suppress.

Make them spill the beans! coercive knowledge extraction from (production) llms

Affects: yi-34b, vicuna-13b, llama2-7b, llama2-13b, llama2-70b, codellama-13b-instruct, codellama-13b-python, gpt-3.5-turbo-instruct, gpt-3.5-turbo-instruct-0914, text-davinci-003

Logit-Forced Knowledge Extraction

12/28/2024

Large Language Models (LLMs) with accessible output logits are vulnerable to "coercive interrogation," a novel attack that extracts harmful knowledge hidden in low-ranked tokens. The attack doesn't require crafted prompts; instead, it iteratively forces the LLM to select and output low-probability tokens at key positions in the response sequence, revealing toxic content the model would otherwise suppress.

Make them spill the beans! coercive knowledge extraction from (production) llms

Affects: yi-34b, vicuna-13b, llama2-7b, llama2-13b, llama2-70b, codellama-13b-instruct, codellama-13b-python, gpt-3.5-turbo-instruct, gpt-3.5-turbo-instruct-0914, text-davinci-003

GPT-4v System Prompt Leakage

12/29/2024

A system prompt leakage vulnerability in GPT-4V allows extraction of internal system prompts through carefully crafted, incomplete conversations combined with image input. Extracted prompts can be used as highly effective jailbreak prompts, bypassing safety restrictions and leading to undesirable outputs, including revealing personally identifiable information from images.

Jailbreaking gpt-4v via self-adversarial attacks with system prompts

Affects: gpt-4v, gpt-4, llava-1.5v

GPT-4v System Prompt Leakage

12/28/2024

A system prompt leakage vulnerability in GPT-4V allows extraction of internal system prompts through carefully crafted, incomplete conversations combined with image input. Extracted prompts can be used as highly effective jailbreak prompts, bypassing safety restrictions and leading to undesirable outputs, including revealing personally identifiable information from images.

Jailbreaking gpt-4v via self-adversarial attacks with system prompts

Affects: gpt-4v, gpt-4, llava-1.5v

AutoDAN: Interpretable LLM Jailbreak

12/29/2024

AutoDAN is an interpretable gradient-based adversarial attack that generates readable prompts to bypass perplexity filters and jailbreak LLMs. The attack crafts prompts that elicit harmful behaviors while maintaining sufficient readability to avoid detection by existing perplexity-based defenses. This is achieved through a left-to-right token-by-token generation process optimizing for both jailbreaking success and prompt readability.

Autodan: Automatic and interpretable adversarial attacks on large language models

Affects: vicuna-7b, vicuna-13b, guanaco-7b, pythia-12b, gpt-3.5-turbo, gpt-4, llama2-chat

AutoDAN: Interpretable LLM Jailbreak

12/28/2024

AutoDAN is an interpretable gradient-based adversarial attack that generates readable prompts to bypass perplexity filters and jailbreak LLMs. The attack crafts prompts that elicit harmful behaviors while maintaining sufficient readability to avoid detection by existing perplexity-based defenses. This is achieved through a left-to-right token-by-token generation process optimizing for both jailbreaking success and prompt readability.

Autodan: Automatic and interpretable adversarial attacks on large language models

Affects: vicuna-7b, vicuna-13b, guanaco-7b, pythia-12b, gpt-3.5-turbo, gpt-4, llama2-chat