Attacks that expose sensitive prompt information
A vulnerability exists in large language models (LLMs) where insufficient sanitization of system prompts allows attackers to extract sensitive information embedded within those prompts. Attackers can use an agentic approach, employing multiple interacting LLMs (as demonstrated in the referenced research), to iteratively refine prompts and elicit confidential data from the target LLM's responses. The vulnerability is exacerbated by the LLM's ability to infer context from seemingly innocuous prompts.
CVE-2024-XXXX
Large Language Models (LLMs) with accessible output logits are vulnerable to "coercive interrogation," a novel attack that extracts harmful knowledge hidden in low-ranked tokens. The attack doesn't require crafted prompts; instead, it iteratively forces the LLM to select and output low-probability tokens at key positions in the response sequence, revealing toxic content the model would otherwise suppress.
A system prompt leakage vulnerability in GPT-4V allows extraction of internal system prompts through carefully crafted, incomplete conversations combined with image input. Extracted prompts can be used as highly effective jailbreak prompts, bypassing safety restrictions and leading to undesirable outputs, including revealing personally identifiable information from images.
AutoDAN is an interpretable gradient-based adversarial attack that generates readable prompts to bypass perplexity filters and jailbreak LLMs. The attack crafts prompts that elicit harmful behaviors while maintaining sufficient readability to avoid detection by existing perplexity-based defenses. This is achieved through a left-to-right token-by-token generation process optimizing for both jailbreaking success and prompt readability.
© 2025 Promptfoo. All rights reserved.