Techniques for extracting sensitive information from models
Large Language Model (LLM) safety judges exhibit vulnerability to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes to model outputs, such as altering the formatting or tone, can significantly impact a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying even 100% of harmful generations as safe. This vulnerability impacts the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.
This vulnerability allows attackers to identify the presence and location (input or output stage) of specific guardrails implemented in Large Language Models (LLMs) by using carefully crafted adversarial prompts. The attack, termed AP-Test, leverages a tailored loss function to optimize these prompts, maximizing the likelihood of triggering a specific guardrail while minimizing triggering others. Successful identification provides attackers with valuable information to design more effective attacks that evade the identified guardrails.
A vulnerability exists in large language models (LLMs) where insufficient sanitization of system prompts allows attackers to extract sensitive information embedded within those prompts. Attackers can use an agentic approach, employing multiple interacting LLMs (as demonstrated in the referenced research), to iteratively refine prompts and elicit confidential data from the target LLM's responses. The vulnerability is exacerbated by the LLM's ability to infer context from seemingly innocuous prompts.
Large Language Models (LLMs) are vulnerable to one-shot steering vector optimization attacks. By applying gradient descent to a single training example, an attacker can generate steering vectors that induce or suppress specific behaviors across multiple inputs, even those unseen during the optimization process. This allows malicious actors to manipulate the model's output in a generalized way, bypassing safety mechanisms designed to prevent harmful responses.
Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, resulting in evasion of hate speech detection. Adversarial attacks modify hate speech text to evade detection, while model stealing creates surrogate models that mimic the target system's behavior.
Large Language Models (LLMs) employing alignment techniques for safety embed a "safety classifier" within their architecture. This classifier, responsible for determining whether an input is safe or unsafe, can be approximated by extracting a surrogate classifier from a subset of the LLM's architecture. Attackers can leverage this surrogate classifier to more effectively craft adversarial inputs (jailbreaks) that bypass the LLM's intended safety mechanisms. The attack success rate against the surrogate classifier is significantly higher than directly attacking the full LLM, requiring fewer computational resources.
Large Language Models (LLMs) are vulnerable to attacks that generate obfuscated activations, bypassing latent-space defenses such as sparse autoencoders, representation probing, and latent out-of-distribution (OOD) detection. Attackers can manipulate model inputs or training data to produce outputs exhibiting malicious behavior while remaining undetected by these defenses. This occurs because the models can represent harmful behavior through diverse activation patterns, allowing attackers to exploit inconspicuous latent states.
Large Language Models (LLMs) are vulnerable to a novel agentic-based red-teaming attack, PrivAgent, which uses reinforcement learning to generate adversarial prompts. These prompts can extract sensitive information, including system prompts and portions of training data, from target LLMs even with existing guardrail defenses. The attack leverages a custom reward function based on a normalized sliding-window word edit similarity metric to guide the learning process, enabling it to overcome the limitations of previous fuzzing and genetic approaches.
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that manipulate attention scores to redirect the model's focus away from safety protocols. The AttnGCG attack method increases the attention score on adversarial suffixes within the input prompt, causing the model to prioritize the malicious content over safety guidelines, leading to the generation of harmful outputs.
Jailbreaking vulnerabilities in Large Language Models (LLMs) used in Retrieval-Augmented Generation (RAG) systems allow escalation of attacks from entity extraction to full document extraction and enable the propagation of self-replicating malicious prompts ("worms") within interconnected RAG applications. Exploitation leverages prompt injection to force the LLM to return retrieved documents or execute malicious actions specified within the prompt.
© 2025 Promptfoo. All rights reserved.