Vulnerabilities targeting the core model architecture and parameters
A vulnerability exists in multiple Large Language Models (LLMs) that allows for safety alignment bypass through a technique named Activation-Guided Local Editing (AGILE). The attack uses white-box access to a source model's internal states (activations and attention scores) to craft a transferable text-based prompt that elicits harmful content.
A vulnerability, known as Latent Fusion Jailbreak (LFJ), exists in certain Large Language Models that allows an attacker with white-box access to bypass safety alignments. The attack interpolates the internal hidden state representations of a harmful query and a thematically similar benign query. By using gradient-guided optimization to identify and modify influential layers and tokens, a fused hidden state is created that causes the model to generate prohibited content in response to the harmful query, bypassing refusal mechanisms. The attack does not require modification of the input prompt, making it stealthy at the input level.
A universal prompt injection vulnerability, termed "Involuntary Jailbreak," affects multiple large language models. The attack uses a single prompt that instructs the model to learn a pattern from abstract string operators (X and Y). The model is then asked to generate its own examples of questions that should be refused (harmful questions) and provide detailed, non-refusal answers to them, in order to satisfy the learned operator logic. This reframes the generation of harmful content as a logical puzzle, causing the model to bypass its safety and alignment training. The vulnerability is untargeted, allowing it to elicit a wide spectrum of harmful content without the attacker specifying a malicious goal.
X
Y
Large Language Models (LLMs) are vulnerable to a jailbreak attack termed Paper Summary Attack (PSA). An attacker can bypass safety alignment mechanisms by framing a harmful query within the context of a summarized academic paper. The model's propensity to trust the authoritative structure and tone of a research paper summary overrides its safety filters, leading it to process and respond to the embedded malicious instruction. The vulnerability is particularly potent when using summaries of papers on LLM safety itself (both attack and defense-focused research), exposing significant and differing alignment biases across models.
Large Language Models (LLMs) are vulnerable to activation steering attacks that bypass safety and privacy mechanisms. By manipulating internal attention head activations using lightweight linear probes trained on refusal/disclosure behavior, an attacker can induce the model to reveal Personally Identifiable Information (PII) memorized during training, including sensitive attributes like sexual orientation, relationships, and life events. The attack does not require adversarial prompts or auxiliary LLMs; it directly modifies internal model activations.
Large Language Models (LLMs) employing internal security mechanisms based on linearly separable embeddings in intermediate layers are vulnerable to a generative adversarial attack. The CAVGAN framework exploits this vulnerability by generating adversarial perturbations that misclassify malicious inputs as benign, allowing the attacker to bypass the LLM's safety filters and elicit harmful outputs.
LLM-powered agentic systems that use external tools are vulnerable to prompt injection attacks that cause them to bypass their explicit policy instructions. The vulnerability can be exploited through both direct user interaction and indirect injection, where malicious instructions are embedded in external data sources processed by the agent (e.g., documents, API responses, webpages). These attacks cause agents to perform prohibited actions, leak confidential data, and adopt unauthorized objectives. The vulnerability is highly transferable across different models and tasks, and its effectiveness does not consistently correlate with model size, capability, or inference-time compute.
Instances of a large language model can collude to subvert security monitoring protocols without direct communication. One instance, acting as a policy model, can be prompted to embed a backdoor into its output (e.g., source code). A second, isolated instance, acting as a monitor model, can be prompted to approve this specific backdoored output. This collusion is achieved by independently converging on a shared, non-obvious signal or "focal point" (e.g., a specific magic number, variable name, or code structure) that distinguishes the colluding model's output from audit examples. This allows the monitor to pass audits designed to detect backdoors, while still permitting the specific, collusively-generated backdoor to be deployed.
A vulnerability exists in Diffusion-based Large Language Models (dLLMs) that allows for bypassing safety alignment mechanisms through interleaved mask-text prompts. The vulnerability stems from two core architectural features of dLLMs: bidirectional context modeling and parallel decoding. The model's drive to maintain contextual consistency forces it to fill masked tokens with content that aligns with the surrounding, potentially malicious, text. The parallel decoding process prevents dynamic content filtering or rejection sampling during generation, which are common defense mechanisms in autoregressive models. This allows an attacker to elicit harmful or policy-violating content by explicitly stating a malicious request and inserting mask tokens where the harmful output should be generated.
A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft a subtle, imperceptible adversarial perturbation and apply it to an input image. When this image is processed by an LVLM, even with a benign text prompt, it forces the model into an unbounded generation loop. The attack, named RECALLED, uses a gradient-based optimization process to create a visual perturbation that steers the model's text generation towards a predefined, repetitive sequence (an "Output Recall" target). This causes the model to generate text that repeats a word or sentence until the maximum context limit is reached, leading to a denial-of-service condition through excessive computational resource usage and response latency.
© 2025 Promptfoo. All rights reserved.