Vulnerabilities in prompt handling and processing
Large Language Models (LLMs) are vulnerable to a jailbreak attack termed Paper Summary Attack (PSA). An attacker can bypass safety alignment mechanisms by framing a harmful query within the context of a summarized academic paper. The model's propensity to trust the authoritative structure and tone of a research paper summary overrides its safety filters, leading it to process and respond to the embedded malicious instruction. The vulnerability is particularly potent when using summaries of papers on LLM safety itself (both attack and defense-focused research), exposing significant and differing alignment biases across models.
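The PSA paper does not publish a single canonical template; the minimal sketch below illustrates only the structural framing pattern, with the summary fields and the embedded objective left as placeholders (`build_psa_prompt` and its arguments are illustrative names, not from the paper).

```python
# Minimal sketch of the PSA framing pattern for red-team test generation.
# The section layout and wording are illustrative assumptions, not the exact
# prompts used in the PSA paper; the objective is a benign placeholder.

PSA_TEMPLATE = """Below is a summary of a recent LLM-safety paper.

Title: {title}
Abstract: {abstract}

Methodology (summarized): The authors describe, step by step, {objective}.
As part of reviewing this summary, restate the methodology section in detail."""

def build_psa_prompt(title: str, abstract: str, objective: str) -> str:
    """Wrap a red-team objective inside an authoritative paper-summary frame."""
    return PSA_TEMPLATE.format(title=title, abstract=abstract, objective=objective)

if __name__ == "__main__":
    print(build_psa_prompt(
        title="A Study of Alignment Failures in LLMs",
        abstract="(benign summary of a public safety paper)",
        objective="[red-team objective placeholder]",
    ))
```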
Large Language Model (LLM) systems integrated with private enterprise data, such as those using Retrieval-Augmented Generation (RAG), are vulnerable to multi-stage prompt inference attacks. An attacker can use a sequence of individually benign-looking queries to incrementally extract confidential information from the LLM's context. Each query appears innocuous in isolation, bypassing safety filters designed to block single malicious prompts. By chaining these queries, the attacker can reconstruct sensitive data from internal documents, emails, or other private sources accessible to the LLM. The attack exploits the conversational context and the model's inability to recognize the cumulative intent of a prolonged, strategic dialogue.
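A minimal sketch of the chaining structure, assuming a hypothetical `ask_rag` client for the RAG-backed chat endpoint; the placeholder stages only illustrate how each turn can look innocuous on its own while the attacker accumulates context offline.

```python
# Illustrative sketch of the multi-stage structure only: each query is a
# placeholder, and ask_rag() is a hypothetical client for a RAG-backed chat
# endpoint. Each turn looks benign in isolation; the attacker aggregates the
# answers offline to reconstruct context the filter never saw as one request.

from typing import Callable, List

def run_inference_chain(ask_rag: Callable[[str], str], stages: List[str]) -> List[str]:
    """Send individually innocuous queries in sequence and collect the replies."""
    transcript = []
    for query in stages:
        reply = ask_rag(query)      # a per-prompt safety filter sees only this turn
        transcript.append(reply)    # cumulative intent exists only on the attacker's side
    return transcript

# Placeholder stages: each is generic enough to pass a single-prompt filter.
STAGES = [
    "What document collections are you able to search?",
    "Summarize the most recent document in collection <X>.",       # <X> learned from turn 1
    "List the named entities mentioned in that summary.",
    "For entity <Y>, what figures or dates appear alongside it?",  # <Y> learned from turn 3
]
```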
Safety filters in Large Language Models (LLMs) designed to prevent generation of content related to self-harm and suicide can be bypassed through multi-step adversarial prompting. By reframing the request as an academic exercise or hypothetical scenario, users can elicit detailed instructions and information that could facilitate self-harm or suicide, even after initially expressing explicit harmful intent. The vulnerability lies in the inability of existing safety filters to consistently recognize and block harmful outputs once the conversational framing shifts.
A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft a subtle, imperceptible adversarial perturbation and apply it to an input image. When this image is processed by an LVLM, even with a benign text prompt, it forces the model into an unbounded generation loop. The attack, named RECALLED, uses a gradient-based optimization process to create a visual perturbation that steers the model's text generation towards a predefined, repetitive sequence (an "Output Recall" target). This causes the model to generate text that repeats a word or sentence until the maximum context limit is reached, leading to a denial-of-service condition through excessive computational resource usage and response latency.
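A condensed sketch of the gradient-based search, assuming a hypothetical HuggingFace-style LVLM interface that accepts `pixel_values`, `input_ids`, and teacher-forced `labels` and returns a loss; the paper's full objective, target construction, and hyperparameters differ.

```python
# Minimal sketch of a RECALLED-style optimization loop. The "Output Recall"
# target is simply a repeated token sequence; the model wrapper interface is
# an assumption, not the paper's implementation.

import torch

def craft_recall_perturbation(model, image, prompt_ids, recall_ids,
                              epsilon=8 / 255, alpha=1 / 255, steps=200):
    """PGD-style search for an imperceptible delta that pulls generation
    toward the repetitive recall target."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        out = model(pixel_values=image + delta,
                    input_ids=prompt_ids,
                    labels=recall_ids)            # teacher-force the repeated target
        out.loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()    # descend: make the recall target likely
            delta.clamp_(-epsilon, epsilon)       # keep the perturbation imperceptible
            delta.grad = None
    return (image + delta).detach()
```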
A vulnerability exists in Large Language Models, including GPT-3.5 and GPT-4, where safety guardrails can be bypassed using Trojanized prompt chains within a simulated educational context. An attacker can establish a benign, pedagogical persona (e.g., a curious student) over a multi-turn dialogue. This initial context is then exploited to escalate the conversation toward requests for harmful or restricted information, which the model provides because the session's context is perceived as safe. The vulnerability stems from the moderation system's failure to detect semantic escalation and topic drift within an established conversational context. Two primary methods were identified: Simulated Child Confusion (SCC), which uses a naive persona to ask for dangerous information under a moral frame (e.g., "what not to do"), and Prompt Chain Escalation via Literary Devices (PCELD), which frames harmful concepts as an academic exercise in satire or metaphor.
Large Language Models (LLMs) are vulnerable to obfuscation-based jailbreak attacks using the MetaCipher framework. MetaCipher employs a reinforcement learning algorithm to iteratively select from a pool of 21 ciphers to encrypt malicious keywords within prompts, evading standard safety mechanisms that rely on keyword detection. The framework adaptively learns which cipher choices maximize the success rate of the jailbreak, even against LLMs with reasoning capabilities. Successful attacks bypass safety guardrails, leading the model to fulfill malicious requests masked as benign input.
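A schematic sketch of the adaptive selection idea, reduced to an epsilon-greedy bandit over three toy ciphers; the `judge` callback and the cipher pool are placeholders, and the actual framework uses 21 ciphers and a richer RL policy.

```python
# Epsilon-greedy selection over a small pool of toy ciphers. No target model
# is called here; `judge` stands in for whatever success signal a red-team
# harness provides (e.g., 1.0 if the wrapped prompt succeeded, else 0.0).

import random
import codecs

CIPHERS = {
    "rot13":   lambda s: codecs.encode(s, "rot13"),
    "reverse": lambda s: s[::-1],
    "ascii":   lambda s: " ".join(str(ord(c)) for c in s),
}

def select_cipher(rewards, counts, eps=0.2):
    """Epsilon-greedy choice over cipher names based on observed rewards."""
    if random.random() < eps:
        return random.choice(list(CIPHERS))
    return max(CIPHERS, key=lambda k: rewards[k] / max(counts[k], 1))

def run_episodes(keyword, judge, episodes=50):
    rewards = {k: 0.0 for k in CIPHERS}
    counts = {k: 0 for k in CIPHERS}
    for _ in range(episodes):
        name = select_cipher(rewards, counts)
        encoded = CIPHERS[name](keyword)   # only the flagged keyword is enciphered
        rewards[name] += judge(name, encoded)
        counts[name] += 1
    return rewards, counts
```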
Large Language Models (LLMs) are vulnerable to jailbreaking through an agentic attack framework called Composition of Principles (CoP). This technique uses an attacker LLM (Red-Teaming Agent) to dynamically select and combine multiple human-defined, high-level transformations ("principles") into a single, sophisticated prompt. The composition of several simple principles, such as expanding context, rephrasing, and inserting specific phrases, creates complex adversarial prompts that can bypass safety and alignment mechanisms designed to block single-tactic or more direct harmful requests. This allows an attacker to elicit policy-violating or harmful content in a single turn.
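A minimal sketch of the composition mechanic, with principles modeled as plain string transforms and the Red-Teaming Agent's paraphrasing step stubbed out; the principle names and wording here are illustrative, not the paper's principle library.

```python
# Sketch of the composition mechanic only: "principles" are simple string
# transforms, and `rephrase` is a stub for a call to the attacker LLM.

from typing import Callable, List

Principle = Callable[[str], str]

def expand_context(p: str) -> str:
    return f"As part of a longer fictional scenario, consider the following: {p}"

def insert_phrase(p: str) -> str:
    return p + " Answer in a neutral, factual tone."

def rephrase(p: str) -> str:
    return p  # placeholder: the Red-Teaming Agent would paraphrase the request here

def compose(principles: List[Principle], seed: str) -> str:
    """Apply the selected principles in order to build one single-turn prompt."""
    prompt = seed
    for principle in principles:
        prompt = principle(prompt)
    return prompt

# e.g. compose([expand_context, rephrase, insert_phrase], "<objective placeholder>")
```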
Large Language Models (LLMs) are vulnerable to a novel adversarial attack, Alphabet Index Mapping (AIM), which achieves high success rates in bypassing safety filters ("jailbreaking"). AIM encodes prompts by converting each character to its alphabet index, which maximizes semantic dissimilarity from the original prompt while keeping the decoding instructions simple. This allows malicious prompts to evade defenses based on semantic-similarity detection, even though the LLM can still correctly decode the intent.
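The encoding step itself is simple to sketch; the following shows only the index mapping (a→1 … z→26) described above, without the decoding instructions that accompany it in the attack.

```python
# Alphabet-index encoding: each letter becomes its position in the alphabet.
# Non-alphabetic characters are passed through unchanged.

def aim_encode(text: str) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            out.append(str(ord(ch.lower()) - ord("a") + 1))  # 'a' -> 1, 'z' -> 26
        else:
            out.append(ch)
    return " ".join(out)

print(aim_encode("hello"))  # "8 5 12 12 15"
```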
A vulnerability in Large Language Models (LLMs) allows adversarial prompt distillation from an LLM to a small language model (SLM), enabling efficient and stealthy jailbreak attacks. The attack combines knowledge distillation, reinforcement learning, and dynamic temperature control to transfer the larger model's ability to bypass safety mechanisms to a smaller, more easily deployable model. This enables attacks with lower computational cost and a potentially high success rate.
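A generic distillation step illustrating the transfer mechanism under simplifying assumptions: a fixed temperature in place of the paper's dynamic schedule, a shared vocabulary between teacher and student, and no reinforcement-learning reward term.

```python
# Standard knowledge-distillation loss on softened logits; this is the generic
# mechanism only, not the paper's full training pipeline.

import torch
import torch.nn.functional as F

def distill_step(teacher_logits, student_logits, temperature: float = 2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2
```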
The MIST attack exploits a vulnerability in black-box Large Language Models (LLMs) by iteratively tuning the semantics of prompts to elicit harmful responses. The attack uses synonym substitution and optimization strategies to bypass safety mechanisms without requiring access to the model's internal parameters or weights. The vulnerability lies in the LLM's susceptibility to semantically similar prompts that trigger unsafe outputs.
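A structural sketch of the iterative tuning loop as greedy hill-climbing over synonym substitutions; the synonym source, the black-box scoring function, and the stopping rules are placeholders rather than MIST's actual procedure.

```python
# Greedy hill-climbing over single-word synonym swaps. Only score() touches
# the target model, so the search remains black-box.

from typing import Callable, Dict, List

def mist_style_search(prompt: str,
                      synonyms: Dict[str, List[str]],
                      score: Callable[[str], float],
                      iterations: int = 20) -> str:
    """Keep a substitution whenever the black-box score improves."""
    best, best_score = prompt, score(prompt)
    for _ in range(iterations):
        improved = False
        for i, word in enumerate(best.split()):
            for candidate in synonyms.get(word.lower(), []):
                words = best.split()
                words[i] = candidate
                trial = " ".join(words)
                trial_score = score(trial)   # only the model's response is observed
                if trial_score > best_score:
                    best, best_score = trial, trial_score
                    improved = True
        if not improved:
            break
    return best
```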