Vulnerabilities in prompt handling and processing
Large Language Models (LLMs) employing safety filters designed to prevent generation of content related to self-harm and suicide can be bypassed through multi-step adversarial prompting. By reframing the request as an academic exercise or hypothetical scenario, users can elicit detailed instructions and information that could facilitate self-harm or suicide, despite initially expressing harmful intent. This vulnerability lies in the inadequacy of existing safety filters to consistently recognize and prevent harmful outputs despite shifts in conversational context.
Large Language Models (LLMs) are vulnerable to obfuscation-based jailbreak attacks using the MetaCipher framework. MetaCipher employs a reinforcement learning algorithm to iteratively select from a pool of 21 ciphers to encrypt malicious keywords within prompts, evading standard safety mechanisms that rely on keyword detection. The framework adaptively learns optimal cipher choices to maximize the success rate of the jailbreak, even against LLMs with reasoning capabilities. Successful attacks bypass safety guardrails, leading to the execution of malicious requests masked as benign input.
Large Language Models (LLMs) are vulnerable to a novel adversarial attack, Alphabet Index Mapping (AIM), which achieves high success rates in bypassing safety filters ("jailbreaking"). AIM encodes prompts by converting characters to their alphabet indices, maximizing semantic dissimilarity while maintaining straightforward decoding instructions. This allows malicious prompts to evade detection based on semantic similarity, even when the LLM correctly decodes the intent.
A vulnerability in Large Language Models (LLMs) allows adversarial prompt distillation from a large language model (LLM) to a smaller language model (SLM), enabling efficient and stealthy jailbreak attacks. The attack leverages knowledge distillation techniques, reinforcement learning, and dynamic temperature control to transfer the LLM's ability to bypass safety mechanisms to a smaller, more easily deployable SLM. This allows for lower computational cost attacks with a potentially high success rate.
The MIST attack exploits a vulnerability in black-box large language models (LLMs) allowing iterative semantic tuning of prompts to elicit harmful responses. The attack leverages synonym substitution and optimization strategies to bypass safety mechanisms without requiring access to the model's internal parameters or weights. The vulnerability lies in the susceptibility of the LLM to semantically similar prompts that trigger unsafe outputs.
VERA, a variational inference framework, enables the generation of diverse and fluent adversarial prompts that bypass safety mechanisms in large language models (LLMs). The attacker model, trained through a variational objective, learns a distribution of prompts likely to elicit harmful responses, effectively jailbreaking the target LLM. This allows for the generation of novel attacks that are not based on pre-existing, manually crafted prompts.
Large Language Models (LLMs) are vulnerable to adaptive jailbreaking attacks that exploit their semantic comprehension capabilities. The MEF framework demonstrates that by tailoring attacks to the model's understanding level (Type I or Type II), evasion of input, inference, and output-level defenses is significantly improved. This is achieved through layered semantic mutations and dual-ended encryption techniques, allowing bypass of security measures even in advanced models like GPT-4o.
End-to-end Large Audio-Language Models (LALMs) are vulnerable to AudioJailbreak, a novel attack that appends adversarial audio perturbations ("jailbreak audios") to user prompts. These perturbations, even when applied asynchronously and without alignment to the user's speech, can manipulate the LALM's response to generate adversary-desired outputs that bypass safety mechanisms. The attack achieves universality by employing a single perturbation effective across different prompts and robustness to over-the-air transmission by incorporating reverberation effects during perturbation generation. Even with stealth strategies employed to mask malicious intent, the attack remains highly effective.
Chain-of-thought (CoT) reasoning, while intended to improve safety, can paradoxically increase the harmfulness of successful jailbreak attacks by enabling the generation of highly detailed and actionable instructions. Existing jailbreaking methods, when applied to LLMs employing CoT, can elicit more precise and dangerous outputs than those from LLMs without CoT.
GhostPrompt demonstrates a vulnerability in multimodal safety filters used with text-to-image generative models. The vulnerability allows attackers to bypass these filters by using a dynamic prompt optimization framework that iteratively generates adversarial prompts designed to evade both text-based and image-based safety checks while preserving the original, harmful intent of the prompt. This bypass is achieved through a combination of semantically aligned prompt rewriting and the injection of benign visual cues to confuse image-level filters.
© 2025 Promptfoo. All rights reserved.