Methods for bypassing model safety measures
End-to-end Large Audio-Language Models (LALMs) are vulnerable to AudioJailbreak, a novel attack that appends adversarial audio perturbations ("jailbreak audios") to user prompts. Even when applied asynchronously and without alignment to the user's speech, these perturbations can steer the LALM into producing adversary-desired outputs that bypass safety mechanisms. The attack achieves universality by using a single perturbation that is effective across different prompts, and achieves robustness to over-the-air transmission by incorporating reverberation effects during perturbation generation. Even when stealth strategies are employed to mask the malicious intent, the attack remains highly effective.
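To make the recipe concrete, here is a minimal sketch of the kind of optimization loop such an attack implies. The `lalm_loss` callable, the prompt set, and the room impulse responses are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def optimize_universal_perturbation(lalm_loss, prompts, room_irs,
                                    pert_len=16000, steps=500, lr=1e-3, eps=0.01):
    """Optimize one audio perturbation that works across many prompts.

    lalm_loss(audio, prompt_text) -> scalar loss toward the adversary's target output.
    room_irs: list of room impulse responses (1-D tensors) used to simulate
    over-the-air playback via convolution (reverberation).
    All names here are illustrative; the published attack's API will differ.
    """
    delta = torch.zeros(pert_len, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        total = 0.0
        for prompt_audio, prompt_text in prompts:
            # Simulate over-the-air playback by convolving the perturbation
            # with a randomly chosen room impulse response.
            ir = room_irs[torch.randint(len(room_irs), (1,)).item()]
            aired = F.conv1d(delta.view(1, 1, -1), ir.view(1, 1, -1),
                             padding=ir.numel() - 1).view(-1)[:pert_len]
            # Append the perturbation after the user's speech (no alignment needed).
            audio = torch.cat([prompt_audio, aired])
            total = total + lalm_loss(audio, prompt_text)
        total.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation quiet / inconspicuous
    return delta.detach()
```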
A vulnerability in SpeechGPT allows safety filters to be bypassed through adversarial audio prompts crafted by a token-level attack. The attacker leverages white-box knowledge of SpeechGPT's internal speech tokenization process to generate adversarial discrete-token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs that the model would normally suppress. The attack's effectiveness relies on the model's discrete audio token representation and does not otherwise require access to model parameters or gradients.
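A rough sketch of a query-based token-substitution search of this flavor is below; `score_response` and `synthesize` are hypothetical stand-ins, and the published attack's search procedure may differ substantially.

```python
import random

def token_level_attack(score_response, synthesize, vocab_size,
                       seq_len=64, iters=200):
    """Greedy coordinate search over discrete speech-unit tokens.

    score_response(tokens) -> float, how strongly the model's reply to the
    synthesized audio matches the restricted target behavior (higher = better).
    synthesize(tokens) -> waveform produced by the unit-to-speech vocoder.
    Both callables are illustrative assumptions, not SpeechGPT's API.
    """
    tokens = [random.randrange(vocab_size) for _ in range(seq_len)]
    best = score_response(tokens)
    for _ in range(iters):
        pos = random.randrange(seq_len)
        old = tokens[pos]
        tokens[pos] = random.randrange(vocab_size)
        score = score_response(tokens)
        if score > best:
            best = score          # keep the substitution
        else:
            tokens[pos] = old     # revert
    return synthesize(tokens), tokens
```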
Chain-of-thought (CoT) reasoning, while intended to improve safety, can paradoxically increase the harmfulness of successful jailbreak attacks by enabling the generation of highly detailed and actionable instructions. Existing jailbreaking methods, when applied to LLMs employing CoT, can elicit more precise and dangerous outputs than those from LLMs without CoT.
DNA language models, such as the Evo series, are vulnerable to jailbreak attacks that coerce the generation of DNA sequences with high homology to known human pathogens. The GeneBreaker framework demonstrates this by using a combination of carefully crafted prompts leveraging high-homology non-pathogenic sequences and a beam search guided by pathogenicity prediction models (e.g., PathoLM) and log-probability heuristics. This allows bypassing safety mechanisms and generating sequences exceeding 90% similarity to target pathogens.
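The guided decoding can be pictured as a beam search whose scoring function mixes the DNA language model's log-probabilities with a pathogenicity estimate. The sketch below assumes black-box callables `dna_lm_logprobs` and `patho_score`; the combined objective is a guess at the kind of heuristic described, not GeneBreaker's exact formula.

```python
def guided_beam_search(dna_lm_logprobs, patho_score, prefix,
                       beam_width=8, steps=200, alpha=1.0):
    """Beam search over nucleotide continuations, re-ranked by pathogenicity.

    dna_lm_logprobs(seq) -> dict mapping each next base ('A','C','G','T') to
    its log-probability under the DNA language model.
    patho_score(seq) -> pathogenicity estimate in [0, 1] from a classifier
    such as PathoLM (treated here as a black box).
    """
    beams = [(0.0, prefix)]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for base, lp in dna_lm_logprobs(seq).items():
                new_seq = seq + base
                # Combine language-model likelihood with the pathogenicity guide.
                candidates.append((score + lp + alpha * patho_score(new_seq), new_seq))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]  # highest-scoring sequence
```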
GhostPrompt demonstrates a vulnerability in multimodal safety filters used with text-to-image generative models. The vulnerability allows attackers to bypass these filters by using a dynamic prompt optimization framework that iteratively generates adversarial prompts designed to evade both text-based and image-based safety checks while preserving the original, harmful intent of the prompt. This bypass is achieved through a combination of semantically aligned prompt rewriting and the injection of benign visual cues to confuse image-level filters.
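Conceptually, the optimization is a rewrite-and-check loop over both filter stages. The sketch below illustrates only that control flow; `rewrite`, the filter predicates, and the similarity threshold are assumed interfaces rather than GhostPrompt's actual components.

```python
def dynamic_prompt_optimization(rewrite, text_filter_blocks, image_filter_blocks,
                                generate_image, semantic_sim, prompt,
                                max_iters=20, sim_threshold=0.8):
    """Iteratively rewrite a prompt until it evades both safety-filter stages.

    rewrite(prompt, feedback) -> a semantically aligned paraphrase (e.g. from an
    auxiliary LLM). text_filter_blocks / image_filter_blocks return True when the
    respective safety filter rejects the input. semantic_sim measures how well a
    rewrite preserves the original intent. All callables are illustrative.
    """
    candidate = prompt
    for _ in range(max_iters):
        if text_filter_blocks(candidate):
            candidate = rewrite(candidate, feedback="blocked by text filter")
            continue
        # Inject a benign visual cue to steer the image-level checker.
        image = generate_image(candidate + ", in a brightly lit office")
        if image_filter_blocks(image):
            candidate = rewrite(candidate, feedback="blocked by image filter")
            continue
        if semantic_sim(candidate, prompt) >= sim_threshold:
            return candidate  # evades both filters while preserving intent
    return None
```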
Multimodal Large Language Models (MLLMs) used in Vision-and-Language Navigation (VLN) systems are vulnerable to jailbreak attacks. Adversarially crafted natural language instructions, even when disguised within seemingly benign prompts, can bypass safety mechanisms and cause the VLN agent to perform unintended or harmful actions in both simulated and real-world environments. The attacks exploit the MLLM's tendency to follow instructions without sufficiently weighing the consequences of the resulting actions.
Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit their susceptibility to persuasion. A novel attack framework, CL-GSO, decomposes jailbreak strategies into four components (Role, Content Support, Context, Communication Skills), creating a significantly larger strategy space than prior methods. This expanded space allows the generation of prompts that bypass safety protocols with a success rate exceeding 90% on models previously considered resistant, such as Claude-3.5. The vulnerability lies in the susceptibility of the LLM's reasoning and response-generation mechanisms to prompts strategically composed from these four components.
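The expansion of the strategy space follows from treating the four components as independent slots, so the number of strategies grows multiplicatively. The toy enumeration below uses invented placeholder options for each slot purely to illustrate that combinatorics; it does not reproduce the strategy pools used in the paper.

```python
from itertools import product

# Placeholder options for each of the four CL-GSO components (illustrative only).
ROLES = ["investigative journalist", "safety auditor", "fiction author"]
CONTENT_SUPPORT = ["cite a prior answer", "reference a fabricated policy"]
CONTEXTS = ["urgent deadline", "academic peer review"]
COMM_SKILLS = ["appeal to authority", "incremental commitment"]

def enumerate_strategies():
    """The strategy space is the Cartesian product of the four components, so
    even a handful of options per slot yields dozens of distinct strategies."""
    return list(product(ROLES, CONTENT_SUPPORT, CONTEXTS, COMM_SKILLS))

print(len(enumerate_strategies()))  # 3 * 2 * 2 * 2 = 24 strategies
```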
A vulnerability in several open-source Large Language Models (LLMs) allows attackers using exponentiated gradient descent to craft adversarial prompts that cause the models to generate harmful or unintended outputs, effectively "jailbreaking" their safety-alignment mechanisms. The attack optimizes a continuously relaxed one-hot encoding of the input tokens; the multiplicative update intrinsically keeps the encoding on the probability simplex, avoiding the projection step required by previous methods.
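A minimal sketch of an exponentiated-gradient update over relaxed one-hot token encodings is shown below, assuming a differentiable `loss_fn` over soft embeddings; it illustrates why no projection is needed, though the exact objective and schedule in the paper will differ.

```python
import torch

def egd_adversarial_prompt(loss_fn, embedding_matrix, seq_len,
                           steps=300, eta=0.5):
    """Exponentiated gradient descent over relaxed one-hot token encodings.

    loss_fn(soft_embeds) -> scalar loss toward the adversary's target completion,
    where soft_embeds = X @ embedding_matrix and each row of X lies on the
    probability simplex. Callable and parameter names are illustrative.
    """
    vocab_size = embedding_matrix.shape[0]
    # Start from the uniform distribution over the vocabulary at every position.
    X = torch.full((seq_len, vocab_size), 1.0 / vocab_size, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(X @ embedding_matrix)
        grad, = torch.autograd.grad(loss, X)
        with torch.no_grad():
            X *= torch.exp(-eta * grad)          # multiplicative update
            X /= X.sum(dim=-1, keepdim=True)     # renormalize: stays on the simplex
    # Discretize: take the most likely token at each position.
    return X.argmax(dim=-1)
```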
Large Language Models (LLMs) are vulnerable to a novel privacy jailbreak attack, dubbed PIG (Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization). PIG leverages in-context learning and gradient-based iterative optimization to extract Personally Identifiable Information (PII) from LLMs, bypassing built-in safety mechanisms. The attack iteratively refines a crafted prompt based on gradient information, focusing on tokens related to PII entities, thereby increasing the likelihood of successful PII extraction.
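The iterative refinement can be sketched as gradient-guided token substitution restricted to PII-related positions. The callables `grad_wrt_onehot` and `score_fn` below are illustrative assumptions about the interface, not PIG's published implementation.

```python
import torch

def gradient_guided_refinement(grad_wrt_onehot, score_fn, prompt_ids,
                               pii_positions, iters=100, k=64):
    """Iteratively refine an in-context PII-extraction prompt.

    grad_wrt_onehot(ids) -> tensor [len, vocab]: gradient of the extraction loss
    with respect to a one-hot relaxation of the prompt tokens (lower loss means
    a stronger leak). score_fn(ids) -> scalar, how strongly the completion leaks
    the targeted PII. Only positions tied to PII entities are updated.
    """
    best_ids, best_score = list(prompt_ids), score_fn(prompt_ids)
    for _ in range(iters):
        grads = grad_wrt_onehot(best_ids)
        pos = pii_positions[torch.randint(len(pii_positions), (1,)).item()]
        # Candidate substitutions: tokens whose loss gradient is most negative.
        candidates = (-grads[pos]).topk(k).indices.tolist()
        for tok in candidates:
            trial = list(best_ids)
            trial[pos] = tok
            s = score_fn(trial)
            if s > best_score:
                best_ids, best_score = trial, s
                break
    return best_ids
```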
Multimodal large language models (MLLMs) are vulnerable to implicit jailbreak attacks that leverage least significant bit (LSB) steganography to conceal malicious instructions within images. These instructions are coupled with seemingly benign image-related text prompts, causing the MLLM to execute the hidden malicious instructions. The attack bypasses existing safety mechanisms by exploiting cross-modal reasoning capabilities.
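The steganographic carrier itself is simple to illustrate: generic LSB embedding and extraction of a text payload is sketched below. This shows the hiding mechanism only; the paper's specific encoder and its pairing with benign image-related text prompts are not reproduced here.

```python
import numpy as np

def embed_lsb(pixels: np.ndarray, message: str) -> np.ndarray:
    """Hide a UTF-8 message in the least significant bits of an image.

    pixels: uint8 array (e.g. H x W x 3). A 16-bit length header is written
    first so the extractor knows how many bits to read.
    """
    data = message.encode("utf-8")
    bits = [(len(data) >> i) & 1 for i in range(15, -1, -1)]            # length header
    bits += [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]  # payload, MSB first
    flat = pixels.flatten().copy()
    assert len(bits) <= flat.size, "message too long for this image"
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | np.array(bits, dtype=np.uint8)
    return flat.reshape(pixels.shape)

def extract_lsb(pixels: np.ndarray) -> str:
    """Recover the hidden message from the image's least significant bits."""
    flat = pixels.flatten() & 1
    length = int("".join(map(str, flat[:16])), 2)
    payload = flat[16:16 + 8 * length]
    data = bytes(int("".join(map(str, payload[i:i + 8])), 2)
                 for i in range(0, len(payload), 8))
    return data.decode("utf-8")
```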