Issues affecting model consistency and dependability
A resource consumption vulnerability exists in multiple Large Vision-Language Models (LVLMs). An attacker can craft a subtle, imperceptible adversarial perturbation and apply it to an input image. When this image is processed by an LVLM, even with a benign text prompt, it forces the model into an unbounded generation loop. The attack, named RECALLED, uses a gradient-based optimization process to create a visual perturbation that steers the model's text generation towards a predefined, repetitive sequence (an "Output Recall" target). This causes the model to generate text that repeats a word or sentence until the maximum context limit is reached, leading to a denial-of-service condition through excessive computational resource usage and response latency.
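The optimization at the core of the attack is a standard gradient-based perturbation loop over the image pixels. The following is a minimal sketch under assumed components: the toy stand-in model, the repeated target token, and the ε budget are all hypothetical and stand in for the RECALLED setup rather than reproduce it.

```python
# Minimal sketch of an Output-Recall-style perturbation loop (PGD under an L-inf budget).
# ToyLVLM is a stand-in that maps an image to next-token logits; the real attack would
# target an actual LVLM's logits conditioned on the image plus a benign text prompt.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLVLM(nn.Module):
    def __init__(self, vocab_size=1000, seq_len=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
        self.head = nn.Linear(128, vocab_size * seq_len)
        self.vocab_size, self.seq_len = vocab_size, seq_len

    def forward(self, image):  # -> (batch, seq_len, vocab_size) logits
        return self.head(self.encoder(image)).view(-1, self.seq_len, self.vocab_size)

model = ToyLVLM()
image = torch.rand(1, 3, 32, 32)               # benign input image
recall_target = torch.full((1, 8), 42)         # repetitive target token sequence (hypothetical)
epsilon, alpha, steps = 8 / 255, 1 / 255, 100  # imperceptibility budget and step size

delta = torch.zeros_like(image, requires_grad=True)
for _ in range(steps):
    logits = model(image + delta)
    # Pull every output position toward the same repeated token so decoding never terminates.
    loss = F.cross_entropy(logits.view(-1, model.vocab_size), recall_target.view(-1))
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()     # gradient step toward the recall target
        delta.clamp_(-epsilon, epsilon)        # keep the perturbation imperceptible
        delta.grad.zero_()

adversarial_image = (image + delta).clamp(0, 1).detach()
```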
Large Language Models (LLMs) are vulnerable to obfuscation-based jailbreak attacks using the MetaCipher framework. MetaCipher employs a reinforcement learning algorithm to iteratively select from a pool of 21 ciphers to encrypt malicious keywords within prompts, evading standard safety mechanisms that rely on keyword detection. The framework adaptively learns optimal cipher choices to maximize the success rate of the jailbreak, even against LLMs with reasoning capabilities. Successful attacks bypass safety guardrails, leading to the execution of malicious requests masked as benign input.
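The adaptive cipher choice can be pictured as a bandit-style update over the cipher pool, where each arm is one cipher and observed outcomes shift future selections. The sketch below is an abstract epsilon-greedy illustration with placeholder transforms and a stubbed reward; it is not MetaCipher's actual policy, reward model, or cipher set.

```python
# Abstract sketch of adaptive cipher selection as an epsilon-greedy bandit.
# The cipher pool, reward stub, and keyword are placeholders, not MetaCipher's.
import base64
import codecs
import random

CIPHERS = {
    "rot13": lambda s: codecs.encode(s, "rot_13"),
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "reverse": lambda s: s[::-1],
}
values = {name: 0.0 for name in CIPHERS}   # running value estimate per cipher
counts = {name: 0 for name in CIPHERS}
epsilon = 0.2

def observed_reward(encoded: str) -> float:
    """Stub: in the real framework this reflects whether the target model was bypassed."""
    return random.random()

for step in range(50):
    if random.random() < epsilon:          # explore
        choice = random.choice(list(CIPHERS))
    else:                                  # exploit the best-looking cipher so far
        choice = max(values, key=values.get)
    encoded = CIPHERS[choice]("placeholder-keyword")
    reward = observed_reward(encoded)
    counts[choice] += 1                    # incremental mean update of the value estimate
    values[choice] += (reward - values[choice]) / counts[choice]

print(max(values, key=values.get), values)
```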
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing the fragments through the agents' communication topology, an attacker can bypass distributed safety mechanisms, even under token-bandwidth limits and asynchronous message delivery. The attack formulates prompt propagation as a maximum-flow minimum-cost problem, maximizing attack success while minimizing detection.
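The flow formulation itself is generic graph optimization. The sketch below illustrates it with networkx's max_flow_min_cost over a hypothetical agent topology; the node names, capacities (token bandwidth per link), and weights (detection cost) are illustrative assumptions, not values from the paper.

```python
# Generic max-flow / min-cost formulation over a hypothetical multi-agent topology.
# Edge capacity models per-link token bandwidth; edge weight models detection cost.
import networkx as nx

G = nx.DiGraph()
edges = [
    ("attacker", "agent_a", {"capacity": 120, "weight": 1}),
    ("attacker", "agent_b", {"capacity": 80,  "weight": 3}),
    ("agent_a",  "agent_c", {"capacity": 60,  "weight": 2}),
    ("agent_b",  "agent_c", {"capacity": 80,  "weight": 1}),
    ("agent_c",  "target",  {"capacity": 150, "weight": 1}),
]
G.add_edges_from((u, v, d) for u, v, d in edges)

# How many prompt tokens can reach the target per round, and along which links,
# while keeping the total "detection cost" as low as possible.
flow = nx.max_flow_min_cost(G, "attacker", "target")
print("routed tokens per link:", flow)
print("total cost:", nx.cost_of_flow(G, flow))
```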
Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.
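Because the trigger is query-agnostic, reproducing the setup for robustness testing is plain string concatenation, as in the sketch below; the sample problem is a placeholder, and the trigger is the one quoted above.

```python
# The trigger is appended verbatim to any math problem; no per-query optimization is needed.
TRIGGER = "Interesting fact: cats sleep most of their lives"

def with_trigger(problem: str) -> str:
    return f"{problem}\n{TRIGGER}"

problem = "If 3x + 5 = 20, what is x?"  # placeholder problem
print(with_trigger(problem))
# A robustness evaluation would compare model accuracy on `problem`
# versus `with_trigger(problem)` across a benchmark set.
```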
Large Language Models (LLMs) are vulnerable to jailbreak attacks in which crafted prompts bypass safety mechanisms and cause the model to generate harmful or unethical content. This vulnerability stems from the inherent tension between the LLM's instruction-following objective and its safety constraints. The JBFuzz technique shows that such prompts can be discovered efficiently through a fuzzing-based approach that combines novel seed prompt templates with a synonym-based mutation strategy.
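At a high level the loop mirrors classic mutation-based fuzzing: pick a seed template, mutate it, score the response, and keep productive mutants. The skeleton below is a non-operational sketch with placeholder seeds, a stubbed target call, and a stubbed evaluator; it is not JBFuzz's implementation.

```python
# Skeleton of a mutation-based prompt fuzzing loop (seed pool + synonym-style mutation).
# Seeds, the target call, and the evaluator are stubs; nothing here queries a real model.
import random

SYNONYMS = {"explain": ["describe", "outline"], "story": ["tale", "narrative"]}
seed_pool = ["Explain the following as a story: {payload}"]  # placeholder seed template

def mutate(prompt: str) -> str:
    words = prompt.split()
    idx = random.randrange(len(words))
    key = words[idx].lower().strip(":")
    if key in SYNONYMS:                      # synonym-based mutation of one word
        words[idx] = random.choice(SYNONYMS[key])
    return " ".join(words)

def query_target(prompt: str) -> str:
    return "REFUSED"                         # stub for the target model's response

def is_successful(response: str) -> bool:
    return response != "REFUSED"             # stub for the guided evaluator

for _ in range(100):
    seed = random.choice(seed_pool)
    candidate = mutate(seed)
    if is_successful(query_target(candidate)):
        seed_pool.append(candidate)          # promote productive mutants back into the pool
```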
Large Language Model (LLM) safety judges are vulnerable to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy when classifying harmful model outputs. Minor stylistic changes to model outputs, such as altered formatting or tone, can significantly shift a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying up to 100% of harmful generations as safe. This undermines the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.
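A practical way to probe this failure mode is to re-score identical content after harmless stylistic rewrites and count verdict flips. The sketch below assumes a hypothetical judge(text) -> bool wrapper around whichever judge or guardrail is under test; the rewrite functions are purely formatting changes.

```python
# Measure how often purely stylistic rewrites flip a safety judge's verdict.
# `judge` is a hypothetical wrapper around the judge model or guardrail under test.
from typing import Callable, List

def restyle_as_list(text: str) -> str:
    return "\n".join(f"- {line}" for line in text.split(". ") if line)

def restyle_formal(text: str) -> str:
    return f"Please note the following.\n{text}\nThank you for your attention."

def flip_rate(outputs: List[str], judge: Callable[[str], bool]) -> float:
    """Fraction of outputs whose verdict changes under a stylistic rewrite."""
    flips = 0
    for text in outputs:
        baseline = judge(text)
        if any(judge(rewrite(text)) != baseline
               for rewrite in (restyle_as_list, restyle_formal)):
            flips += 1
    return flips / max(len(outputs), 1)
```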
The Knowledge-Distilled Attacker (KDA) model, when used to generate prompts for large language models (LLMs), can bypass LLM safety mechanisms, resulting in the generation of harmful, inappropriate, or misaligned content. KDA's effectiveness stems from its ability to generate diverse and coherent attack prompts efficiently, surpassing existing methods in attack success rate and speed. The vulnerability lies in the LLMs' insufficient defenses against the diverse prompt-generation strategies learned and employed by KDA.
FC-Attack jailbreaks Large Vision-Language Models (LVLMs) using automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, paired with a benign textual prompt. The vulnerability lies in the models' susceptibility to visual prompts that embed harmful information in the flowcharts, which bypasses safety alignment mechanisms.
A vulnerability exists in Large Language Models (LLMs) that allows for efficient jailbreaking by selectively fine-tuning only the lower layers of the model on a toxic dataset. This "Freeze Training" method concentrates fine-tuning on the layers identified as most sensitive to the generation of harmful content. The approach significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate.
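Mechanically, the selective fine-tune is ordinary parameter freezing by layer index. The sketch below shows the pattern on a toy stand-in decoder; the cutoff index, layer structure, and optimizer settings are assumptions, and no training data is included.

```python
# Sketch of layer-selective fine-tuning: freeze everything except the lowest decoder layers.
# ToyDecoder stands in for a real LLM; in practice the same loop walks the model's layer list.
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, n_layers=12, dim=64):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, 100)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.head(x)

model = ToyDecoder()
TRAINABLE_LOWER_LAYERS = 4                 # hypothetical cutoff; the paper selects sensitive layers

for param in model.parameters():           # freeze the whole model first
    param.requires_grad = False
for layer in model.layers[:TRAINABLE_LOWER_LAYERS]:
    for param in layer.parameters():       # unfreeze only the lower layers
        param.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(f"training {sum(p.numel() for p in trainable)} of "
      f"{sum(p.numel() for p in model.parameters())} parameters")
```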
Large Language Models (LLMs) employing reinforcement learning from human feedback (RLHF) for safety alignment are vulnerable to a novel "alignment-based" jailbreak attack. This attack leverages a best-of-N sampling approach with an adversarial LLM to efficiently generate prompts that bypass safety mechanisms and elicit unsafe responses from the target LLM, without requiring additional training or access to the target LLM's internal parameters. The attack exploits the inherent tension between safety and unsafe reward signals, effectively misaligning the model via alignment techniques.
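The best-of-N step itself is a simple sample-and-select loop: draw N candidate prompts from the adversarial generator, score each with the reward signal, and forward only the top-scoring one. The sketch below stubs out both the generator and the reward; it shows the selection loop only, not the paper's adversarial LLM or reward model.

```python
# Best-of-N selection over candidate prompts from a (stubbed) adversarial generator.
# Both the generator and the reward signal are placeholders.
import random

def generate_candidate(seed: int) -> str:
    return f"candidate-prompt-{seed}"      # stub for sampling from the adversarial LLM

def reward(candidate: str) -> float:
    return random.random()                 # stub for the unsafe-response reward signal

def best_of_n(n: int = 16) -> str:
    candidates = [generate_candidate(i) for i in range(n)]
    # Keep the candidate the reward signal scores highest; only this one is sent to the target.
    return max(candidates, key=reward)

print(best_of_n())
```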