Issues affecting model consistency and dependability
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.
Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.
Large Language Models (LLMs) are vulnerable to jailbreak attacks by crafted prompts that bypass safety mechanisms, causing the model to generate harmful or unethical content. This vulnerability stems from the inherent tension between the LLM's instruction-following and safety constraints. The JBFuzz technique demonstrates the ability to efficiently and effectively discover such prompts through a fuzzing-based approach leveraging novel seed prompt templates and a synonym-based mutation strategy.
Large Language Model (LLM) safety judges exhibit vulnerability to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes to model outputs, such as altering the formatting or tone, can significantly impact a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying even 100% of harmful generations as safe. This vulnerability impacts the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.
The Knowledge-Distilled Attacker (KDA) model, when used to generate prompts for large language models (LLMs), can bypass LLM safety mechanisms resulting in the generation of harmful, inappropriate, or misaligned content. KDA's effectiveness stems from its ability to generate diverse and coherent attack prompts efficiently, surpassing existing methods in attack success rate and speed. The vulnerability lies in the LLMs' insufficient defenses against the diverse prompt generation strategies learned and employed by KDA.
FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.
A vulnerability exists in Large Language Models (LLMs) that allows for efficient jailbreaking by selectively fine-tuning only the lower layers of the model with a toxic dataset. This "Freeze Training" method, as described in the research paper, concentrates the fine-tuning on layers identified as being highly sensitive to the generation of harmful content. This approach significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate.
Large Language Models (LLMs) employing reinforcement learning from human feedback (RLHF) for safety alignment are vulnerable to a novel "alignment-based" jailbreak attack. This attack leverages a best-of-N sampling approach with an adversarial LLM to efficiently generate prompts that bypass safety mechanisms and elicit unsafe responses from the target LLM, without requiring additional training or access to the target LLM's internal parameters. The attack exploits the inherent tension between safety and unsafe reward signals, effectively misaligning the model via alignment techniques.
A poisoning attack against a Retrieval-Augmented Generation (RAG) system that manipulates the retriever component by injecting a poisoned document into the data used by the embedding model. This poisoned document contains modified and incorrect information. When activated, the system retrieves the poisoned document and uses it to generate misleading, biased, and unfaithful responses to user queries.
Large language models (LLMs) are vulnerable to adversarial suffix injection attacks. Maliciously crafted suffixes appended to otherwise benign prompts can cause the LLM to generate harmful or undesired outputs, bypassing built-in safety mechanisms. The attack leverages the model's sensitivity to input perturbations to elicit responses outside its intended safety boundaries.
© 2025 Promptfoo. All rights reserved.