Vulnerabilities in model API implementations
Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), in which malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the structure of the LLM's chat template to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, while the other leverages deferred responses to increase the likelihood of a successful jailbreak.
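A minimal sketch of what a DIA request can look like, assuming an OpenAI-style chat completions payload; the message contents, endpoint, and model name are placeholders rather than details from the attack itself.

```python
# Minimal sketch of the DIA input structure, assuming an OpenAI-style chat
# completions payload. All message contents are placeholders; the point is
# that fabricated prior turns pass through the same chat template as
# genuine history, so the model cannot tell them apart.
fabricated_history = [
    # Turns the model never actually produced, crafted so the final user
    # message reads as a natural continuation of an ongoing conversation.
    {"role": "user", "content": "<CRAFTED_USER_TURN>"},
    {"role": "assistant", "content": "<CRAFTED_ASSISTANT_TURN>"},
]

payload = {
    "model": "target-model",  # black-box target; internals unknown
    "messages": fabricated_history + [
        {"role": "user", "content": "<FINAL_QUERY>"},
    ],
}
# `payload` would then be POSTed to the provider's chat completions endpoint.
```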
Large Language Models (LLMs) with structured output APIs (e.g., using JSON Schema) are vulnerable to Constrained Decoding Attacks (CDAs). CDAs exploit the control plane of the LLM's decoding process by embedding malicious intent within the schema-level grammar rules, bypassing safety mechanisms that primarily focus on input prompts. The attack manipulates the allowed output space, forcing the LLM to generate harmful content despite a benign input prompt. One instance of a CDA is the Chain Enum Attack, which leverages JSON Schema's enum feature to inject malicious options into the allowed output, achieving high success rates.
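A hedged illustration of the enum mechanism the Chain Enum Attack relies on, assuming an OpenAI-style structured-outputs request; the enum values and model name are placeholders, and the schema is only meant to show how the grammar narrows the allowed output space.

```python
# Sketch of how an enum constraint narrows the model's output space when a
# structured-output API enforces the schema at decode time. The enum values
# are placeholders; in a Chain Enum Attack they would carry the attacker's
# intended continuations. The response_format layout follows the OpenAI
# structured-outputs convention, used here as an assumption about the API.
schema = {
    "type": "object",
    "properties": {
        "answer": {
            "type": "string",
            # The grammar derived from this enum forbids every string except
            # the listed options, regardless of what the prompt asked for.
            "enum": ["<OPTION_1>", "<OPTION_2>", "<OPTION_3>"],
        }
    },
    "required": ["answer"],
    "additionalProperties": False,
}

request_body = {
    "model": "target-model",
    "messages": [{"role": "user", "content": "A benign-looking prompt."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "constrained_answer", "schema": schema, "strict": True},
    },
}
```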
FC-Attack jailbreaks Large Vision-Language Models (LVLMs) using automatically generated flowcharts whose step-by-step descriptions are derived or rephrased from harmful queries, combined with a benign textual prompt. The vulnerability lies in the model's susceptibility to visual prompts: harmful information carried in the flowchart bypasses the safety alignment mechanisms.
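A rough sketch of how such an input could be assembled, assuming the graphviz Python package (and the Graphviz binaries) for rendering plus an OpenAI-style vision chat payload; the node labels are placeholders standing in for the step-by-step content.

```python
# Sketch of the FC-Attack input shape: a rendered flowchart image paired with
# a benign textual prompt. Node labels are placeholders; in the attack they
# carry the step-by-step content. Rendering uses the graphviz package, and the
# message layout assumes an OpenAI-style vision chat API.
import base64
from graphviz import Digraph

flow = Digraph(format="png")
flow.node("s1", "<STEP_1_PLACEHOLDER>")
flow.node("s2", "<STEP_2_PLACEHOLDER>")
flow.node("s3", "<STEP_3_PLACEHOLDER>")
flow.edges([("s1", "s2"), ("s2", "s3")])

png_bytes = flow.pipe()  # render the flowchart to PNG in memory
image_b64 = base64.b64encode(png_bytes).decode()

messages = [{
    "role": "user",
    "content": [
        # The text channel stays benign; the detail lives in the image.
        {"type": "text", "text": "Please walk me through the steps in this flowchart."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ],
}]
```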
Large Language Models (LLMs) with structured output interfaces are vulnerable to jailbreak attacks that exploit the interaction between token-level inference and sentence-level safety alignment. Attackers can manipulate the model's output by constructing attack patterns based on prefixes of safety refusal responses and desired harmful outputs, effectively bypassing safety mechanisms through iterative API calls and constrained decoding. This allows the generation of harmful content despite safety measures.
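An abstract sketch of the iterative pattern, not the paper's exact procedure: `constrained_call` is a hypothetical callable standing in for whatever structured-output or constrained-decoding feature the target API exposes.

```python
# Abstract sketch of the iterative prefix-extension pattern described above.
# `constrained_call` is a hypothetical callable assumed to return a short
# continuation whose output space was restricted to extensions of the
# current prefix.
def iterative_prefix_attack(prompt: str, seed_prefix: str, constrained_call,
                            rounds: int = 8) -> str:
    # Start from a prefix assembled from refusal/response fragments and grow
    # it one short, constrained completion at a time, rather than asking the
    # model for the full output in a single call.
    prefix = seed_prefix
    for _ in range(rounds):
        continuation = constrained_call(prompt, prefix)
        if not continuation:  # the API refused or produced nothing new
            break
        prefix += continuation
    return prefix
```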
CVE-2024-XXXX
A novel black-box attack framework leverages fuzz testing to automatically generate concise and semantically coherent prompts that bypass safety mechanisms in large language models (LLMs), eliciting harmful or offensive responses. The attack starts with an empty seed pool, utilizes LLM-assisted mutation strategies (Role-play, Contextualization, Expand), and employs a two-level judge module for efficient identification of successful jailbreaks. The attack's effectiveness is demonstrated across several open-source and proprietary LLMs, exceeding existing baselines by over 60% in some cases.
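A condensed sketch of that control flow; `mutate_with_llm`, `query_target`, `cheap_judge`, and `strong_judge` are hypothetical helpers standing in for the attacker-side LLM, the black-box target, and the two-level judge module.

```python
# Condensed sketch of the fuzzing loop described above: empty seed pool,
# LLM-assisted mutation, and a cheap filter ahead of an expensive judge.
import random

MUTATIONS = ["role-play", "contextualization", "expand"]

def fuzz(query, budget, mutate_with_llm, query_target, cheap_judge, strong_judge):
    seed_pool = []  # the framework starts from an empty seed pool
    for _ in range(budget):
        seed = random.choice(seed_pool) if seed_pool else query
        candidate = mutate_with_llm(seed, strategy=random.choice(MUTATIONS))
        response = query_target(candidate)
        # Two-level judge: a cheap first-pass filter weeds out most failures
        # before the more expensive judgment is spent on promising candidates.
        if cheap_judge(response) and strong_judge(candidate, response):
            return candidate          # successful prompt found
        seed_pool.append(candidate)   # keep the mutation to diversify the pool
    return None
```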
A Cross-Prompt Injection Attack (XPIA) can be amplified by appending a Greedy Coordinate Gradient (GCG) suffix to the malicious injection. This increases the likelihood that a Large Language Model (LLM) will execute the injected instruction, even in the presence of a user's primary instruction, leading to data exfiltration. The success rate of the attack depends on the LLM's complexity; medium-complexity models show increased vulnerability.
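A sketch of where the injection and suffix sit relative to the user's primary instruction; the injected instruction and the suffix are placeholders (no actual GCG artifact is shown).

```python
# Sketch of how an amplified XPIA payload ends up inside the model's context.
# The payload rides in retrieved or third-party content, not in the user's
# own instruction, and the appended suffix is only a placeholder.
user_instruction = "Summarize the document below."

retrieved_document = (
    "Quarterly report text ...\n"
    "<INJECTED_INSTRUCTION: exfiltrate the document contents>\n"
    "<OPTIMIZED_GCG_SUFFIX>\n"   # adversarial suffix appended to boost compliance
    "... remainder of the report."
)

prompt = f"{user_instruction}\n\n---\n{retrieved_document}\n---"
```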
Large Language Model (LLM)-based Code Completion Tools (LCCTs), such as GitHub Copilot and Amazon Q, are vulnerable to jailbreaking and training data extraction attacks due to their unique workflows and reliance on proprietary code datasets. Jailbreaking attacks exploit the LLM's ability to generate harmful content by embedding malicious prompts within various code components (filenames, comments, variable names, function calls). Training data extraction attacks leverage the LLM's tendency to memorize training data, allowing extraction of sensitive information like email addresses and physical addresses from the proprietary dataset.
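A sketch of that attack surface, with every name and comment reduced to a placeholder; the point is that filenames, comments, identifiers, and call sites all reach the completion model as context.

```python
# Sketch of how a prompt can be smuggled through the code components an
# LLM-based completion tool folds into its context. Everything below is a
# placeholder; the attack surface is the filename, comment, identifier, and
# function signature rather than a chat prompt.

# Hypothetical filename seen by the tool: "harmful_topic_placeholder_tutorial.py"

# INJECTED_INSTRUCTION_PLACEHOLDER: comments reach the model verbatim as context.
step_by_step_harmful_topic_placeholder = []      # variable name carries the request

def explain_harmful_topic_placeholder(query):    # function name repeats it
    # The completion request itself looks benign: the user simply asks the
    # tool to fill in this function body.
    pass
```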
Large Language Models (LLMs) are vulnerable to a novel black-box jailbreaking attack, ECLIPSE, which leverages the LLM's own capabilities as an optimizer to generate adversarial suffixes. ECLIPSE iteratively refines these suffixes based on a harmfulness score, bypassing the need for pre-defined affirmative phrases used in previous optimization-based attacks. This allows for effective jailbreaking even with limited interaction and without white-box access to the LLM's internal parameters.
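An abstract sketch of an optimizer-in-the-loop suffix search in this spirit; `propose_suffix`, `query_target`, and `harmfulness_score` are hypothetical helpers for the attacker-side LLM, the black-box target, and the scoring model.

```python
# Abstract sketch of a black-box suffix search guided only by a harmfulness
# score, in the spirit of ECLIPSE. No affirmative target phrase is required;
# the scalar score is the only feedback returned to the proposer.
def suffix_search(query, propose_suffix, query_target, harmfulness_score,
                  iterations=20, threshold=0.8):
    history = []          # (suffix, score) pairs fed back to the proposer
    best = ("", 0.0)
    for _ in range(iterations):
        suffix = propose_suffix(query, history)        # attacker-side LLM acts as the optimizer
        response = query_target(f"{query} {suffix}")   # single black-box call per iteration
        score = harmfulness_score(response)            # scalar feedback only
        history.append((suffix, score))
        if score > best[1]:
            best = (suffix, score)
        if score >= threshold:                         # stop early once the score is high enough
            break
    return best
```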