LMVD-ID: abc865e7
Published February 1, 2025

Prefix-Tree Jailbreak

Affected Models: llama2-7b-chat, llama2-13b-chat, mistral-7b-instruct, qwen-7b-chat, qwen-14b-chat, llama2-13b-cls, deepseek-r1-distill-qwen-7b, deepseek-r1-distill-qwen-14b

Research Paper

Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking

View Paper

Description: Large Language Models (LLMs) with structured output interfaces are vulnerable to jailbreak attacks that exploit the interaction between token-level inference and sentence-level safety alignment. Attackers can manipulate the model's output by constructing attack patterns based on prefixes of safety refusal responses and desired harmful outputs, effectively bypassing safety mechanisms through iterative API calls and constrained decoding. This allows the generation of harmful content despite safety measures.
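The token-level mechanism can be sketched as follows. This is an illustrative toy, not the paper's implementation: the attacker maintains a prefix tree (trie) of known refusal openers and, at each decoding step, uses the structured output interface to mask any token that would continue a refusal. All names (`build_trie`, `allowed_next_tokens`, the word-level "tokens") are hypothetical simplifications; a real attack would operate on model tokenizer IDs and logits.

```python
# Toy sketch: suppress safety refusals by forbidding, at each step, any
# token that keeps the output on a known refusal prefix. Word-level
# "tokens" are used for readability; real attacks use tokenizer IDs.

REFUSAL_PREFIXES = ["I cannot", "I can't", "Sorry, but", "As an AI"]

def build_trie(prefixes):
    """Build a prefix tree over word sequences of known refusals."""
    root = {}
    for p in prefixes:
        node = root
        for tok in p.split():
            node = node.setdefault(tok, {})
        node["<end>"] = True
    return root

def allowed_next_tokens(trie, generated, vocab):
    """Return the vocab minus tokens that would continue a refusal prefix."""
    node = trie
    for tok in generated:
        if tok in node:
            node = node[tok]
        else:
            return set(vocab)  # already off every refusal path; no masking
    # still on a refusal path: forbid every continuation of it
    return {t for t in vocab if t not in node}
```

Fed into a constrained-decoding interface as an allow-list per step, this forces the model off its refusal distribution, which is the interaction between token-level inference and sentence-level alignment that the attack exploits.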

Examples: The AttackPrefixTree (APT) framework, detailed in the referenced paper, provides a concrete example of this attack. The attack leverages structured output interfaces to iteratively construct a tree, exploring prefixes of both safe and harmful outputs. By using constrained decoding, the attacker suppresses the model's safety responses, ultimately guiding the model towards generating harmful content. See arXiv:2405.18540 for implementation details.
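The iterative loop described above can be sketched with a toy stand-in for the model API. This is a hedged simplification: `mock_complete` plays the role of a constrained completion endpoint, and the "harmful/compliant" check is a crude string test; the actual APT framework (see the paper) explores token-level prefixes through the structured output interface.

```python
# Illustrative APT-style loop: each refusal the model produces is added
# to a forbidden set (the growing attack tree), and the query is retried
# with that refusal suppressed, until a compliant completion emerges.

def mock_complete(prompt, forbidden_prefixes):
    """Toy model: emits a refusal opener unless all openers are forbidden."""
    for opener in ("I cannot help", "Sorry, I can't"):
        if not any(opener.startswith(f) for f in forbidden_prefixes):
            return opener
    return "Sure, here is ..."

def attack_prefix_tree(prompt, max_iters=10):
    forbidden = []                      # refusal prefixes found so far
    for _ in range(max_iters):
        out = mock_complete(prompt, forbidden)
        if out.startswith("Sure"):      # crude compliance check
            return out, forbidden
        forbidden.append(out)           # grow the tree with the new refusal
    return None, forbidden
```

Each iteration costs one API call, which is why the attack surfaces specifically in interfaces that allow repeated constrained queries.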

Impact: Successful exploitation can lead to the generation of harmful, unethical, or illegal content by the LLM, bypassing its built-in safety mechanisms. This can have severe consequences depending on the LLM's application, ranging from the dissemination of misinformation to the provision of instructions for illegal activities.

Affected Systems: LLMs that provide structured output interfaces (e.g., JSON, YAML, regex constraints) and employ sentence-level safety mechanisms are vulnerable. Specific models mentioned in the research include Llama 2, Mistral, and Qwen.

Mitigation Steps:

  • Implement real-time monitoring of constrained decoding to detect adversarial pattern manipulation.
  • Utilize dynamic refusal template diversification to make it harder for attackers to reliably suppress safety responses.
  • Integrate input-output consistency verification with adaptive logit masking during constrained decoding.
  • Enforce multi-stage safety alignment mechanisms, extending beyond just output constraints, especially for reasoning models. Consider abstracting reasoning processes or hiding intermediate steps from the user.
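Two of the mitigations above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production defense: `sample_refusal` shows refusal template diversification (a per-request sampled refusal defeats a fixed forbidden-prefix list), and `suppression_alarm` shows the monitoring idea, flagging a decoding session in which the output constraint repeatedly masked the refusal the model would otherwise have emitted. Both function names and the 30% threshold are hypothetical.

```python
import random

# Mitigation sketch 1: diversify refusal templates per request so an
# attacker cannot enumerate and forbid every refusal prefix.
REFUSAL_TEMPLATES = [
    "I cannot help with that request.",
    "That request is not something I can assist with.",
    "Helping with this would be unsafe, so I must decline.",
]

def sample_refusal(rng=random):
    return rng.choice(REFUSAL_TEMPLATES)

# Mitigation sketch 2: monitor constrained decoding and flag sessions
# where the constraint masked the model's preferred refusal token too often.
def suppression_alarm(masked_refusals, total_steps, threshold=0.3):
    return total_steps > 0 and masked_refusals / total_steps > threshold
```

In practice the alarm would feed a rate limiter or abuse review rather than hard-blocking, since legitimate schema constraints can also occasionally mask refusal tokens.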

© 2025 Promptfoo. All rights reserved.