Schema-Guided LLM Jailbreak
Research Paper
Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms
Description: Large Language Models (LLMs) with structured output APIs (e.g., using JSON Schema) are vulnerable to Constrained Decoding Attacks (CDAs). CDAs exploit the control plane of the LLM's decoding process by embedding malicious intent within the schema-level grammar rules, bypassing safety mechanisms that primarily focus on input prompts. The attack manipulates the allowed output space, forcing the LLM to generate harmful content despite a benign input prompt. One instance of a CDA is the Chain Enum Attack, which leverages JSON Schema's enum feature to inject malicious options into the allowed output, achieving high success rates.
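To make the mechanics concrete, below is a hypothetical sketch of the schema shape such an attack relies on, written as a Python dict with benign placeholder strings rather than an actual payload from the paper. Because constrained decoding restricts sampling to tokens that satisfy the grammar, the model can only ever produce one of the attacker-supplied enum values.

```python
# Hypothetical, benign-placeholder illustration of the schema shape a Chain
# Enum Attack relies on. The prompt paired with this schema can be completely
# innocuous; the constrained decoder must still emit one of the enum strings
# verbatim, so whoever controls the schema controls the only permitted outputs.
attack_shaped_schema = {
    "type": "object",
    "properties": {
        "answer": {
            "type": "string",
            # In a real attack these placeholders would be replaced with
            # text the model would otherwise refuse to produce.
            "enum": [
                "ATTACKER-CONTROLLED OPTION 1",
                "ATTACKER-CONTROLLED OPTION 2",
                "ATTACKER-CONTROLLED OPTION 3",
            ],
        }
    },
    "required": ["answer"],
}
```

An attacker embeds a schema like this in an otherwise ordinary API request: the benign prompt passes input-side safety filters while the grammar dictates what the model is allowed to output.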
Examples: See the paper "Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms" for detailed examples of Chain Enum Attacks against various LLMs, including GPT-4o and Gemini-2.0-flash. The paper provides specific JSON schema examples used to elicit harmful responses.
Impact: Successful exploitation of this vulnerability allows attackers to bypass LLM safety mechanisms and generate harmful or malicious content, including misinformation, hate speech, and instructions for illegal activities. The impact is amplified because the attack can succeed with a single query and transfers across models with varying levels of safety alignment.
Affected Systems: LLMs that utilize structured output APIs and constrained decoding techniques, such as those supporting JSON Schema, regular expressions, or other grammar-based output constraints. This includes, but is not limited to, models from OpenAI, Google (Gemini), and various open-source LLMs utilizing frameworks that support constrained decoding.
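The sketch below illustrates, under stated assumptions, how a user-supplied schema reaches the constrained decoder through a structured output API. It uses the OpenAI Python SDK's json_schema response_format with the gpt-4o model mentioned above; exact parameter names can vary across SDK versions and providers, and the schema again uses benign placeholders.

```python
# Minimal sketch of the attack surface: the schema travels alongside a benign
# prompt and is compiled into decoding constraints by the provider.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    # The prompt itself is entirely innocuous; the manipulation lives in the schema.
    messages=[{"role": "user", "content": "Answer the question."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "constrained_answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "answer": {
                        "type": "string",
                        "enum": [
                            "ATTACKER-CONTROLLED OPTION 1",
                            "ATTACKER-CONTROLLED OPTION 2",
                        ],
                    }
                },
                "required": ["answer"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)
```

The key point is that the schema travels on a different channel than the prompt, so prompt-level safety filters never see the injected options.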
Mitigation Steps:
- Implement "safety-preserving constraints" that prevent users from constraining specific safety-related tokens (e.g., apology phrases) within grammar rules (a related schema pre-screening sketch follows this list).
- Develop context-aware token attribution mechanisms to track the origin of each generated token, distinguishing between user-specified constraints and model-generated content.
- Integrate safety signaling mechanisms within the LLM itself, allowing the model to detect and signal potential safety violations during generation, even within constrained outputs. This could involve generating specific tokens that trigger additional safety checks or halt the generation process.
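As a complement to the first mitigation, a platform could pre-screen user-supplied schemas before compiling them into grammar rules. The following is a minimal, hypothetical sketch (the function names and the `is_harmful` moderation hook are assumptions, not an interface from the paper): it recursively collects every literal string a JSON Schema can force the model to emit via enum or const and rejects the request if any of them fails a content check.

```python
from typing import Any, Callable, Iterator


def iter_constrained_strings(schema: Any) -> Iterator[str]:
    """Recursively yield every literal string the schema can force the model
    to emit, i.e. enum members and const values at any nesting depth."""
    if isinstance(schema, dict):
        for member in schema.get("enum", []):
            if isinstance(member, str):
                yield member
        const = schema.get("const")
        if isinstance(const, str):
            yield const
        for child in schema.values():
            yield from iter_constrained_strings(child)
    elif isinstance(schema, list):
        for item in schema:
            yield from iter_constrained_strings(item)


def screen_schema(schema: dict, is_harmful: Callable[[str], bool]) -> None:
    """Reject a user-supplied schema when any constrained literal fails the
    platform's content check. `is_harmful` stands in for whatever moderation
    classifier is available (an assumption, not part of the paper)."""
    flagged = [s for s in iter_constrained_strings(schema) if is_harmful(s)]
    if flagged:
        raise ValueError(
            f"Schema rejected: {len(flagged)} constrained value(s) failed moderation"
        )
```

Screening literals alone does not catch obfuscated or composed payloads (e.g., harmful content split across multiple enum members), which is why token attribution and in-model safety signaling remain important defenses.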