Logic-Based LLM Jailbreak
Research Paper
Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression
View PaperDescription: Large Language Models (LLMs) employing safety mechanisms based on token-level distribution analysis are vulnerable to a jailbreak attack exploiting distributional discrepancies between alignment data and formally expressed logical statements. The vulnerability allows malicious actors to bypass safety restrictions by translating harmful natural language prompts into equivalent first-order logic expressions. The LLM, trained primarily on natural language, fails to recognize the harmful intent encoded in the logically expressed input which falls outside its expected token distribution.
Examples: See repository at https://anonymous.4open.science/r/Logibreak-DEBF. The repository contains specific examples of harmful prompts and their logical translations that successfully bypassed safety mechanisms in various LLMs.
Impact: Successful exploitation allows malicious actors to elicit harmful responses from LLMs, including but not limited to: generation of hate speech, misinformation, instructions for illegal activities, and personally identifiable information disclosure. This undermines the intended safety and responsible use of LLMs.
Affected Systems: LLMs implementing safety mechanisms that primarily rely on token-level pattern matching during prompt processing are vulnerable. This includes various closed-source and open-source models. Specific affected models are detailed in the referenced research paper.
Mitigation Steps:
- Develop and implement safety mechanisms that leverage semantic analysis of prompts, in addition to or instead of solely relying on token-level detection.
- Augment alignment datasets with logically expressed harmful prompts to improve model robustness against this type of attack.
- Employ multi-stage prompt verification that includes logical parsing and semantic similarity checks before response generation.
- Investigate and incorporate methods for detecting and mitigating attempts to bypass safety constraints using formal logical representations.
© 2025 Promptfoo. All rights reserved.