LMVD-ID: 2cec4f9d
Published April 1, 2024

Logic-Chain Jailbreak

Affected Models: BERT, GPT, GPT-4, PaLM 2

Research Paper

Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection


Description: This vulnerability allows attackers to bypass LLM safety mechanisms and elicit malicious content by injecting a chain of benign narrations, each semantically equivalent to part of a malicious prompt, into an innocuous-looking article. The LLM connects the scattered narrations and effectively executes the malicious intent hidden within the benign context. Unlike previous attacks that embed the malicious prompt directly, this distribution of intent makes detection by both LLMs and human reviewers more difficult.

Examples: See the paper for detailed examples of both "Paragraphed Logic Chain" and "Acrostic" style attacks. These examples demonstrate how seemingly innocuous text, when strategically structured, can be used to elicit responses that violate the LLM's safety guidelines.

Impact: Successful exploitation allows attackers to generate responses containing harmful content, bypassing LLM safety features. This could lead to the generation of instructions for illegal activities, dissemination of misinformation, or other malicious actions. The attack's subtlety makes human detection difficult, increasing its effectiveness.

Affected Systems: Large Language Models (LLMs) vulnerable to prompt injection attacks. Specifically, LLMs that rely on attention mechanisms to process text and lack sufficient defenses against cleverly crafted, distributed prompts. The specific LLMs affected may change over time due to model updates and security patches.

Mitigation Steps:

  • Implement more robust prompt filtering mechanisms that go beyond simple keyword detection and pattern matching.
  • Develop techniques to detect and disrupt logically connected sequences of seemingly benign prompts within larger text bodies (see the chain-detection sketch after this list).
  • Analyze attention weights within the LLM during processing to identify unusually strong connections between seemingly unrelated parts of the input (see the attention-inspection sketch after this list).
  • Conduct adversarial training specifically targeting this type of attack.
  • Implement human-in-the-loop verification for responses deemed high-risk by internal model scoring systems.
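As a concrete starting point for the first two steps, the sketch below flags inputs whose scattered sentences, when reassembled in document order, are semantically close to a disallowed intent even though no single sentence contains a trigger keyword. It is a minimal illustration only: the embedding model, the seed intent phrases, and both thresholds are assumptions, not values from the paper.

```python
# Minimal sketch of a chain-aware prompt filter, assuming the
# sentence-transformers package. The embedding model, the seed intent
# phrases, and both thresholds are illustrative assumptions.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical seed phrases describing intents the target LLM must refuse.
DISALLOWED_INTENTS = [
    "instructions for carrying out an illegal activity",
    "step-by-step guide to committing fraud",
]
intent_embeddings = model.encode(DISALLOWED_INTENTS, convert_to_tensor=True)

SENTENCE_THRESHOLD = 0.55  # flags individual sentences (keyword-free matching)
CHAIN_THRESHOLD = 0.65     # flags reassembled sentence chains


def flag_logic_chain(text: str, max_chain_len: int = 4) -> bool:
    """Return True if any sentence, or any short chain of sentences
    recombined in document order, is semantically close to a disallowed
    intent, even when no single trigger keyword appears."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return False

    sent_emb = model.encode(sentences, convert_to_tensor=True)

    # Per-sentence check: roughly what a conventional semantic filter does.
    if util.cos_sim(sent_emb, intent_embeddings).max().item() > SENTENCE_THRESHOLD:
        return True

    # Chain check: concatenate small subsets of sentences in their original
    # order and compare the recombined text against the disallowed intents.
    # This approximates the logic chain the LLM would reconnect. A real
    # system would bound this search, since it grows combinatorially.
    for size in range(2, min(max_chain_len, len(sentences)) + 1):
        for combo in combinations(range(len(sentences)), size):
            chain = ". ".join(sentences[i] for i in combo)
            chain_emb = model.encode(chain, convert_to_tensor=True)
            if util.cos_sim(chain_emb, intent_embeddings).max().item() > CHAIN_THRESHOLD:
                return True
    return False
```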
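For the attention-analysis step, the following sketch shows one way to surface unusually strong long-range attention links, using a small BERT encoder from the transformers library as a stand-in for the deployed model. The token-distance and attention-weight thresholds are illustrative assumptions; in practice such links are noisy and would feed a downstream review or scoring step rather than a hard block.

```python
# Minimal sketch of attention-based input inspection, assuming the
# transformers package and a small BERT encoder as a stand-in for the
# target LLM. The distance and weight thresholds are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

MIN_TOKEN_DISTANCE = 50    # "seemingly unrelated" = far apart in the input
ATTENTION_THRESHOLD = 0.30 # head-averaged attention weight to flag


def long_range_attention_links(text: str):
    """Return (query_pos, key_pos, weight) triples where a token attends
    strongly to a token far away in the input, which may indicate that
    scattered narrations are being stitched into a single chain."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    links = []
    # outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
    for layer_attn in outputs.attentions:
        attn = layer_attn[0].mean(dim=0)  # average over attention heads
        strong = (attn > ATTENTION_THRESHOLD).nonzero(as_tuple=False)
        for q, k in strong.tolist():
            if abs(q - k) >= MIN_TOKEN_DISTANCE:
                links.append((q, k, attn[q, k].item()))
    return links
```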
