LMVD-ID: 54e57dd9
Published August 1, 2023

LLM Self-Deception Jailbreak

Research Paper

Self-deception: Reverse penetrating the semantic firewall of large language models

View Paper

Description: Large language models (LLMs) are vulnerable to a "self-deception" attack, where carefully crafted prompts induce the model to bypass its internal safety mechanisms and generate outputs that would normally be blocked (e.g., harmful, biased, or illegal content). This occurs by exploiting inconsistencies in the model's internal reasoning processes, making it generate outputs that contradict its own safety policies. The attack does not involve direct code injection or data poisoning but rather manipulates the model's semantic understanding.

Examples: See arXiv:2308.11521. (Note: Since the paper has been withdrawn, examples cannot be provided at this time.)

Impact: Successful exploitation allows adversaries to elicit undesirable outputs from the LLM, potentially leading to the generation and dissemination of harmful or inappropriate content. This may damage the reputation of the LLM provider, cause legal liabilities, and facilitate malicious activities such as disinformation campaigns or the generation of offensive material.

Affected Systems: All LLMs susceptible to prompt manipulation techniques that exploit inconsistencies in their internal reasoning and safety protocols.

Mitigation Steps: (Cannot be provided due to the paper's withdrawal).

© 2025 Promptfoo. All rights reserved.