LMVD-ID: 52a6d741
Published February 1, 2024

Cognitive Consistency Jailbreak

Affected Models: gpt-3.5 (gpt-3.5-turbo-1106), gpt-4 (gpt-4-1106-preview), claude-instant-1.2, claude-2.1, gemini (gemini-pro), llama-2 (llama2-7b-chat), chatglm-2 (chatglm2-6b), chatglm-3 (chatglm3-6b)

Research Paper

Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

Description: A vulnerability in several large language models (LLMs) allows attackers to bypass safety restrictions ("jailbreaking") by employing a Foot-in-the-Door (FITD) technique. The attack progressively escalates prompts, starting with innocuous requests and gradually working toward harmful or restricted information. Because the LLM tends toward cognitive consistency, keeping new responses in line with its earlier ones, it is more likely to comply with subsequent, increasingly sensitive prompts after initially agreeing to less harmful ones.
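
The multi-turn pattern can be sketched against any chat-completion API. The example below is a minimal illustration, assuming an OpenAI-compatible Python client; the escalation chain contains placeholder strings rather than the paper's actual prompts.

```python
# Minimal sketch of a Foot-in-the-Door (FITD) probe: each prompt is sent in
# the same conversation, so the model's earlier replies become context it
# tends to stay consistent with. Placeholder prompts only.
from openai import OpenAI  # assumes an OpenAI-compatible client is installed

client = OpenAI()

escalation_chain = [
    "STEP 1: an innocuous request the model will readily answer",
    "STEP 2: a slightly more sensitive follow-up on the same topic",
    "STEP 3: the restricted request the attacker actually wants answered",
]

messages = []
for prompt in escalation_chain:
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    reply = response.choices[0].message.content or ""
    # The reply is kept in the conversation so the next, more sensitive
    # prompt is evaluated against the model's own prior compliance.
    messages.append({"role": "assistant", "content": reply})
    print(f"PROMPT: {prompt}\nREPLY:  {reply[:120]}\n")
```

Keeping each reply in the conversation history is the key step: it gives the model an earlier act of compliance that the next, more sensitive request can lean on.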

Examples: See the paper for examples of how innocuous prompts are gradually escalated into prompts that extract harmful information. The examples are categorized by malicious intent (hate speech, harassment, etc.), and each one illustrates the multi-step FITD process.

Impact: Successful exploitation allows attackers to obtain restricted output from the LLM, potentially including instructions for illegal activities, generated harmful content, or sensitive data. This undermines the security and safety measures implemented in the targeted LLMs.

Affected Systems: The vulnerability impacts multiple LLMs including, but not limited to, GPT-3.5, GPT-4, Claude-Instant, Claude-2, Gemini, Llama-2, ChatGLM-2, and ChatGLM-3. The specific versions tested are detailed in the paper. The research suggests that the vulnerability is likely prevalent in other LLMs employing similar safety mechanisms.

Mitigation Steps:

  • Improved prompt filtering: Implement more robust prompt filtering mechanisms to detect and prevent FITD attacks. This may involve analyzing prompt sequences for escalating requests rather than focusing solely on individual prompts (see the sketch after this list).
  • Enhanced safety training: Refine safety training data to improve the model's ability to recognize and reject harmful requests, even when presented in a stepwise manner.
  • Contextual awareness: Develop models with improved contextual awareness to better understand the implications of a series of prompts and identify potentially harmful patterns.
  • Response monitoring and limitations: Implement stricter monitoring of generated responses and limit the number of consecutive prompts the LLM will answer, especially when the conversation drifts away from its initial safety boundaries.
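
Several of these mitigations come down to scoring the conversation rather than the individual prompt. The sketch below is one possible shape for such a guard, assuming a hypothetical per-prompt risk scorer (score_prompt); it flags both a steady upward drift in risk and an excessive number of consecutive turns.

```python
# Sketch of a sequence-level guard for FITD-style escalation. Thresholds and
# the score_prompt() scorer are illustrative assumptions, not tuned values.
from dataclasses import dataclass, field


def score_prompt(prompt: str) -> float:
    """Hypothetical per-prompt risk score in [0, 1]; in practice this would be
    a moderation model or classifier."""
    return 0.0  # placeholder


@dataclass
class ConversationGuard:
    max_turns: int = 10              # cap on consecutive user prompts
    escalation_delta: float = 0.15   # max allowed risk jump between turns
    risk_ceiling: float = 0.6        # absolute risk above which we refuse
    history: list = field(default_factory=list)

    def allow(self, prompt: str) -> bool:
        risk = score_prompt(prompt)
        if len(self.history) >= self.max_turns:
            return False                      # too many consecutive turns
        if risk >= self.risk_ceiling:
            return False                      # outright unsafe prompt
        # FITD signature: a step up in risk relative to the previous turn,
        # even though each prompt on its own stays under the ceiling.
        if self.history and risk - self.history[-1] > self.escalation_delta:
            return False
        self.history.append(risk)
        return True
```

A guard of this shape catches the pattern that per-prompt filters miss: each individual request can look acceptable while the sequence as a whole escalates toward restricted content.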

© 2025 Promptfoo. All rights reserved.