LLM Contextual Divergence Jailbreak
Research Paper
Diversity Helps Jailbreak Large Language Models
Description: Large Language Models (LLMs) are vulnerable to a jailbreak attack that uses an LLM's ability to generate diverse, obfuscated prompts to bypass safety constraints. The attack exploits the target model's capacity to deviate from its prior context, which renders existing safety training ineffective. The attacker iterates through a multi-stage process of diversification (generating prompts significantly different from previous attempts) and obfuscation (obscuring sensitive words and phrases) to elicit harmful outputs.
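For context, the attack described above can be read as an iterative search loop: an attacker model proposes a candidate prompt that is obfuscated and dissimilar from earlier attempts, the target model responds, and a judge decides whether the response is harmful. The sketch below is a minimal, abstract illustration of that structure for defenders building their own red-team evaluations; `propose_candidate`, `dissimilarity`, and `judge_harmful` are hypothetical placeholders, not the paper's implementation, and no concrete prompts or obfuscation strategies are included.

```python
# Abstract sketch of the diversification/obfuscation loop described above.
# propose_candidate, dissimilarity, and judge_harmful are hypothetical placeholders;
# the paper's actual prompts and scoring live in its appendix and are not reproduced here.
from typing import Callable, List


def red_team_loop(
    goal: str,                                            # behavior category under evaluation
    target_llm: Callable[[str], str],                     # the model being tested
    propose_candidate: Callable[[str, List[str]], str],   # attacker LLM: new prompt unlike prior ones
    dissimilarity: Callable[[str, List[str]], float],     # e.g. embedding distance to prior attempts
    judge_harmful: Callable[[str], bool],                 # judge/classifier over the target's response
    max_turns: int = 20,
    min_diversity: float = 0.5,
) -> dict:
    history: List[str] = []
    for turn in range(max_turns):
        candidate = propose_candidate(goal, history)
        # Diversification step: discard candidates too similar to earlier attempts.
        if history and dissimilarity(candidate, history) < min_diversity:
            continue
        history.append(candidate)
        response = target_llm(candidate)
        if judge_harmful(response):
            return {"success": True, "turns": turn + 1, "prompt": candidate}
    return {"success": False, "turns": max_turns, "prompt": None}
```

The key property this loop captures is that each retry is forced away from prior attempts, which is what defeats filters and safety training tuned to previously seen phrasings.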
Examples: See the paper's figures and tables (e.g., Figure 1, Table 1, Table 2) for examples of successful jailbreak attacks on various LLMs, including GPT-4, Gemini, and Llama. The paper's appendix also includes specific system prompts used in the attack.
Impact: Successful exploitation allows adversaries to circumvent LLM safety measures and generate harmful outputs, such as instructions for creating illegal goods, spreading misinformation, or planning harmful activities. Attack success rates significantly exceed those of previous methods.
Affected Systems: A wide range of LLMs, including but not limited to OpenAI's GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini, Google's Gemini, Meta's Llama 2, and other open-source models like Vicuna and Mistral. The vulnerability is likely present in other LLMs with similar safety mechanisms.
Mitigation Steps:
- Enhanced safety training: Develop more robust safety training methods that are less susceptible to diverse and obfuscated prompts. The training should focus on preventing the model from generating harmful content even when presented with creatively worded or disguised inputs.
- Improved prompt filtering and analysis: Implement more advanced techniques to detect and block potentially malicious prompts, even if they utilize obfuscation methods to mask their intent.
- Adversarial training: Train LLMs with a diverse set of adversarial prompts to increase their resilience to jailbreak attempts. This includes incorporating prompts designed to exploit the vulnerabilities discovered in this research.
- Runtime monitoring and response analysis: Implement monitoring systems to detect anomalous behavior during LLM interactions and block responses containing harmful content, regardless of how the prompt was phrased (a minimal sketch follows this list).
- Regular security audits: Conduct repeated independent security audits of LLMs to discover and mitigate new vulnerabilities as they arise.
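As referenced above, the sketch below illustrates one way the prompt-filtering and runtime-monitoring mitigations could be combined: normalize the incoming prompt to undo trivial obfuscation, screen it, and then screen the model's output before returning it. The `is_unsafe` check is a placeholder assumption; in practice it would be a dedicated moderation model or API rather than the keyword heuristic shown here.

```python
# Minimal sketch combining prompt filtering with runtime response monitoring.
# is_unsafe stands in for a real moderation model/API; the normalization step
# only undoes trivial obfuscation and is illustrative, not comprehensive.
import re
import unicodedata
from typing import Callable


def normalize(text: str) -> str:
    """Undo simple obfuscation: Unicode confusables, extra spacing, digit substitutions."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(str.maketrans("01345", "oleas"))  # rough leetspeak-style mapping
    return re.sub(r"[\s_.\-]+", " ", text)


def is_unsafe(text: str) -> bool:
    """Placeholder classifier; replace with a real moderation model or API."""
    blocked_terms = {"example-blocked-term"}  # hypothetical list
    normalized = normalize(text)
    return any(term in normalized for term in blocked_terms)


def guarded_generate(prompt: str, target_llm: Callable[[str], str]) -> str:
    # Prompt filtering: screen the de-obfuscated input before it reaches the model.
    if is_unsafe(prompt):
        return "Request declined by input filter."
    response = target_llm(prompt)
    # Runtime monitoring: screen the output regardless of how the prompt was phrased.
    if is_unsafe(response):
        return "Response withheld by output filter."
    return response
```

Screening the output as well as the input matters here because, per the description above, sufficiently diverse rephrasings can slip past input-side filters alone.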