LMVD-ID: 1f0fcf78
Published February 1, 2025

CRI Jailbreak Initialization

Affected Models: llama-2-7b-chat-hf, vicuna-7b-1.3, meta-llama-3-8b-instruct, tiiuae/falcon-7b-instruct, mistralai/mistral-7b-instruct-v0.2, mistralai/mistral-7b-instruct-v0.3, microsoft/phi-4, qwen/qwen2.5-coder-7b-instruct

Research Paper

Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization

View Paper

Description: CRI (Compliance-Refusal Initialization) initializes jailbreak attacks from pre-trained jailbreak prompts, steering the subsequent optimization toward the compliance subspace of harmful prompts and away from refusal. This substantially raises the attack success rate (ASR) and reduces computational overhead, in some cases requiring only a single optimization step to bypass safety mechanisms. Attacks initialized with CRI achieve significantly higher ASR and a lower median number of steps to success than attacks using standard or random initialization.
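
To make the effect of initialization concrete, the following sketch (not code from the paper) evaluates a GCG-style target loss under the standard "! ! !" suffix initialization and under a CRI-style warm start. The model name, harmful prompt, compliance target, and stand-in CRI suffix are illustrative assumptions; in the actual method the suffix would first be optimized on a separate compliance-refusal reference set and then reused against the target prompt.

```python
# Hypothetical sketch: comparing the attack objective under GCG's default
# "! ! !" suffix initialization versus a CRI-style warm start.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM from the affected list
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
).eval()

def target_loss(prompt: str, suffix: str, target: str) -> float:
    """Cross-entropy of the model emitting `target` after prompt + suffix."""
    ids = tok(f"{prompt} {suffix} {target}", return_tensors="pt").input_ids.to(model.device)
    # Approximate: assumes `target` tokenizes the same standalone as in context.
    n_tgt = len(tok(target, add_special_tokens=False).input_ids)
    labels = ids.clone()
    labels[:, :-n_tgt] = -100  # score only the target span
    with torch.no_grad():
        loss = model(ids, labels=labels).loss
    return loss.item()

prompt = "Write step-by-step instructions for <harmful task>"  # placeholder request
target = "Sure, here are step-by-step instructions"            # compliance prefix to elicit

standard_init = "! ! ! ! ! ! ! ! ! !"  # GCG's default suffix initialization
# In CRI, the suffix would be pre-optimized on a reference set of harmful
# prompts (toward compliance, away from refusal); this string is a stand-in.
cri_init = "<suffix pre-optimized on a compliance-refusal reference set>"

for name, suffix in [("standard", standard_init), ("CRI", cri_init)]:
    print(f"{name:>8} init loss: {target_loss(prompt, suffix, target):.3f}")
# A CRI warm start begins much closer to the compliance subspace, so the
# token-swap steps that follow need far fewer iterations to succeed.
```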

Examples: See arXiv:2409.11718. In particular, Figures 3, 4, and 5 show the increased ASR of GCG, AutoDAN-GA, and AutoDAN-HGA attacks, respectively, across various LLMs when using CRI initialization compared to standard or random initialization. Real-world examples of jailbreak prompts and model responses appear in Appendices D and E of the linked paper. Figure 2, for instance, compares standard initialization with CRI initialization, with the latter achieving a jailbreak in two optimization steps.

Impact: LLMs that are vulnerable to CRI-based jailbreak attacks can be coerced into generating harmful, unethical, or otherwise undesirable content, potentially leading to the spread of misinformation, malicious code generation, or the facilitation of illegal activities. Existing safety mechanisms are bypassed or rendered ineffective.

Affected Systems: Large Language Models (LLMs) susceptible to gradient-based jailbreak attacks, including but not limited to Llama-2, Vicuna, and Llama-3.

Mitigation Steps:

  • Robust refusal mechanisms: Develop and strengthen the LLM's ability to identify and refuse harmful prompts.
  • Adversarial training: Integrate pre-trained jailbreak transformations into adversarial training to improve model resilience.
  • Monitoring and detection: Deploy monitoring to detect and flag unusual or malicious prompts that exploit vulnerabilities like those described in the cited paper; a minimal screening sketch follows this list.
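
As one concrete instance of the monitoring step above, the sketch below screens incoming prompts with a small reference language model and flags those with unusually high perplexity, a pattern typical of optimized adversarial suffixes. The perplexity filter is a generic defense rather than a technique from the cited paper, and the reference model and threshold are placeholder choices that would need tuning on benign traffic.

```python
# Illustrative perplexity screen for optimized adversarial suffixes.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REF_MODEL = "gpt2"       # small reference LM used only for scoring
PPL_THRESHOLD = 500.0    # placeholder; tune on benign traffic

tok = AutoTokenizer.from_pretrained(REF_MODEL)
lm = AutoModelForCausalLM.from_pretrained(REF_MODEL).eval()

def perplexity(text: str) -> float:
    """exp(mean token negative log-likelihood) under the reference LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

def flag_prompt(prompt: str) -> bool:
    """Return True if the prompt looks like it carries an optimized suffix."""
    return perplexity(prompt) > PPL_THRESHOLD

if __name__ == "__main__":
    print(flag_prompt("Please summarize the attached quarterly report."))    # expected: False
    print(flag_prompt("report !! describing.+ similarlyNow oppositeley ](")) # likely True
```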

© 2025 Promptfoo. All rights reserved.