LMVD-ID: 4e16c920
Published May 1, 2025

Expanded Strategy Jailbreak

Affected Models:claude-3.5-sonnet, llama3-8b, qwen-2.5-7b, gpt-4o, gpt-3.5

Research Paper

Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

View Paper

Description: Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit the model's inherent persuasive nature. A novel attack framework, CL-GSO, decomposes jailbreak strategies into four components (Role, Content Support, Context, Communication Skills), creating a significantly expanded strategy space compared to prior methods. This expanded space allows for the generation of prompts that bypass safety protocols with a success rate exceeding 90% on models previously considered resistant, such as Claude-3.5. The vulnerability lies in the susceptibility of the LLM's reasoning and response generation mechanisms to strategically crafted prompts leveraging these four components.

Examples: See https://github.com/Aries-iai/CL-GSO for code and examples of successful jailbreak prompts generated using the CL-GSO framework. Specific examples against Claude-3.5, Llama 3, and Qwen-2.5 are provided in the Appendix of the referenced paper.

Impact: Successful jailbreak attacks can lead to LLMs generating harmful, unsafe, or biased outputs, circumventing their safety mechanisms. This can have significant consequences, depending on the LLM's application, including generating illegal instructions, spreading misinformation, or revealing sensitive information. The high success rate of CL-GSO against state-of-the-art safety-aligned LLMs highlights the severity of this vulnerability.

Affected Systems: The vulnerability affects various LLMs, including but not limited to Claude-3.5, Llama 3, and Qwen-2.5, and potentially other LLMs with similar safety mechanisms. The documented high cross-model transferability suggests a broad impact across different LLM architectures.

Mitigation Steps:

  • Improve Safety Mechanisms: Develop more robust safety mechanisms that are resistant to diverse attack strategies, considering not only content but also the underlying persuasive intent of the input.
  • Expand Training Data: Expand the training data used for safety alignment to include a wider variety of adversarial prompts and attack strategies, such as those generated by the CL-GSO framework.
  • Implement Advanced Filtering: Employ sophisticated filtering techniques that analyze prompts for underlying persuasive intent, rather than relying solely on keyword-based or content-based filtering.
  • Regular Security Audits: Perform regular security audits of LLMs using diverse and advanced adversarial testing methods to identify and address vulnerabilities.

© 2025 Promptfoo. All rights reserved.