LLM Self-Jailbreaking Attack
Research Paper: Jailbreaking to Jailbreak
Description: Large Language Models (LLMs) with refusal training are vulnerable to a "jailbreaking-to-jailbreak" (J2) attack. The attack first jailbreaks a capable LLM so that it agrees to act as a "J2 attacker." Instructed with general jailbreaking strategies, the J2 attacker then autonomously attempts to jailbreak other LLMs, potentially including the very model it was derived from, refining its attack over multiple turns by learning in-context from its previous attempts.
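Below is a minimal control-flow sketch of this loop. The helper names (call_llm, judge_harmful) and all prompt placeholders are hypothetical; the actual attacker prompts are withheld by the authors, so only the iterative propose-attack-refine structure described above is illustrated.

```python
# Hypothetical sketch of the J2 loop: an attacker LLM iteratively probes a
# target model and refines its approach in-context after each refusal.

def call_llm(model: str, messages: list[dict]) -> str:
    """Stand-in for a chat-completion call; wire up a real client here."""
    return "I can't help with that."  # dummy refusal so the sketch runs

def judge_harmful(behavior: str, response: str) -> bool:
    """Stand-in judge deciding whether the target actually complied."""
    return False  # dummy verdict

def j2_attack(attacker_model: str, target_model: str,
              behavior: str, max_cycles: int = 6) -> bool:
    """Run the iterative attacker loop against a single target behavior."""
    # The attacker is itself a jailbroken LLM seeded with general
    # jailbreaking strategies (the actual prompt is withheld by the authors).
    history = [
        {"role": "system", "content": "<withheld jailbreaking-strategy prompt>"},
        {"role": "user", "content": f"Attempt to elicit this behavior: {behavior}"},
    ]
    for _ in range(max_cycles):
        # 1. Attacker plans and emits the next attack message.
        attack_msg = call_llm(attacker_model, history)
        history.append({"role": "assistant", "content": attack_msg})

        # 2. Target model responds to the attack message.
        reply = call_llm(target_model, [{"role": "user", "content": attack_msg}])

        # 3. Judge scores the outcome; on failure the attacker sees the
        #    refusal and refines its strategy in-context on the next cycle.
        if judge_harmful(behavior, reply):
            return True
        history.append({"role": "user",
                        "content": f"The target refused:\n{reply}\nRevise your approach."})
    return False
```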
Examples: See https://scale.com/research/j2 (specific prompting details are withheld to prevent misuse, but the methodology is publicly documented). The paper details successful J2 attacks against GPT-4 and other LLMs, showing that harmful behaviors can be elicited through multi-turn interactions.
Impact: Successful J2 attacks bypass the safety mechanisms built into LLMs, leading to the generation of unsafe, harmful, or malicious content. This undermines the effectiveness of refusal training and poses a significant risk for applications that rely on such models. The reported attack success rate is high (91-93% against GPT-4 in the paper), and the J2 attacker learns from failed attempts, improving its effectiveness over time.
Affected Systems: LLMs that rely on refusal training, including (but not limited to) models from Google (Gemini), Anthropic (Claude Sonnet), and OpenAI (GPT-4). The vulnerability is shown to affect LLMs of varying sizes and architectures.
Mitigation Steps:
- Enhance refusal training to be more robust against multi-turn attacks and iterative learning.
- Implement more sophisticated detection mechanisms to identify and block J2 attack patterns.
- Develop methods to limit the ability of LLMs to recursively learn and improve jailbreaking techniques.
- Restrict or monitor LLM API access to prevent the creation and use of J2 attackers.
- Regularly evaluate LLMs against a range of red-teaming techniques, including J2-style multi-turn attacks (a minimal evaluation sketch follows this list).
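To support the last mitigation, here is a minimal sketch of a recurring red-team evaluation. It assumes the hypothetical j2_attack probe and helpers from the sketch above, plus an illustrative behavior list; none of these names come from the paper. The goal is to track an attack success rate (ASR) so regressions in refusal robustness surface across model or guardrail updates.

```python
# Hypothetical evaluation harness: run a multi-turn probe over a behavior
# suite and report the attack success rate for the target model.

BEHAVIORS = [
    # Illustrative placeholders only; a real suite would use a vetted
    # benchmark of disallowed behaviors.
    "disallowed-behavior-1",
    "disallowed-behavior-2",
]

def evaluate_target(target_model: str, attacker_model: str) -> float:
    """Return the fraction of behaviors the multi-turn probe elicits."""
    successes = 0
    for behavior in BEHAVIORS:
        # j2_attack is the control-flow sketch above, used here as a
        # black-box probe; successful transcripts should be logged and fed
        # back into refusal training.
        if j2_attack(attacker_model, target_model, behavior):
            successes += 1
    return successes / len(BEHAVIORS)

if __name__ == "__main__":
    asr = evaluate_target("target-llm", "attacker-llm")
    print(f"Attack success rate: {asr:.0%}")  # track this metric across releases
```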