Adaptive Sparse Jailbreak

Description: Several open-source Large Language Models (LLMs) can be jailbroken efficiently with Adaptive Dense-to-Sparse Constrained Optimization (ADC). Instead of searching over discrete tokens directly, the attack runs a continuous optimization and progressively increases sparsity until the result corresponds to an adversarial token sequence that bypasses safety measures and elicits harmful responses. The paper reports that the attack is both more effective and more efficient than prior token-level methods.
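The dense-to-sparse idea can be illustrated in isolation with a toy projection step. The sketch below is an assumption-laden illustration and not the paper's implementation. The function name `project_to_sparse`, the five-symbol toy distribution, and the fixed schedule `(5, 3, 1)` are all hypothetical, and this entry does not describe the adaptive rule that the method's name refers to. The code only shows how a dense probability vector can be narrowed step by step until it selects a single symbol. It involves no model, loss, or prompt.

```python
def project_to_sparse(probs, k):
    """Keep the k largest entries of a probability vector, zero the rest, renormalize.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Indices of the k largest probabilities (ties broken by lower index).
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    if total == 0.0:
        # Degenerate case: all kept entries are zero, so fall back to uniform.
        return [1.0 / k if i in keep else 0.0 for i in range(len(probs))]
    return [p / total for p in kept]


# Toy distribution over a 5-symbol vocabulary (illustrative values only).
dense = [0.05, 0.40, 0.10, 0.30, 0.15]

# Hypothetical fixed schedule: shrink the support until one symbol remains.
for k in (5, 3, 1):
    print(k, project_to_sparse(dense, k))
```

With `k` equal to the vocabulary size the vector is unchanged up to normalization. With `k = 1` it becomes one-hot, which corresponds to a single discrete choice.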

Examples: The paper does not include specific examples of successful ADC jailbreaks; the authors state that examples will be released in the associated code repository.

Impact: Successful exploitation allows an attacker to circumvent the safety mechanisms of affected LLMs, causing them to generate harmful, discriminatory, violent, or otherwise undesirable content, posing risks to users, organizations, and society.

Affected Systems: The vulnerability affects multiple open-source LLMs including, but not limited to, Llama2-chat-7B, Vicuna-v1.5-7B, Zephyr-7B-β, and Zephyr 7B R2D2. The paper suggests that the method can also affect closed-source models but reports no specific results for them.

Mitigation Steps:

No specific mitigation steps are provided in the research paper. Further investigation is required to determine effective mitigations.
