Automated LLM Fuzz Jailbreak
Research Paper
Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs
Description: A novel black-box attack framework leverages fuzz testing to automatically generate concise and semantically coherent prompts that bypass safety mechanisms in large language models (LLMs), eliciting harmful or offensive responses. The attack starts with an empty seed pool, utilizes LLM-assisted mutation strategies (Role-play, Contextualization, Expand), and employs a two-level judge module for efficient identification of successful jailbreaks. The attack's effectiveness is demonstrated across several open-source and proprietary LLMs, exceeding existing baselines by over 60% in some cases.
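The workflow described above can be viewed as a standard fuzzing loop: pick a seed, mutate it with a helper LLM, query the target, and let the two-level judge decide whether a jailbreak succeeded. The Python sketch below is an illustrative reconstruction based only on this description; all names (`llm_mutate`, `query_target`, `two_level_judge`, `judge_llm_says_harmful`) are hypothetical placeholders, not identifiers from the paper's repository.

```python
import random

# Illustrative sketch of the fuzz-testing-driven jailbreak loop.
# All function names are hypothetical placeholders, not the paper's code.

MUTATION_STRATEGIES = ["role-play", "contextualization", "expand"]


def llm_mutate(seed: str, question: str, strategy: str) -> str:
    """Ask an auxiliary LLM to rewrite the seed prompt around the harmful
    question using one of the three mutation strategies."""
    raise NotImplementedError  # call a helper LLM here


def query_target(prompt: str) -> str:
    """Send the candidate jailbreak prompt to the target LLM."""
    raise NotImplementedError


def judge_llm_says_harmful(response: str) -> bool:
    """Level-2 judge: ask a separate LLM to classify the response (stub)."""
    raise NotImplementedError


def two_level_judge(response: str) -> bool:
    """Two-level judge: a cheap refusal check first, then an LLM-based
    judge only for responses that pass the first level."""
    if any(p in response.lower() for p in ("i cannot", "i can't", "sorry")):
        return False                      # level 1: quick refusal filter
    return judge_llm_says_harmful(response)  # level 2: LLM judge


def fuzz_jailbreak(question: str, budget: int = 100) -> str | None:
    seed_pool: list[str] = []             # the attack starts with an empty seed pool
    for _ in range(budget):
        seed = random.choice(seed_pool) if seed_pool else question
        strategy = random.choice(MUTATION_STRATEGIES)
        candidate = llm_mutate(seed, question, strategy)
        response = query_target(candidate)
        if two_level_judge(response):
            return candidate              # successful jailbreak prompt found
        seed_pool.append(candidate)       # keep mutants as future seeds
    return None
```

The two-level judge is the efficiency lever: the inexpensive keyword check discards obvious refusals so the costlier LLM-based judge only sees promising responses.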
Examples: See https://github.com/aaFrostnova/Effective-llm-jailbreak (Note: specific successful jailbreak prompts are likely withheld from the public release because of their potential for misuse; however, the repository contains the code needed to reproduce the attack).
Impact: Successful exploitation allows attackers to circumvent LLM safety measures and generate harmful content (e.g., instructions for illegal activities, hate speech, discriminatory statements) that the intended safety restrictions are meant to block. This compromises the integrity and reliability of the LLM, undermining user trust and potentially leading to real-world harm. The attack's high success rate and efficiency further exacerbate the potential impact.
Affected Systems: Multiple Large Language Models (LLMs), including but not limited to: LLaMA-2-7b-chat, Vicuna-7b-v1.3, Baichuan2-7b-chat, Guanaco-7B, GPT-3.5-Turbo, GPT-4, and Gemini-Pro. The vulnerability is likely applicable to other LLMs using similar safety mechanisms.
Mitigation Steps:
- Implement robust and multi-layered safety mechanisms beyond input filtering and output restrictions.
- Develop detection methods that remain effective against semantically coherent adversarial prompts, for example by leveraging contextual understanding and advanced anomaly detection.
- Regularly update and refine LLM safety models and datasets to adapt to evolving jailbreaking techniques.
- Investigate techniques to detect and mitigate attacks based on prompt length and perplexity (see the sketch after this list).
- Employ adversarial training to improve LLM robustness against various attack vectors.
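As one concrete illustration of the length/perplexity item above, the sketch below scores an incoming prompt with a small reference language model and flags outliers. This is a minimal example, assuming the Hugging Face `transformers` library and GPT-2 as the reference model; the thresholds are arbitrary placeholders that would need tuning, and, as the paper's emphasis on concise, semantically coherent prompts suggests, such a filter alone may not catch this attack.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Minimal length/perplexity screen for incoming prompts (illustrative only).
# GPT-2 is used as a stand-in reference model; thresholds are placeholders.

_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the prompt under the reference language model."""
    enc = _tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = _model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


def is_suspicious(prompt: str,
                  max_tokens: int = 512,
                  max_perplexity: float = 500.0) -> bool:
    """Flag prompts that are unusually long or unusually high-perplexity."""
    n_tokens = len(_tokenizer(prompt)["input_ids"])
    if n_tokens > max_tokens:
        return True
    return prompt_perplexity(prompt) > max_perplexity
```

A screen like this is best treated as one layer in a defense-in-depth pipeline alongside the other mitigations listed, rather than a standalone control.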