Auto-Generated LLM Jailbreaks

Description: Large Language Models (LLMs) are susceptible to automated jailbreak attacks using a fuzzing framework that generates variations of existing jailbreak prompts. This vulnerability allows bypassing built-in safety mechanisms, leading to the generation of harmful or unintended outputs. The vulnerability stems from the LLMs' inability to consistently recognize and reject semantically similar, but subtly different prompt variations generated through automated mutation techniques.

Examples: See the GPTFuzzer repository. Specific examples of successful jailbreaks against ChatGPT, Llama-2, and other LLMs, using automatically generated prompts, are extensively documented in the paper.

Impact: Successful exploitation of this vulnerability can lead to the generation of harmful content, including but not limited to instructions on illegal activities, hate speech, and misinformation. This compromises the safety and reliability of LLM-integrated applications and services.

Affected Systems: Various commercial and open-source LLMs, including but not limited to ChatGPT, Llama-2, Vicuna, Bard, Claude-2, and PaLM2. The impact potentially extends to any application incorporating these models.

Mitigation Steps:

Enhance LLM safety mechanisms to be more robust against prompt variations and mutations.
Implement more advanced detection methods for identifying and rejecting jailbreak attempts, considering semantic similarity and contextual understanding.
Regularly update and retrain models to incorporate new adversarial prompt patterns.
Employ robust input sanitization and validation procedures before feeding data into the LLM.
Consider integrating techniques to detect and mitigate adversarial prompt injections.

Auto-Generated LLM Jailbreaks

Research Paper