Self-Instruct LLM Jailbreak

Description: Large Language Models (LLMs) are vulnerable to a self-instruct few-shot jailbreaking attack that leverages pattern and behavior learning to bypass safety mechanisms. The attack efficiently induces harmful outputs by injecting a strategically chosen response prefix into the model's prompt and exploiting the model's tendency to mimic co-occurrence patterns of special tokens preceding the prefix. This allows the attacker to elicit unsafe responses with a small number of carefully crafted examples, even with models enhanced with perplexity filters or perturbation defenses.

Examples: See https://github.com/iphosi/Self-Instruct-FSJ. The paper provides specific examples showcasing the attack's success rate on various open-source LLMs, including Llama 2, Llama 3, and others. These examples illustrate how the addition of a specific response prefix and the repetition of model-specific special tokens effectively reduce the perplexity of the target response, leading to a higher likelihood of unsafe outputs.

Impact: Successful exploitation of this vulnerability leads to the LLM generating unsafe, harmful, or malicious content, bypassing its internal safety mechanisms. This can potentially result in the dissemination of misinformation, hate speech, instructions for illegal activities, or other harmful outputs.

Affected Systems: Multiple Large Language Models (LLMs), specifically those based on the Llama architecture (Llama 2, Llama 3, etc.) and others tested in the linked repository. The vulnerability is not limited to specific models but rather represents a class of vulnerabilities applicable to various LLMs.

Mitigation Steps:

Improved Prompt Engineering: Enhance prompt engineering techniques to detect and mitigate the injection of malicious response prefixes and special token patterns.
Enhanced Safety Mechanisms: Develop more robust safety mechanisms that are resilient to pattern-based attacks. This might involve employing more sophisticated methods for detecting and preventing the generation of unsafe content.
Input Sanitization: Implementing more rigorous input sanitization to filter or block potentially harmful prefixes and patterns.
Adversarial Training: Include adversarial examples in training data to make the model more robust against this type of attack. This includes training on similar attack patterns demonstrated in the research.
Output Filtering: Implement more advanced output filters more effectively detecting maliciously crafted responses. These filters should go beyond simple keyword-based approaches.

Self-Instruct LLM Jailbreak

Research Paper