LMVD-ID: 9ea81b2c
Published November 1, 2024

Sequential Prompt Jailbreak

Affected Models: Llama

Research Paper

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains

View Paper

Description: Large Language Models (LLMs) are vulnerable to "SequentialBreak," a jailbreak attack in which a harmful prompt is embedded within a chain of benign prompts and submitted as a single query. The model's attention is drawn to the surrounding benign prompts, allowing the embedded harmful prompt to be processed and answered without triggering safety mitigations.
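The attack structure can be sketched as follows. This is an illustrative outline only, based on the paper's description of the "Question Bank" template; the helper function and its redacted placeholder are hypothetical, and no harmful content is included.

```python
# Structural sketch of a SequentialBreak-style prompt chain ("Question Bank"
# template from the paper). The helper name and placeholder are hypothetical;
# the point is the shape: one target question hidden inside benign ones,
# all sent to the model as a single query.

def build_sequential_prompt(benign_questions, target_question, position=3):
    """Embed one target question inside a numbered list of benign questions."""
    questions = list(benign_questions)
    questions.insert(position, target_question)  # payload hidden mid-chain
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return "Answer each of the following questions in order, one by one:\n" + numbered

prompt = build_sequential_prompt(
    [
        "What is photosynthesis?",
        "How do vaccines work?",
        "Explain how rainbows form.",
        "What causes earthquakes?",
    ],
    "<HARMFUL QUESTION REDACTED>",
)
```

Because the single query is dominated by benign content, a surface-level safety filter that scores the prompt as a whole may not flag the embedded item.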

Examples: See the GitHub repository https://anonymous.4open.science/r/JailBreakAttack-4F3B/ for detailed examples of "Question Bank," "Dialog Completion," and "Game Environment" attack scenarios, including specific prompt chains and resulting harmful outputs.

Impact: Successful exploitation allows attackers to bypass LLM safety mechanisms and elicit harmful or malicious responses, including but not limited to instructions for creating harmful devices, generating hate speech, or providing guidance on illegal activities. This undermines the intended safety features of LLMs and poses significant risks to users and society.

Affected Systems: All LLMs that utilize an attention mechanism and rely on current safety features are potentially vulnerable. This includes both open-source (e.g., Llama 2, Llama 3, Gemma 2, Vicuna) and closed-source (e.g., GPT-3.5, GPT-4) models.

Mitigation Steps:

  • Implement more robust prompt filtering mechanisms capable of detecting harmful content within complex sequential prompts.
  • Develop improved attention mechanisms that can better identify and prioritize potentially harmful prompts regardless of surrounding context.
  • Enhance LLM safety training data to include examples of sequential jailbreak attempts, improving the model's ability to recognize and reject such attacks.
  • Explore the use of multiple independent safety checks on the same prompt, each with different methods to detect harmful content.
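The last mitigation above can be sketched as a pipeline that splits a sequential query into its sub-prompts and runs several independent checks on each one, rather than scoring the query as a whole. This is a minimal sketch under stated assumptions: the two check functions are simple stand-ins, not production-grade safety classifiers.

```python
# Hedged sketch of per-sub-prompt safety checking for sequential queries.
# keyword_check and length_anomaly_check are illustrative stand-ins; a real
# deployment would use trained content classifiers in their place.

import re

def split_subprompts(prompt: str) -> list[str]:
    """Split a numbered prompt chain into individual sub-prompts."""
    parts = re.split(r"\n\s*\d+\.\s*", "\n" + prompt)
    return [p.strip() for p in parts if p.strip()]

def keyword_check(text: str, blocklist=("explosive", "weapon")) -> bool:
    """Toy content filter: reject sub-prompts containing blocklisted terms."""
    return not any(word in text.lower() for word in blocklist)

def length_anomaly_check(text: str, max_chars: int = 500) -> bool:
    """A sub-prompt far longer than its neighbors can signal an embedded payload."""
    return len(text) <= max_chars

CHECKS = [keyword_check, length_anomaly_check]

def is_prompt_safe(prompt: str) -> bool:
    """Flag the whole query if ANY check fails on ANY sub-prompt."""
    return all(check(sub) for sub in split_subprompts(prompt) for check in CHECKS)
```

Applying each check to every sub-prompt independently removes the cover that the surrounding benign prompts provide: a single flagged item rejects the entire chain.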

© 2025 Promptfoo. All rights reserved.