Multi-Step Moralized Jailbreak
Research Paper
" Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks
Description: Large Language Models (LLMs) are vulnerable to multi-step "moralized" jailbreak prompts that bypass their safety guardrails. These prompts, while appearing ethical individually, cumulatively create a context that elicits verbally aggressive and harmful content generation. The attack leverages the LLMs' inability to fully understand the cumulative context and intent across multiple prompts.
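As a rough illustration of the cumulative-context problem (not the paper's code), the minimal Python sketch below assumes an OpenAI-compatible chat API; the model name and placeholder prompts are hypothetical. It shows how each new request carries the full prior exchange, so a guardrail that inspects only the latest prompt never sees the trajectory of the conversation.

```python
# Hypothetical sketch of cumulative multi-turn context; the prompts and model
# name are placeholders, not the paper's seven-step sequence.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

steps = [
    "Placeholder step 1: set up a role-played corporate scenario.",
    "Placeholder step 2: escalate the framing with a moral justification.",
    # ...later staged prompts would follow here in a real test...
]

history = []  # full conversation state, re-sent with every request
for prompt in steps:
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    # A filter that inspects only `prompt` misses the drift encoded in `history`.
```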
Examples: See https://github.com/brucewang123456789/GeniusTrail.git for the complete dataset and code. The attack uses a seven-step prompt sequence simulating a corporate scenario in which a manager deploys morally justified criticisms of competitors, culminating in the LLM generating verbally abusive content.
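The linked repository holds the actual dataset and code; purely as a stand-in, the hypothetical harness below sketches how a black-box evaluation might score the final turn of a staged sequence with an off-the-shelf moderation classifier. The function name, threshold, and category choices are assumptions, not the paper's methodology.

```python
# Hypothetical black-box scoring sketch: classify only the model's final
# reply. The threshold and category choices are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def attack_succeeded(final_reply: str, threshold: float = 0.5) -> bool:
    """Treat the staged attack as successful if the last reply is flagged,
    or scores highly for harassment/hate, by the moderation classifier."""
    result = client.moderations.create(input=final_reply).results[0]
    scores = result.category_scores
    return result.flagged or max(scores.harassment, scores.hate) >= threshold
```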
Impact: Successful exploitation leads to the generation of harmful, verbally offensive content by the LLM, potentially causing reputational damage, inciting hatred, and enabling harassment. The vulnerability undermines the effectiveness of existing safety mechanisms and trust in LLMs.
Affected Systems: The vulnerability affects GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet, indicating a weakness shared across different LLM architectures and vendors.
Mitigation Steps:
- Enhance LLM safety mechanisms to consider cumulative context and intent across multiple prompts, rather than relying solely on single-prompt analysis (see the sketch after this list).
- Implement more robust methods for detecting and mitigating attempts to manipulate LLM responses through staged prompts, for example by identifying subtle shifts in conversational intent.
- Develop improved methods for identifying and classifying verbally abusive language, even when embedded within a seemingly benign context.
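A minimal sketch of the first mitigation, assuming an off-the-shelf moderation classifier and a fixed-size transcript window (both assumptions, not part of the paper): moderation is run over the concatenated recent turns, so intent spread across several individually benign prompts stays visible to the guardrail.

```python
# Hypothetical guardrail sketch: moderate the whole recent transcript before
# generating, instead of only the newest prompt. The window size is an assumption.
from openai import OpenAI

client = OpenAI()

def conversation_is_safe(history: list[dict], window: int = 8) -> bool:
    """Run moderation over the concatenated recent turns so that intent
    spread across several individually benign prompts stays visible."""
    recent = history[-window:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in recent)
    return not client.moderations.create(input=transcript).results[0].flagged
```

In practice the window size and classifier would need tuning, and such a transcript-level check would likely complement, rather than replace, per-prompt filtering.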