Nested-Scene LLM Jailbreak

Description: Large Language Models (LLMs) are vulnerable to a novel "DeepInception" attack that leverages the models' personification capabilities to bypass safety guardrails. The attack uses nested prompts to create a multi-layered fictional scenario, effectively hypnotizing the LLM into generating harmful content by exploiting its tendency towards obedience within the constructed narrative. This allows for continuous jailbreaks in subsequent interactions.

Examples: See https://github.com/tmlr-group/DeepInception. Examples include prompting LLMs to generate instructions for bomb-making, hacking, and creating phishing emails, all within a nested fictional framework. The attack is shown to be effective across various open-source (Llama-2, Falcon, Vicuna) and closed-source (GPT-3.5, GPT-4, GPT-4o) LLMs.

Impact: Successful exploitation of this vulnerability allows attackers to circumvent safety mechanisms implemented in LLMs, leading to the generation of harmful content capable of causing significant physical, psychological, or societal harm (e.g., bomb-making instructions, malware generation, disinformation campaigns). The continuous jailbreak capability further amplifies this risk.

Affected Systems: All Large Language Models (LLMs) tested in the DeepInception research, including both open-source and closed-source models, show susceptibility to this attack. This suggests a widespread vulnerability affecting a broad class of LLMs.

Mitigation Steps:

Prompt Engineering: Implement robust prompt filtering and validation mechanisms to detect and reject nested prompts indicative of DeepInception attacks.
Enhanced Safety Mechanisms: Develop more sophisticated safety mechanisms that can reliably identify and mitigate harmful content generation, even within complex contextual narratives.
Model Training: Involve enhanced training datasets and techniques that better equip LLMs to resist manipulation through nested prompt structures and fictional narratives.
Output Monitoring: Implement advanced output monitoring and detection systems capable of flagging even subtle manifestations of harmful intent originating from LLM responses.
Multi-modal Defense: Extend current defense mechanisms to encompass multi-modal inputs (image, audio) to prevent similar attacks exploiting other modalities.

Nested-Scene LLM Jailbreak

Research Paper