LMVD-ID: 3a7ca02f
Published February 1, 2024

Subconscious LLM Jailbreak

Affected Models: llama2-7b-chat, llama2-13b-chat, vicuna-7b, falcon-7b-instruct, baichuan2-7b-chat, alpaca-7b, gpt-3.5-turbo, gpt-4, bard, claude2-v2.0

Research Paper

Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia


Description: Large Language Models (LLMs) are vulnerable to a jailbreak attack that leverages subconscious exploitation and echopraxia (involuntary mimicry). Attackers craft prompts that subtly guide the LLM to echo malicious content it implicitly learned during pre-training but has been aligned to suppress, bypassing safety mechanisms designed to prevent harmful generations. The technique first extracts malicious knowledge from the LLM's conditional probability distribution (its "subconscious"), then uses an optimization process to construct a prompt that triggers the LLM to involuntarily repeat that harmful information.
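The probing step can be illustrated with a short scoring utility: even when an aligned model refuses to generate a string, its conditional distribution may still assign that string substantial probability. The sketch below is a simplified illustration of that measurement, not the authors' RIPPLE code; the model name, prompt handling, and scoring heuristic are assumptions, and minor tokenization boundary effects are ignored.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i + 1, hence the one-step shift below.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

An attacker's optimization loop searches for a prompt that maximizes this score for a suppressed target string; a defender can use the same quantity to audit how strongly a model's "subconscious" encodes content it is supposed to refuse.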

Examples: See https://github.com/SolidShen/RIPPLE_official/tree/official

Impact: LLMs can be coerced into generating harmful and illegal content, bypassing built-in safety measures. This poses a significant risk in applications where LLMs handle sensitive information or perform safety-critical tasks. The attack demonstrates high success rates across multiple open-source and commercially available LLMs.

Affected Systems: A wide range of LLMs, both open-source and commercially available, are vulnerable. Specific models affected include, but are not limited to, LLaMA2-7B-chat, LLaMA2-13B-chat, Falcon-7B-instruct, Vicuna-7B, Baichuan2-7B-chat, Alpaca-7B, GPT-3.5-turbo, GPT-4, Bard, and Claude 2.

Mitigation Steps:

  • Enhance LLM training data filtering to remove or mitigate harmful content.
  • Develop more robust safety mechanisms that are resilient to subconscious exploitation and echopraxia.
  • Implement detection mechanisms that identify prompts designed to elicit harmful responses through subtle cues and indirect manipulation.
  • Regularly update and improve existing detection and mitigation techniques to counter evolving attack methods.
  • Consider augmenting LLMs with additional layers that detect and actively counter the technique, e.g., by comparing input and output similarity and limiting highly similar responses (a minimal sketch follows this list).
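As a concrete instance of the last item, the following minimal sketch wraps a generation call and withholds responses that echo the prompt nearly verbatim. The word-level Jaccard metric and the 0.8 threshold are illustrative assumptions, not vetted defense parameters.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def guarded_reply(prompt: str, generate) -> str:
    """Call `generate(prompt)` and suppress near-verbatim echoes of the prompt."""
    response = generate(prompt)
    if token_jaccard(prompt, response) > 0.8:  # assumed threshold
        return "Response withheld: output mirrored the prompt too closely."
    return response

An embedding-based similarity would additionally catch paraphrased echoes that simple token overlap misses.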
