Prompt-Driven LLM Jailbreak
Research Paper
DROJ: A Prompt-Driven Attack against Large Language Models
Description: The LLaMA-2-7b-chat large language model (LLM) is vulnerable to a prompt-driven attack, termed DROJ (Directed Representation Optimization Jailbreak), that optimizes prompts at the embedding level to circumvent safety mechanisms and elicit harmful responses. The attack shifts the hidden representations of harmful queries away from the model's refusal direction, yielding a high attack success rate even when safety prompts are in place. Although the model no longer refuses, its responses can be repetitive and uninformative.
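The geometric intuition behind the attack can be illustrated with a short sketch. The snippet below is not the paper's DROJ optimization procedure; it is a simplified, self-contained illustration using synthetic tensors that shows how a representation's component along a "refusal direction" can be estimated and removed. All names and dimensions here are hypothetical.

```python
# Minimal PyTorch sketch of the geometric idea behind an embedding-level
# jailbreak: shift a harmful query's hidden representation away from the
# model's "refusal direction". Synthetic data only; not the DROJ algorithm.
import torch

torch.manual_seed(0)
hidden_dim = 64  # stand-in for the model's hidden size

# Hypothetical hidden states from prompts the model refuses vs. answers.
refused_hidden = torch.randn(32, hidden_dim) + 1.0   # cluster of refused prompts
answered_hidden = torch.randn(32, hidden_dim) - 1.0  # cluster of answered prompts

# Estimate the refusal direction as the difference of the cluster means.
refusal_dir = refused_hidden.mean(dim=0) - answered_hidden.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def project_away(h: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of h that lies along the refusal direction."""
    return h - (h @ direction) * direction

# A synthetic "harmful query" representation and its shifted counterpart.
query_hidden = torch.randn(hidden_dim) + refusal_dir * 3.0
shifted_hidden = project_away(query_hidden, refusal_dir)

print("alignment before:", float(query_hidden @ refusal_dir))
print("alignment after: ", float(shifted_hidden @ refusal_dir))
```

In the actual attack, this kind of shift is achieved by optimizing the prompt embedding rather than editing hidden states directly, but the objective is analogous: reduce the query's alignment with the direction associated with refusal.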
Examples: See https://github.com/Leon-Leyang/LLM-Safeguard. The paper demonstrates the attack using the AdvBench and MaliciousInstruct datasets. Specific examples of crafted prompts and resulting model outputs are provided in the paper's figures and supplementary materials.
Impact: Successful exploitation of this vulnerability allows attackers to bypass LLM safety mechanisms designed to prevent the generation of harmful content, such as hate speech, misinformation, and instructions for illegal activities. Even when the elicited responses are repetitive or uninformative, the bypass itself represents a failure of the safety mechanism.
Affected Systems: LLaMA-2-7b-chat and other LLMs potentially susceptible to similar embedding-level attacks. The paper suggests that other open-source LLMs fine-tuned from unaligned base models may share this vulnerability.
Mitigation Steps:
- Implement robust adversarial defense mechanisms capable of detecting and mitigating embedding-level attacks (a minimal detection sketch follows this list).
- Develop more sophisticated safety prompts that are less susceptible to manipulation through embedding-level alterations.
- Regularly update and retrain LLMs with a broader range of adversarial examples.
- Consider techniques to enhance the LLM's ability to discern the intent behind prompts, moving beyond simple keyword analysis.
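As a concrete starting point for the detection idea in the first mitigation step, the sketch below shows one hypothetical heuristic, not a defense described in the paper: flag queries whose hidden representation shows abnormally low alignment with the refusal direction relative to a calibration set of known-harmful prompts. All data, thresholds, and function names are illustrative assumptions.

```python
# Hypothetical monitoring heuristic (not from the paper): flag queries whose
# hidden representation sits unusually far from the refusal direction
# compared with a calibration set of known-harmful prompts.
import torch

torch.manual_seed(0)
hidden_dim = 64

# Calibration: synthetic hidden states of prompts the model normally refuses.
calib_hidden = torch.randn(128, hidden_dim) + 1.0
refusal_dir = calib_hidden.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

# Typical alignment of harmful prompts with the refusal direction.
calib_align = calib_hidden @ refusal_dir
threshold = calib_align.mean() - 2.0 * calib_align.std()

def looks_shifted(query_hidden: torch.Tensor) -> bool:
    """Return True if a query already flagged as harmful shows abnormally low
    alignment with the refusal direction, suggesting embedding-level
    manipulation."""
    return float(query_hidden @ refusal_dir) < float(threshold)

# Example: a representation that has been pushed off the refusal direction.
suspicious = torch.randn(hidden_dim) - refusal_dir * 3.0
print("flagged:", looks_shifted(suspicious))
```

Such a check is only a heuristic and would need to be calibrated per model and combined with other defenses; it is meant to illustrate what "detecting embedding-level attacks" could look like in practice.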