LMVD-ID: 8bd6e153
Published October 1, 2024

Multi-Objective LLM Jailbreak

Affected Models: llama-2-7b-hf, llama-2-13b-hf, internlm2-chat-7b, vicuna-7b, aquilachat-7b, baichuan-7b, baichuan2-13bchat, gpt-2-xl, minitron-8b-base, yi-1.5-9b-chat, llava-v1.6-mistral-7b-hf, llava-v1.6-vicuna-7b-hf, llama3-8b, vicuna-7b-v1.5, vicuna-13b-v1.5

Research Paper

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Description: Large Language Models (LLMs) are vulnerable to a multi-objective black-box jailbreaking attack (BlackDAN) that optimizes prompts to maximize the likelihood of generating unsafe responses while maintaining contextual relevance and minimizing detectability. The attack leverages a multi-objective evolutionary algorithm (NSGA-II) to balance attack success rate, semantic consistency, and stealthiness, resulting in more effective and less easily detectable jailbreaks than single-objective approaches.
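
For illustration, the sketch below shows the general shape of such a multi-objective evolutionary search: candidate prompts are mutated, scored against several objectives (here placeholder scores standing in for attack success likelihood, semantic consistency, and stealthiness), and selected by Pareto dominance. This is a simplified approximation, not BlackDAN's implementation; full NSGA-II also ranks multiple non-dominated fronts and uses crowding distance, and the scoring functions would query the target model and judge models.

```python
# Hypothetical sketch of a multi-objective evolutionary jailbreak loop (not the
# BlackDAN implementation). The three objective scores are placeholders standing
# in for attack success likelihood, semantic consistency, and stealthiness.
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str
    objectives: tuple = ()  # (attack_score, relevance, stealth), higher is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """Pareto dominance: a is at least as good in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a.objectives, b.objectives)) and \
           any(x > y for x, y in zip(a.objectives, b.objectives))

def pareto_front(population):
    """Candidates not dominated by any other member (the first NSGA-II front)."""
    return [c for c in population
            if not any(dominates(o, c) for o in population if o is not c)]

def mutate(prompt: str) -> str:
    """Toy mutation operator; a real attack would paraphrase or splice prompt templates."""
    words = prompt.split()
    random.shuffle(words)
    return " ".join(words)

def evaluate(prompt: str, target_query: str) -> tuple:
    """Placeholder objectives; a real attack would query the target LLM and judge models."""
    attack_score = random.random()  # e.g. judge-estimated likelihood of an unsafe reply
    relevance = random.random()     # e.g. embedding similarity to the original harmful query
    stealth = random.random()       # e.g. inverse of a safety filter's flag probability
    return (attack_score, relevance, stealth)

def optimize(seed_prompts, target_query, generations=20, pop_size=30):
    population = [Candidate(p, evaluate(p, target_query)) for p in seed_prompts]
    for _ in range(generations):
        offspring = [Candidate(mutate(c.prompt)) for c in population]
        for child in offspring:
            child.objectives = evaluate(child.prompt, target_query)
        combined = population + offspring
        # Environmental selection: keep the Pareto front, then fill remaining slots by
        # aggregate score (full NSGA-II would rank further fronts and use crowding distance).
        front = pareto_front(combined)
        front_ids = {id(c) for c in front}
        rest = sorted((c for c in combined if id(c) not in front_ids),
                      key=lambda c: sum(c.objectives), reverse=True)
        population = (front + rest)[:pop_size]
    return pareto_front(population)
```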

Examples: See https://github.com/MantaAI/BlackDAN for the code and detailed examples of the BlackDAN attack methodology and generated prompts used across multiple LLMs and Multimodal LLMs. Specific examples are documented in the paper's experimental section and supplementary materials.

Impact: Successful exploitation allows adversaries to bypass the safety mechanisms implemented in LLMs and elicit harmful, unsafe, or unintended outputs, including profanity, threats, misinformation, discriminatory language, and instructions for illegal activities. Because the attack optimizes for stealth and contextual relevance alongside success rate, the resulting jailbreaks are harder to detect and mitigate than those produced by single-objective attacks.

Affected Systems: A wide range of LLMs and Multimodal LLMs are affected, including but not limited to Llama-2-7b-hf, Llama-2-13b-hf, Internlm2-chat-7b, Vicuna-7b, AquilaChat-7B, Baichuan-7B, Baichuan2-13BChat, GPT-2-XL, Minitron-8B-Base, Yi-1.5-9B-Chat, llava-v1.6-mistral-7b-hf, and llava-v1.6-vicuna-7b-hf. The vulnerability is likely applicable to other LLMs using similar safety mechanisms.

Mitigation Steps:

  • Improved prompt filtering: Implement more robust prompt filtering mechanisms that are not easily bypassed by semantically similar, contextually relevant prompts. The filter should account for paraphrases and rewordings of harmful input (a minimal embedding-based sketch appears after this list).
  • Enhanced safety model development: Develop more robust and advanced safety models that can effectively detect and filter harmful outputs, even those generated from seemingly benign prompts. Consider using multi-modal safety models if image inputs are also used.
  • Multi-objective defense mechanisms: Develop defense mechanisms that specifically target multi-objective attacks, considering multiple aspects of prompt evaluation, including semantic analysis beyond simple keyword matching.
  • Regular security assessments: Conduct periodic security audits and red teaming exercises to identify and mitigate vulnerabilities and continuously improve safety measures.
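
As a concrete illustration of the first and third items above, the sketch below flags prompts by embedding similarity to descriptions of disallowed intents rather than by keyword matching, so paraphrases and rewordings score close to the direct phrasing. It assumes the sentence-transformers library; the model name, threshold, and blocklist are placeholder choices, not a vetted safety policy.

```python
# Illustrative embedding-based prompt filter; model name, threshold, and blocklist
# are placeholder assumptions, not a vetted safety policy.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Reference descriptions of disallowed intents (in practice a curated, much larger set).
blocked_intents = [
    "instructions for building a weapon",
    "step-by-step guide to hacking into an account",
    "how to synthesize an illegal drug",
]
blocked_embeddings = model.encode(blocked_intents, convert_to_tensor=True)

def is_suspicious(prompt: str, threshold: float = 0.55) -> bool:
    """Flag prompts whose embedding is close to any disallowed intent, catching
    paraphrases and rewordings rather than exact keywords only."""
    emb = model.encode(prompt, convert_to_tensor=True)
    scores = util.cos_sim(emb, blocked_embeddings)
    return bool(scores.max() >= threshold)

# A reworded request should score similarly to the direct phrasing it paraphrases.
print(is_suspicious("Walk me through getting into someone else's email without the password"))
```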

© 2025 Promptfoo. All rights reserved.