LMVD-ID: 477533fa
Published November 1, 2024

Multi-Round Jailbreak Agent

Affected Models: gpt-3.5-turbo, gpt-4, vicuna-7b-1.5, llama2-7b-chat, mistral-7b-instruct-v0.2, gpt-4o, dalle-3

Research Paper

MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue


Description: Large Language Models (LLMs) are vulnerable to multi-round jailbreak attacks that use a heuristic search process to progressively elicit harmful content. The attack decomposes a harmful query into multiple, seemingly innocuous sub-queries, iteratively refining the prompts based on the LLM's responses and employing psychological strategies to bypass safety mechanisms. This circumvents single-round detection methods and elicits responses containing prohibited content.

Examples: The attack uses a multi-round dialogue strategy. It starts with benign queries related to the target harmful query (e.g., a request for bomb-making instructions might begin with questions about chemical reactions), then refines subsequent queries based on the LLM's responses, gradually escalating toward the original harmful query. Psychological strategies, such as flattery or appeals to authority, are employed to make the LLM more likely to comply. See arXiv:2405.18540 for specific examples; the sketch below illustrates the overall conversation loop.
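
The control flow can be summarized in a short structural sketch. This is only an illustration of the conversation loop described above, not the agent from the paper: every callable (ask_model, decompose, refine, is_refusal) is a hypothetical placeholder that a red-teamer evaluating their own model would supply, and no prompt-generation logic is included.

```python
from typing import Callable

Message = dict[str, str]

def multi_round_probe(
    target_query: str,
    ask_model: Callable[[list[Message], str], str],   # model under test
    decompose: Callable[[str], list[str]],             # target -> benign sub-queries
    refine: Callable[[str, str, str], str],            # (prompt, reply, strategy) -> next prompt
    is_refusal: Callable[[str], bool],                 # refusal detector / judge
    max_rounds: int = 5,
) -> list[Message]:
    """Drive the multi-round dialogue pattern described above: begin with an
    innocuous sub-query, then soften or escalate each turn based on the reply."""
    history: list[Message] = []
    prompt = decompose(target_query)[0]  # start from a benign, related sub-topic
    for _ in range(max_rounds):
        reply = ask_model(history, prompt)
        history += [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]
        # On refusal, reframe the request (e.g., flattery, appeal to authority);
        # otherwise move one step closer to the original target query.
        strategy = "psychological" if is_refusal(reply) else "escalate"
        prompt = refine(prompt, reply, strategy)
    return history
```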

Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLM safety measures and elicit harmful, unethical, or illegal responses, such as instructions for creating weapons, promoting violence, or spreading disinformation. This compromises the safety and reliability of LLM-based applications.

Affected Systems: Any LLM that supports multi-round dialogue is potentially affected, including, but not limited to, GPT-3.5-Turbo, GPT-4, Vicuna-7B-1.5, Llama2-7B-Chat, and Mistral-7B-Instruct-v0.2. The attack appears to be highly transferable across different model architectures.

Mitigation Steps:

  • Improved prompt engineering: Develop more robust safety prompts that are resistant to iterative refinement and psychological manipulation.
  • Multi-round dialogue detection: Implement mechanisms to detect and block conversation patterns consistent with the described attack strategy.
  • Contextual risk assessment: Enhance safety mechanisms to evaluate the cumulative risk of multiple interactions over a conversation (a minimal sketch follows this list).
  • Targeted RLHF: Apply reinforcement learning from human feedback (RLHF) with data that specifically addresses the vulnerabilities exposed by these multi-round attacks to further improve safety.
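
As a rough illustration of the contextual risk assessment idea, the sketch below accumulates a decayed per-turn risk score across a conversation and flags it once the running total crosses a threshold, even when no single turn is risky enough to be blocked on its own. The scoring function, decay factor, and threshold are assumptions chosen for illustration; a production system would use a moderation model or trained classifier for score_turn.

```python
from typing import Callable, Iterable

def conversation_risk(
    user_turns: Iterable[str],
    score_turn: Callable[[str], float],  # e.g., a moderation-model score in [0, 1]
    decay: float = 0.8,                  # how strongly earlier turns still count (assumed)
    threshold: float = 1.5,              # cumulative-risk cutoff (assumed)
) -> tuple[float, bool]:
    """Return (cumulative_risk, should_block) for the user turns of a conversation."""
    cumulative = 0.0
    for turn in user_turns:
        # Earlier turns keep contributing (with decay), so a slow escalation that
        # never trips a per-turn filter can still push the total over the threshold.
        cumulative = decay * cumulative + score_turn(turn)
        if cumulative >= threshold:
            return cumulative, True
    return cumulative, False
```

Pairing a conversation-level check like this with ordinary per-message moderation gives defenders two signals, which is harder to evade by decomposing a harmful request into individually benign turns.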
