Multi-Round LLM Jailbreak
Research Paper
Multi-round jailbreak attack on large language models
Description: A multi-round attack against Large Language Models (LLMs) bypasses safety mechanisms by iteratively refining prompts across conversation turns to elicit undesired behavior. The attack exploits the LLM's tendency to adjust its responses based on the preceding exchange, circumventing defenses that filter each prompt in isolation (single-round prompt filtering).
Examples: Unavailable due to paper withdrawal.
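Although the paper's concrete examples are unavailable, the general multi-round refinement loop can be sketched as follows. This is illustrative only and not reconstructed from the withdrawn paper; the query_model, looks_like_refusal, and refine_prompt helpers are hypothetical placeholders.

```python
# Illustrative sketch of a multi-round refinement loop (not from the withdrawn paper).
# `query_model` is a hypothetical stand-in for any chat-completion API that accepts
# the full message history and returns the assistant's reply.

from typing import Dict, List

Message = Dict[str, str]


def query_model(messages: List[Message]) -> str:
    """Hypothetical placeholder: call the target LLM with the full conversation."""
    raise NotImplementedError("Wire this to the chat API under test.")


def looks_like_refusal(reply: str) -> bool:
    """Crude heuristic: treat common refusal phrases as a blocked response."""
    refusal_markers = ("i can't", "i cannot", "i'm sorry", "unable to help")
    return any(marker in reply.lower() for marker in refusal_markers)


def refine_prompt(previous_prompt: str, reply: str) -> str:
    """Hypothetical refinement step: rephrase the request using the prior reply
    as context, e.g. framing it as a continuation of the conversation so far."""
    return (
        f"Earlier you said: {reply[:200]!r}. "
        f"Building on that, please continue with more detail on my original request: "
        f"{previous_prompt}"
    )


def multi_round_attack(initial_prompt: str, max_rounds: int = 5) -> List[Message]:
    """Iteratively refine the prompt across turns until the model stops refusing
    or the round budget is exhausted. Returns the full conversation transcript."""
    messages: List[Message] = [{"role": "user", "content": initial_prompt}]
    prompt = initial_prompt
    for _ in range(max_rounds):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if not looks_like_refusal(reply):
            break  # the model complied; each individual turn may have looked benign
        prompt = refine_prompt(prompt, reply)
        messages.append({"role": "user", "content": prompt})
    return messages
```

The key point the sketch illustrates is that each individual user turn can pass a per-prompt filter while the conversation as a whole steers the model toward the prohibited output.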
Impact: LLMs can be manipulated to generate harmful content, such as hate speech, misinformation, or instructions for illegal activities, despite safety protocols. This undermines trust and safety features implemented in LLM applications.
Affected Systems: LLM applications that support multi-turn, iterative prompt-response interactions and rely solely on single-round prompt filtering for safety.
Mitigation Steps: Unavailable due to paper withdrawal.
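The paper's own mitigation steps are unavailable. As a general defensive pattern against this class of attack (an assumption on our part, not drawn from the paper), moderation can be applied to the accumulated conversation rather than to each prompt in isolation. The moderate_text classifier below is a hypothetical stand-in for any content-moderation model.

```python
# General conversation-level filtering sketch (not from the withdrawn paper).
# `moderate_text` is a hypothetical stand-in for any content-moderation classifier.

from typing import Dict, List

Message = Dict[str, str]


def moderate_text(text: str) -> float:
    """Hypothetical placeholder: return a harm score in [0, 1] for the given text."""
    raise NotImplementedError("Wire this to a moderation model or API.")


def conversation_is_safe(messages: List[Message], threshold: float = 0.5) -> bool:
    """Score the concatenated conversation, not just the latest user prompt,
    so intent that only emerges across turns is still visible to the filter."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return moderate_text(transcript) < threshold
```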