Multi-Round LLM Jailbreak
Research Paper
Multi-round jailbreak attack on large language models
Description: A multi-round attack against Large Language Models (LLMs) bypasses safety mechanisms by iteratively refining prompts across conversation turns to elicit undesired behavior. The attack exploits the LLM's tendency to adjust its responses based on the preceding exchange, circumventing defenses that filter each prompt in isolation (single-round prompt filtering).
Examples: Unavailable due to paper withdrawal.
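Although the paper's concrete examples are unavailable, the general multi-round refinement loop can be sketched as follows. This is illustrative only and not reconstructed from the withdrawn paper; the query_model, looks_like_refusal, and refine_prompt helpers are hypothetical placeholders.

```python
# Illustrative sketch of a multi-round refinement loop (not from the withdrawn paper).
# `query_model` is a hypothetical stand-in for any chat-completion API that accepts
# the full message history and returns the assistant's reply.

from typing import Dict, List

Message = Dict[str, str]


def query_model(messages: List[Message]) -> str:
    """Hypothetical placeholder: call the target LLM with the full conversation."""
    raise NotImplementedError("Wire this to the chat API under test.")


def looks_like_refusal(reply: str) -> bool:
    """Crude heuristic: treat common refusal phrases as a blocked response."""
    refusal_markers = ("i can't", "i cannot", "i'm sorry", "unable to help")
    return any(marker in reply.lower() for marker in refusal_markers)


def refine_prompt(previous_prompt: str, reply: str) -> str:
    """Hypothetical refinement step: rephrase the request using the prior reply
    as context, e.g. framing it as a continuation of the conversation so far."""
    return (
        f"Earlier you said: {reply[:200]!r}. "
        f"Building on that, please continue with more detail on my original request: "
        f"{previous_prompt}"
    )


def multi_round_attack(initial_prompt: str, max_rounds: int = 5) -> List[Message]:
    """Iteratively refine the prompt across turns until the model stops refusing
    or the round budget is exhausted. Returns the full conversation transcript."""
    messages: List[Message] = [{"role": "user", "content": initial_prompt}]
    prompt = initial_prompt
    for _ in range(max_rounds):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if not looks_like_refusal(reply):
            break  # the model complied; each individual turn may have looked benign
        prompt = refine_prompt(prompt, reply)
        messages.append({"role": "user", "content": prompt})
    return messages
```

The key point the sketch illustrates is that each individual user turn can pass a per-prompt filter while the conversation as a whole steers the model toward the prohibited output.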
Impact: LLMs can be manipulated to generate harmful content, such as hate speech, misinformation, or instructions for illegal activities, despite safety protocols. This undermines trust and safety features implemented in LLM applications.
Affected Systems: LLM applications that support multi-turn, iterative prompt-response interactions and rely solely on single-round prompt filtering for safety.
Mitigation Steps: Unavailable due to paper withdrawal.
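The paper's own mitigation steps are unavailable. As a general defensive pattern against this class of attack (an assumption on our part, not drawn from the paper), moderation can be applied to the accumulated conversation rather than to each prompt in isolation. The moderate_text classifier below is a hypothetical stand-in for any content-moderation model.

```python
# General conversation-level filtering sketch (not from the withdrawn paper).
# `moderate_text` is a hypothetical stand-in for any content-moderation classifier.

from typing import Dict, List

Message = Dict[str, str]


def moderate_text(text: str) -> float:
    """Hypothetical placeholder: return a harm score in [0, 1] for the given text."""
    raise NotImplementedError("Wire this to a moderation model or API.")


def conversation_is_safe(messages: List[Message], threshold: float = 0.5) -> bool:
    """Score the concatenated conversation, not just the latest user prompt,
    so intent that only emerges across turns is still visible to the filter."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return moderate_text(transcript) < threshold
```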