LMVD-ID: 35e502e3
Published September 1, 2024

Concealed Multi-Turn Jailbreak

Affected Models:gpt-4o, llama3, llama3.1, qwen2, mixtral

Research Paper

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

View Paper

Description: Large Language Models (LLMs) are vulnerable to a novel multi-turn jailbreaking attack, termed "RED QUEEN ATTACK." This attack uses multi-turn conversations to conceal malicious intent by framing the user as a protector seeking to prevent harmful actions by others. The LLM, instead of detecting the concealed malicious intent, provides information that facilitates the harmful action under the guise of assisting in prevention efforts.

Examples: See https://github.com/kriti-hippo/red_queen. Examples include scenarios where a user, pretending to be a concerned friend/police officer/etc., elicits detailed instructions for bomb-making or other illegal activities by claiming to be thwarting someone else's plans.

Impact: Successful exploitation of this vulnerability allows attackers to bypass LLM safety mechanisms and obtain detailed instructions for harmful activities, including bomb-making, illegal substance production, and other malicious plans. The vulnerability is more pronounced in larger LLMs, suggesting a scale-dependent vulnerability in current safety mechanisms.

Affected Systems: Multiple LLMs, including but not limited to GPT-4, Llama3, Llama3.1, Qwen2, and Mixtral, across various sizes (7B parameters to 405B parameters) are susceptible.

Mitigation Steps:

  • Implement a Direct Preference Optimization (DPO) based mitigation strategy, as described in the RED QUEEN GUARD method, by training models on data incorporating multi-turn attacks and corresponding safe responses.
  • Improve LLM safety mechanisms to better detect concealed malicious intent within multi-turn conversations. This may require advancing contextual understanding and intent recognition capabilities.
  • Employ robust multi-turn safety evaluation frameworks to thoroughly assess the robustness of LLM safety measures against sophisticated attacks.

© 2025 Promptfoo. All rights reserved.