LMVD-ID: 7a2aaaf6
Published January 1, 2025

Multi-Turn LLM Jailbreak

Affected Models:llama-3-8b, mistral-7b, gpt-4o, gemini-1.5-pro, claude-3.5, qwen2.5-7b

Research Paper

Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors

View Paper

Description: Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that skillfully decompose malicious requests into seemingly benign interactions, progressively guiding the dialogue towards harmful outputs. This vulnerability allows attackers to bypass LLM safety mechanisms through a series of strategically crafted prompts, exploiting the model's iterative response generation. The attack's success hinges on dynamically adapting each prompt based on the LLM's previous responses, making simple keyword-based detection ineffective.

Examples: See https://github.com/YiyiyiZhao/siren

Impact: Successful exploitation of this vulnerability leads to the generation of harmful, unethical, or illegal content by the LLM, potentially causing significant reputational damage and real-world harm. The dynamic nature of the attack makes it particularly difficult to defend against using existing single-turn mitigation techniques.

Affected Systems: Various LLMs, including but not limited to, LLaMA-3-8B, Mistral-7B, Qwen2.5-7B, GPT-4, Claude, and Gemini-1.5-Pro are shown to be vulnerable in the research paper. The vulnerability is likely to affect other LLMs as well.

Mitigation Steps:

  • Implement multi-turn dialogue safety scrutiny: Analyze the entire conversation context, not just individual prompts, to detect potentially harmful trajectories.
  • Develop robust context-aware safety mechanisms: LLM safety modules should actively track evolving conversation context and flag potentially harmful pathways before completion.
  • Utilize advanced detection methods: Move beyond simple keyword filtering to more sophisticated techniques that can identify subtle shifts in conversation direction indicative of adversarial influence. This might include techniques measuring semantic similarity changes from turn to turn for the LLM's output with respect to the prompted questions.
  • Adopt reinforcement learning for safety: Train LLMs to be more resilient to multi-turn manipulation through fine-tuning using data that includes adversarial prompts.

© 2025 Promptfoo. All rights reserved.