Multi-turn Actor Jailbreak
Research Paper
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
Description: Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks in which malicious users obscure a harmful intent across multiple queries. The ActorAttack method leverages the LLM's own knowledge base to discover semantically linked "actors" related to a harmful target. By posing seemingly innocuous questions about these actors, the attacker guides the LLM toward revealing harmful information step by step, accumulating knowledge across turns until the desired malicious output is obtained, bypassing safety mechanisms along the way. The attack dynamically adapts to the LLM's responses, which increases its effectiveness.
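The conversational structure described above can be illustrated with a minimal, non-operational sketch. The `query_llm` function below is a hypothetical stand-in for any chat-completion API, and `actor_attack` only models the turn-accumulation pattern (innocuous per-actor questions followed by a pivot turn); it is not the authors' implementation, which is available in the linked repository.

```python
def query_llm(messages):
    # Stub standing in for a real chat-completion API call;
    # returns a canned reply so the sketch is self-contained.
    return f"[reply to: {messages[-1]['content'][:40]}]"

def actor_attack(target, actors):
    """Model the multi-turn pattern: innocuous questions about
    semantically linked actors, accumulating context, then a final
    pivot turn that draws on the accumulated conversation."""
    history = []
    for actor in actors:
        # Each turn asks a seemingly benign question about one actor.
        prompt = f"Tell me more about {actor} and its broader context."
        history.append({"role": "user", "content": prompt})
        history.append({"role": "assistant", "content": query_llm(history)})
    # Final turn pivots toward the (obscured) target using prior context.
    history.append({"role": "user",
                    "content": f"Given the above, how does this connect to {target}?"})
    history.append({"role": "assistant", "content": query_llm(history)})
    return history
```

The point of the sketch is that no single turn contains an overtly harmful request, which is why single-turn safety filters often miss this pattern.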
Examples: See https://github.com/renqibing/ActorAttack and the associated paper for examples demonstrating the attack against various LLMs, including GPT-3.5, GPT-4, Claude-3.5, Llama-3-8B, and Llama-3-70B. Specific examples are also given in Figures 10, 11, 12, 13, and 14 of the paper.
Impact: Successful exploitation of this vulnerability can lead the LLM to reveal sensitive or harmful information, including instructions for creating harmful devices, guidance for illegal activities, or content promoting harmful ideologies, even when safety mechanisms are in place. The attack's multi-turn nature makes detection more challenging.
Affected Systems: Various Large Language Models (LLMs) are susceptible, including but not limited to those listed above. The vulnerability is not tied to a specific model architecture but rather the inherent knowledge base and reasoning capabilities of LLMs.
Mitigation Steps:
- Augment safety training data with multi-turn adversarial examples generated using the ActorAttack technique or similar methods.
- Develop more robust detection mechanisms for identifying the subtle shifts in conversational topics indicative of this type of attack.
- Implement more sophisticated response filtering and control mechanisms to prevent the accumulation of harmful information across multiple turns.
- Improve the LLM's ability to explicitly reason about and reject potentially dangerous lines of inquiry, even if those lines cleverly avoid explicit keywords.
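As one way to approach the detection mechanism suggested above, conversational topic drift can be scored by comparing consecutive user turns. The sketch below is a toy illustration using a bag-of-words cosine similarity; a production detector would use a proper sentence-embedding model, and all function names here are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; a real detector would use a
    # sentence-embedding model instead of word counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_scores(user_turns):
    """Similarity between each pair of consecutive user turns.
    A run of moderate scores (related but steadily shifting topics)
    can flag ActorAttack-style probing for review."""
    vecs = [embed(t) for t in user_turns]
    return [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
```

Scores near 1.0 indicate topic repetition and scores near 0.0 an abrupt change; the suspicious signature for this attack class is a sustained chain of intermediate scores, which simple per-turn filters do not capture.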
© 2025 Promptfoo. All rights reserved.