LMVD-ID: 3d287561
Published March 1, 2025

Autonomous Multi-Turn LLM Jailbreak

Affected Models: gpt-3.5-turbo, gpt-4, llama3.1-70b

Research Paper

Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

View Paper

Description: Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that exploit incremental policy erosion. The attacker uses a breadth-first search strategy to generate multiple prompts at each turn, leveraging partial compliance from previous responses to gradually escalate the conversation towards eliciting disallowed outputs. Minor concessions accumulate, ultimately leading to complete circumvention of safety measures.
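
The snippet below is a minimal, abstract sketch of the breadth-first escalation loop described above, written as a red-teaming harness skeleton. It is not the Siege paper's implementation: the helpers `query_model`, `generate_followups`, and `compliance_score` are hypothetical placeholders that would have to be wired to a target model, an attacker/generator model, and a policy classifier for any real evaluation.

```python
"""Abstract sketch of a breadth-first, multi-turn escalation search.
All helper functions are hypothetical stand-ins, not the paper's code."""
from dataclasses import dataclass, field
from typing import List, Tuple


def query_model(history: List[Tuple[str, str]], prompt: str) -> str:
    """Stub: send the conversation so far plus the new prompt to the target model."""
    raise NotImplementedError("wire this to the target model's chat API")


def generate_followups(history: List[Tuple[str, str]], n: int) -> List[str]:
    """Stub: propose n follow-up prompts conditioned on prior partial compliance."""
    raise NotImplementedError("wire this to an attacker/generator model")


def compliance_score(response: str) -> float:
    """Stub: score (0..1) how far a response drifts toward disallowed content."""
    raise NotImplementedError("wire this to a policy classifier")


@dataclass
class Branch:
    history: List[Tuple[str, str]] = field(default_factory=list)  # (prompt, response) turns
    score: float = 0.0  # cumulative partial-compliance signal across turns


def bfs_escalation(seed_prompt: str, max_turns: int = 5, width: int = 3) -> Branch:
    """Breadth-first search over conversation branches, keeping the branches
    that drift furthest each turn -- the incremental policy erosion described above."""
    frontier = [Branch(history=[(seed_prompt, query_model([], seed_prompt))])]
    for _ in range(max_turns):
        children = []
        for branch in frontier:
            for prompt in generate_followups(branch.history, n=width):
                response = query_model(branch.history, prompt)
                children.append(Branch(
                    history=branch.history + [(prompt, response)],
                    score=branch.score + compliance_score(response),
                ))
        # Prune to the highest cumulative scores so the tree stays tractable.
        frontier = sorted(children, key=lambda b: b.score, reverse=True)[:width]
    return max(frontier, key=lambda b: b.score)
```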

Examples: See the "Siege" paper for concrete multi-turn attack transcripts; specific adversarial prompts are not reproduced here.

Impact: Successful exploitation allows attackers to circumvent LLM safety restrictions and obtain disallowed information or instructions, potentially leading to the generation of harmful content, such as instructions for illegal activities, malicious code, or personally identifiable information. The paper reports attack success rates of 100% on GPT-3.5-turbo and 97% on GPT-4 within a single multi-turn run.

Affected Systems: Large Language Models (LLMs) susceptible to multi-turn adversarial prompting, including (but not limited to) GPT-3.5-turbo, GPT-4, and Llama 3.1-70B.

Mitigation Steps:

  • Implement more robust multi-turn safety mechanisms that account for cumulative policy violations and adapt dynamically to adversarial interactions.
  • Develop detection mechanisms that recognize partial compliance and flag gradual safety erosion before it culminates in a full policy violation.
  • Track incremental concessions across turns so that partial compliance in one response does not widen what the model will reveal in the next.
  • Strengthen prompt filtering and response validation to prevent the re-injection of previously leaked information into subsequent queries, and assess risk not just on individual turns but cumulatively across the conversation (see the sketch after this list).
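
As a minimal sketch of the conversation-level risk assessment suggested in the last step, the function below combines a per-turn filter with a decayed running total of small concessions. The per-turn classifier `turn_risk` is a hypothetical placeholder, and the thresholds and decay factor are illustrative values, not tuned recommendations.

```python
"""Sketch of cumulative (conversation-level) risk scoring.
`turn_risk` is a hypothetical placeholder; thresholds are illustrative."""
from typing import List, Tuple


def turn_risk(user_msg: str, assistant_msg: str) -> float:
    """Stub: return a 0..1 risk score for a single exchange."""
    raise NotImplementedError("wire this to a moderation or policy classifier")


def should_refuse(conversation: List[Tuple[str, str]],
                  per_turn_threshold: float = 0.8,
                  cumulative_threshold: float = 1.5,
                  decay: float = 0.9) -> bool:
    """Flag a conversation when a single turn is clearly unsafe OR when many
    mildly risky turns accumulate -- the gradual-erosion pattern this entry describes."""
    cumulative = 0.0
    for user_msg, assistant_msg in conversation:
        risk = turn_risk(user_msg, assistant_msg)
        if risk >= per_turn_threshold:
            return True                         # ordinary single-turn filter still applies
        cumulative = cumulative * decay + risk  # decayed running total of partial concessions
        if cumulative >= cumulative_threshold:
            return True                         # many small concessions add up
    return False
```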

© 2025 Promptfoo. All rights reserved.