LMVD-ID: b0f3784c
Published February 1, 2025

Multi-Turn Foot-In-The-Door Jailbreak

Affected Models: llama-3.1-8b-instruct, llama-3-8b-instruct, qwen2-7b-instruct, qwen-1.5-7b-chat, mistral-7b-instruct-v0.2, gpt-4o-mini, gpt-4o-2024-08-06

Research Paper

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs


Description: A multi-turn jailbreak attack, termed "Foot-In-The-Door" (FITD), exploits the psychological principle of incremental commitment to progressively escalate malicious requests and bypass LLM safety mechanisms. The attack leverages intermediate "bridge" prompts and self-alignment techniques to coax the model into generating increasingly harmful outputs, even when it would refuse the same request posed directly.
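For illustration, below is a minimal sketch of the FITD conversation loop. This is not the authors' implementation: the openai client usage, the is_refusal heuristic, the model name, and the placeholder prompt chain are all assumptions, and the actual bridge prompts are deliberately omitted.

```python
# Minimal FITD loop sketch, for red-teaming illustration only.
# All prompt contents, the refusal heuristic, and the model name are
# placeholder assumptions, not the paper's actual prompts or code.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # one of the models evaluated in the paper


def is_refusal(text: str) -> bool:
    """Crude heuristic: common refusal phrases signal a refusal."""
    markers = ("i can't", "i cannot", "i'm sorry", "i am sorry")
    return any(m in text.lower() for m in markers)


def fitd_dialogue(prompt_chain: list[str]) -> list[dict]:
    """Send a benign-to-target chain of 'bridge' prompts one turn at a
    time, carrying the full conversation history across turns."""
    messages: list[dict] = []
    for prompt in prompt_chain:
        messages.append({"role": "user", "content": prompt})
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        answer = reply.choices[0].message.content
        if is_refusal(answer):
            # The paper re-phrases or self-aligns on refusal; this sketch
            # simply stops, since the bridging structure is the point.
            break
        messages.append({"role": "assistant", "content": answer})
    return messages
```

The structural point is that the full history is resent each turn, so every small compliance becomes context that normalizes the next, slightly stronger request.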

Examples: See https://github.com/Jinxiaolong1129/Foot-inthe-door-Jailbreak and https://fitd-foot-in-the-door.github.io/. Examples include escalating from a harmless inquiry about email security features to detailed instructions for hacking an email account, or from a benign writing request to a hate-filled letter.

Impact: Successful exploitation of this vulnerability allows attackers to bypass LLM safety filters and elicit harmful, unethical, or illegal content, including malicious code, instructions for illegal activities, hate speech, and leaks of personally identifiable information. The attack's high success rate (an average of 94% across the seven evaluated models) highlights a significant risk to LLM safety and security.

Affected Systems: The vulnerability affects a wide range of LLMs, including both open-source (LLaMA, Qwen, Mistral) and closed-source (GPT-4o) models. The attack also demonstrates cross-model transferability: prompt chains developed against one model are often effective against others.

Mitigation Steps:

  • Implement more robust multi-turn conversation safety mechanisms that detect and prevent gradual escalation of malicious intent (a minimal detection sketch follows this list).
  • Develop improved alignment techniques that are resilient to iterative manipulation and "self-corruption" of model outputs.
  • Incorporate more sophisticated detection mechanisms that account for paraphrasing and contextual shifting of requests.
  • Regularly audit LLMs for vulnerabilities using advanced automated red-teaming techniques that simulate multi-turn interactions.
  • Enhance user input validation and filtering to detect and mitigate malicious intent in multi-turn dialogue, including techniques that specifically address the "foot-in-the-door" pattern.
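As a concrete instance of the first mitigation above, the sketch below scores each user turn with a moderation classifier and flags dialogues whose scores drift steadily upward, even when no single turn crosses a blocking threshold. It assumes the openai Python SDK's moderation endpoint as the scorer; harm_score, flags_fitd_pattern, and the threshold values are illustrative, not a vetted defense.

```python
# Sketch of per-turn escalation monitoring for FITD-style dialogues.
# The scorer choice and threshold values are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def harm_score(text: str) -> float:
    """Maximum category score from a moderation classifier, in [0, 1]."""
    result = client.moderations.create(
        model="omni-moderation-latest", input=text
    ).results[0]
    return max(result.category_scores.model_dump().values())


def flags_fitd_pattern(user_turns: list[str],
                       drift_threshold: float = 0.2,
                       ceiling: float = 0.5) -> bool:
    """Flag a dialogue whose harm scores climb over the conversation:
    cumulative drift is caught even if no single turn would be blocked."""
    scores = [harm_score(t) for t in user_turns]
    if not scores:
        return False
    drift = scores[-1] - min(scores)
    return drift >= drift_threshold or max(scores) >= ceiling
```

Because FITD relies on each step being only marginally more harmful than the last, monitoring the trajectory of scores across turns, rather than each turn in isolation, is what makes this check relevant to the attack pattern.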

© 2025 Promptfoo. All rights reserved.