Dialogue History Jailbreak

Description: Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.

Examples:

Prefilling Attack (DIA-I): An attacker crafts a chat history that includes an affirmative beginning to a malicious response, followed by a "continue" command prompting the LLM to continue from that point. This bypasses safety measures that only check the initial tokens of an output. See paper for specific examples.
Deferred Response Attack (DIA-II): The attacker prompts the LLM to perform a word substitution task using benign examples before presenting a malicious prompt. The delay in responding to the malicious prompt increases the success rate. See paper for specific examples.

Impact: Successful DIA attacks can lead to LLMs generating harmful content, including but not limited to: biased outputs, generation of malicious code, disclosure of sensitive information, and promotion of illegal activities. This undermines the trustworthiness and reliability of LLM applications.

Affected Systems: LLMs that utilize a chat template to concatenate historical dialogues with the current prompt before processing, including but not limited to Llama-3.1, GPT-4, and other open-source models using similar chat architectures.

Mitigation Steps:

Improved Chat Template Design: Develop chat templates that are less susceptible to manipulation or that incorporate checks to detect inconsistencies or anomalies in the historical dialogue.
Enhanced Safety Mechanisms: Strengthen safety mechanisms to account for multi-turn interactions and to detect and mitigate instances of dialogue manipulation.
Robust Input Sanitization: Implement rigorous input sanitization techniques that can identify and remove or neutralize potentially malicious content from the chat history.
Advanced Detection Systems: Develop advanced detection systems capable of identifying patterns indicative of DIA attacks within chat logs.
Regular Security Audits: Conduct routine security audits to systematically assess LLMs for vulnerabilities related to dialogue injection and other adversarial techniques.

Dialogue History Jailbreak

Research Paper