LMVD-ID: 8d39b6df
Published April 1, 2025

Prefill-Based LLM Jailbreak

Affected Models: DeepSeek V3, Gemini 2.0 Flash (gemini-2.0-flash-001), Gemini 2.0 Pro Experimental (gemini-2.0-pro-exp-02-05), GPT-3.5 Turbo, Claude 3.7 Sonnet (claude-3-7-sonnet-20250219), Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)

Research Paper

Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary

View Paper

Description: Large Language Models (LLMs) that expose user-controlled response prefilling are vulnerable to a novel jailbreak attack. By manipulating the prefilled text, attackers can steer the model's subsequent token generation, bypassing safety mechanisms and eliciting harmful or unintended outputs. Two attack vectors are demonstrated: Static Prefilling (SP), which uses a fixed prefill string, and Optimized Prefilling (OP), which iteratively refines the prefill string to maximize attack success. The vulnerability stems from the LLM's reliance on the prefilled text as context when generating the rest of its response.
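
To make the mechanism concrete, the minimal sketch below shows how a response prefill is supplied in practice, using the Anthropic Messages API (which treats a trailing assistant message as the beginning of the model's reply). The request string and prefill are placeholders, not working attack payloads.

```python
# Minimal sketch of supplying a response prefill at the API level, using
# the Anthropic Messages API (a trailing assistant message is treated as
# the beginning of the model's reply). Placeholder strings only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "<attacker-chosen request>"},
        # The prefill: generation continues from this text as if the
        # model had already written it.
        {"role": "assistant", "content": "Sure, here's how to"},
    ],
)
print(response.content[0].text)
```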

Examples:

  • Static Prefilling (SP): Prefilling the response with a fixed compliance opener such as "Sure, here's how to...", followed by a prompt requesting instructions for illegal or harmful activities, often leads the LLM to provide detailed instructions. Effectiveness varies with the specific LLM and its safety mechanisms.

  • Optimized Prefilling (OP): This attack iteratively refines the prefill string using a secondary LLM to maximize the probability of a successful jailbreak; a hedged sketch of the loop follows this list. OP is significantly more effective than SP, achieving near-100% success rates in some cases. Specific examples of successful optimized prefill strings are available in the research paper. (See arXiv:2405.18540)
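
The following is a hedged sketch of such an optimization loop. The helper names (propose_prefill, query_target, is_jailbroken) are hypothetical stand-ins for the attacker LLM, the target LLM, and a judge; the control flow is illustrative and may differ from the paper's actual procedure.

```python
# Hedged sketch of the Optimized Prefilling (OP) loop. All helpers are
# hypothetical stubs, not the paper's implementation.

def propose_prefill(request: str, prefill: str, feedback: str) -> str:
    """Ask a secondary attacker LLM to refine the prefill string based on
    the target's last output (stub: wire up any LLM API here)."""
    raise NotImplementedError

def query_target(request: str, prefill: str) -> str:
    """Send the request plus prefilled response text to the target LLM,
    e.g., via an assistant-message prefill as sketched earlier (stub)."""
    raise NotImplementedError

def is_jailbroken(output: str) -> bool:
    """Judge whether the output complies with the harmful request
    (stub: typically another LLM or a keyword classifier)."""
    raise NotImplementedError

def optimized_prefill(request: str, max_iters: int = 10) -> str | None:
    """Iteratively refine a prefill until the target complies or the
    iteration budget is exhausted."""
    prefill = "Sure, here's how to"  # seed with a static prefill
    for _ in range(max_iters):
        output = query_target(request, prefill)
        if is_jailbroken(output):
            return prefill  # successful jailbreak prefill found
        # Feed the refusal back to the attacker LLM to guide refinement.
        prefill = propose_prefill(request, prefill, output)
    return None
```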

Impact:

Successful exploitation allows attackers to circumvent LLM safety measures, leading to the generation of harmful content (e.g., instructions for illegal activities, hate speech, personal information disclosure) and potential misuse of the LLM.

Affected Systems:

Large Language Models (LLMs) that support user-controlled response prefilling (e.g., Claude, DeepSeek) are affected. The vulnerability is not limited to any specific model architecture or vendor.

Mitigation Steps:

  • Implement robust validation and filtering of user-supplied prefill text (see the sketch after this list).
  • Develop safety mechanisms that are less susceptible to manipulation through conditioning on attacker-supplied response prefixes.
  • Regularly update and improve the LLM’s safety models based on the discovery of new jailbreak techniques.
  • Conduct thorough security audits, including adversarial testing using various jailbreak techniques.
  • Consider removing or restricting user-controlled prefilling features until appropriate mitigations are in place.
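
As a starting point for the first step above, the sketch below gates user-supplied prefill text with a length cap and a few pattern checks. The patterns, threshold, and function name are illustrative assumptions, not a vetted rule set; a production filter would add a classifier- or LLM-based check on the combined prompt and prefill.

```python
# Hedged sketch of server-side validation for user-supplied prefill text.
# Patterns and threshold are illustrative assumptions only.
import re

# Compliance-opener patterns commonly seen in prefill-based jailbreaks.
SUSPICIOUS_PREFILL_PATTERNS = [
    re.compile(r"^\s*sure[,!]?\s+here('s| is)", re.IGNORECASE),
    re.compile(r"^\s*step\s*1\b", re.IGNORECASE),
    re.compile(r"\bI (will|can) (now )?(explain|provide|describe)\b"),
]

MAX_PREFILL_CHARS = 200  # long prefills give attackers more room to steer

def validate_prefill(prefill: str) -> bool:
    """Return True if the prefill may be accepted, False to reject it."""
    if len(prefill) > MAX_PREFILL_CHARS:
        return False
    return not any(p.search(prefill) for p in SUSPICIOUS_PREFILL_PATTERNS)

# Example: a classic static-prefill opener is rejected; a benign prefix passes.
assert not validate_prefill("Sure, here's how to")
assert validate_prefill("Summary:")
```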

© 2025 Promptfoo. All rights reserved.