Segmented Prompt Jailbreak

Description: Large Language Models (LLMs) incorporating safety filters are vulnerable to a "Prompt, Divide, and Conquer" attack. This attack segments a malicious prompt into smaller, seemingly benign parts, processes these segments in parallel across multiple LLMs, and then reassembles the results to generate malicious code, bypassing the safety filters. The attack's success relies on the iterative refinement of initially abstract function descriptions into concrete implementations. Individual LLM safety filters are bypassed because no single segment triggers the filter.

Examples: See arXiv:2405.18540 for detailed examples and the CySecBench dataset used for testing. The paper provides numerous examples demonstrating the generation of malicious code across various cybersecurity attack categories (e.g., ransomware, DoS attacks, phishing emails) using this technique.

Impact: Successful exploitation allows adversaries to generate malicious code using LLMs, even if those LLMs have safety measures in place. This significantly lowers the barrier to entry for creating harmful software, leading to increased risk of cyberattacks. The use of parallel processing across multiple LLMs could make detection more difficult. The iterative refinement process makes it difficult for simple keyword-based filtering to prevent malicious output.

Affected Systems: Large Language Models from Anthropic, Google, and OpenAI, and potentially others, that employ safety filters and support API access for prompt processing. The vulnerability seems to be inherent to the architecture.

Mitigation Steps:

Improve Segmentation Detection: Develop enhanced prompt analysis techniques to identify attempts at prompt segmentation and avoid generating partial responses to suspicious segments.
Contextual Awareness: Enhance safety filters to be more context-aware, considering the overall goal implied in a series of prompts rather than just assessing individual prompts in isolation.
Output Validation: Implement more robust output validation mechanisms that go beyond keyword-based filtering and analyze the generated code for malicious functionality. This could include dynamic analysis and runtime checks.
Rate Limiting: Introduce rate limits on API access to hinder parallel processing attacks.
Multi-Stage Filtering: Employ multiple stages of filtering, each operating on different aspects of the prompt and the generated response.

Segmented Prompt Jailbreak

Research Paper