Implicit Prompt Code Jailbreak
Research Paper
Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts
Description: Large Language Models (LLMs) used for code generation are vulnerable to a jailbreaking attack that leverages implicit malicious prompts. The attack exploits the fact that existing safety mechanisms primarily rely on detecting explicit malicious intent in the prompt instructions. By embedding the malicious intent implicitly in a benign-appearing commit message that accompanies a code request (e.g., in a simulated software-evolution scenario), an attacker can bypass the LLM's safety filters and induce it to generate malicious code. The malicious intent is never stated directly in the instruction; it is only hinted at through the commit message and the accompanying code snippet.
Examples: See the research paper "Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts" for specific examples. The paper includes examples demonstrating the creation of implicit malicious prompts for text-to-code, function-level completion, and block-level completion tasks.
Impact: Successful exploitation allows attackers to generate malicious code, such as malware or denial-of-service tooling, bypassing the LLM's built-in safety features. The generated code can then be used to compromise systems or data. The attack's success rate is significantly higher than that of techniques relying solely on explicitly malicious prompts.
Affected Systems: LLM-based code generation systems built on models susceptible to this implicit-prompt jailbreak technique. The specific models affected are reported in the research paper.
Mitigation Steps:
- Implement more robust safety mechanisms that analyze not only explicit instructions but also the complete context of the prompt, including any attached information such as commit messages (see the sketch after this list).
- Develop techniques to detect and mitigate implicit cues indicating malicious intent.
- Adopt a layered security approach with multiple independent safety checks.
- Regularly update and improve the models' safety training data to include examples of implicit malicious prompts.
- Carefully review and validate all generated code for malicious behavior before deploying it.
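As a rough illustration of the first mitigation, the sketch below screens the full prompt context (instruction, commit message, and code snippet) rather than the instruction alone. All names here (`CodeGenRequest`, `screen_request`, `SUSPICIOUS_CONTEXT_PATTERNS`) are hypothetical, and the keyword heuristics are placeholders; a production system would more likely use a trained classifier or a separate moderation model over the same combined context.

```python
import re
from dataclasses import dataclass

@dataclass
class CodeGenRequest:
    instruction: str     # explicit task, e.g. "complete this function"
    commit_message: str  # attached context that may carry implicit intent
    code_snippet: str    # existing code the model is asked to extend

# Hypothetical heuristics: phrases that look benign in an instruction but,
# combined with certain code contexts, may signal implicit malicious intent.
SUSPICIOUS_CONTEXT_PATTERNS = [
    r"bypass\s+(auth|login|verification)",
    r"disable\s+(logging|antivirus|firewall)",
    r"exfiltrat\w*",
    r"keylog\w*",
    r"flood\s+the\s+(server|target)",
]

def screen_request(request: CodeGenRequest) -> list[str]:
    """Scan the whole prompt context, not just the instruction, and return
    any findings; an empty list means no heuristic triggered."""
    findings = []
    # Concatenate every field so implicit cues in the commit message or the
    # code snippet are checked alongside the explicit instruction.
    full_context = "\n".join(
        [request.instruction, request.commit_message, request.code_snippet]
    )
    for pattern in SUSPICIOUS_CONTEXT_PATTERNS:
        if re.search(pattern, full_context, flags=re.IGNORECASE):
            findings.append(f"matched suspicious pattern: {pattern}")
    return findings

if __name__ == "__main__":
    request = CodeGenRequest(
        instruction="Update the function to reflect the commit below.",
        commit_message="fix: disable logging before collecting user input",
        code_snippet="def handle_login(user, password): ...",
    )
    for finding in screen_request(request):
        print("flagged:", finding)
    # A flagged request would be routed to a stricter, independent check
    # (or refused) rather than passed straight to code generation.
```

Keyword heuristics like these are easy to evade, which is why the layered-security and safety-training mitigations above matter: contextual screening is only one of several independent checks, not a complete defense.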