Combined Malicious Code Jailbreak
Research Paper
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
View PaperDescription: Mainstream Large Language Models (LLMs), including DeepSeek-R1, GPT-4o, Llama-3.3-70B-Instruct, and Qwen-Coder, are vulnerable to black-box jailbreak attacks that bypass safety alignment mechanisms to generate functional malicious code. The vulnerability is exploited through specific prompt engineering techniques, most notably "Benign Expression" (substituting malicious keywords with harmless synonyms) and "DRA" (Decomposed Requirement Attack), which conceal malicious intent within seemingly harmless contexts or fragmented character sequences. While standard safety filters often reject direct requests for malware, these adversarial mutations reduce rejection rates significantly (dropping from ~60-70% to ~40%), inducing the models to generate "Level 4" quality responses: comprehensive, detailed, and accurate code for malware categories such as Trojans, Ransomware, Denial of Service (DoS) scripts, and System Destruction tools. High-reasoning models (e.g., OpenAI-o1, DeepSeek-R1) are particularly prone to generating high-sophistication malware once the initial refusal barrier is breached.
Examples: Specific jailbreak prompts and the corresponding malicious code outputs are available in the MalwareBench dataset.
- Benign Expression Attack: The attacker replaces specific triggers (e.g., "encrypt files," "delete system32") with semantic equivalents that do not trigger safety filters, requesting code that performs the same logic under the guise of system administration or testing.
- See repository: https://github.com/MAIL-Tele-AI/MalwareBench
- DRA (Decomposed Requirement Attack): The malicious requirement is broken down into individual letters or phonemes and reconstructed within the context of a benign sentence, bypassing string-matching safety guardrails.
- See dataset: MalwareBench (specifically the "System Destruction" and "Information Theft" categories).
Impact:
- Malware Generation: Attackers can generate functional code for 29 malware categories, including Keyloggers, DDoS scripts, and file wipers.
- Lowered Barrier to Entry: Enables low-skill actors ("script kiddies") to acquire sophisticated malware code by leveraging the advanced reasoning capabilities of state-of-the-art LLMs.
- Evasion of Static Analysis: Code generated by LLMs may differ syntactically from known malware signatures, potentially evading traditional antivirus detection.
- Operational Risk: Models deployed in automated coding assistants could be manipulated to inject malicious logic into legitimate software projects.
Affected Systems:
- DeepSeek: DeepSeek-R1, DeepSeek-Coder-V2-Instruct-0724
- OpenAI: GPT-4o, OpenAI-o1 (when alignment is bypassed)
- Meta/Open Source: Llama-3.3-70B-Instruct
- Alibaba Cloud: Qwen-Coder-Turbo, Qwen-Max
- Mistral: Mistral-Large-Instruct
- Other: Spark Desk v4.0, Wizard series models
Mitigation Steps:
- Adversarial Training: Incorporate the MalwareBench dataset (3,520 jailbreaking prompts) into the model's safety alignment training (RLHF) to improve rejection of "Benign Expression" and "DRA" attacks.
- Intrinsic Safety Alignment: Adopt training methodologies similar to CodeLlama’s "RLHF V5," which integrates diverse "safety" and "helpfulness" data specifically for code, rather than relying solely on external guardrails (like Llama Guard) or generic text safety alignment.
- Context-Aware Filtering: Implement input/output filtering that analyzes code logic and intent rather than relying solely on keyword matching, to detect malicious logic disguised by benign variable names or comments.
- Reasoning Chain Inspection: For Chain-of-Thought (CoT) models (like o1 and R1), monitor the reasoning tokens for alignment violations before the final code is generated.
© 2026 Promptfoo. All rights reserved.