LMVD-ID: 554a1514
Published November 1, 2025

Evolutionary Language Model Jailbreak

Affected Models: Qwen2.5-7B, Gemma-2-9B, DeepSeek-V3 (API), TranSpec-13B

Research Paper

FORGEDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

View Paper

Description: A vulnerability exists in aligned Large Language Models (LLMs) that can be exploited by the FORGEDAN evolutionary framework to bypass safety and alignment mechanisms. The attack, which operates in a black-box setting, uses a genetic algorithm to automatically evolve effective jailbreak prompts. The framework combines multi-strategy textual perturbations (at the character, word, and sentence levels) with a semantic fitness function based on RoBERTa embeddings. This allows it to iteratively generate diverse and semantically coherent adversarial prompts that are highly effective at inducing the target model to produce harmful, unsafe, or policy-violating content. The attack's success is verified using a dual-dimensional judgment mechanism that independently classifies a response for compliance and harmfulness, improving the reliability and success rate over previous methods. The generated adversarial prompts demonstrate high attack success rates (ASR) and generalizability across different models and malicious goals.
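The evolutionary search described above can be sketched as a simple genetic loop. This is an illustrative reconstruction, not the paper's implementation: `evolve_prompts`, `mutate`, and `fitness` are hypothetical names, and in FORGEDAN the fitness function would score semantic closeness to the attack goal using RoBERTa embeddings rather than the toy callable shown here.

```python
import random

def evolve_prompts(seed_prompts, mutate, fitness, generations=10, population=20, elite=5):
    """Black-box evolutionary search over jailbreak prompts (illustrative sketch).

    mutate:  callable(str) -> str, applies one textual perturbation
    fitness: callable(str) -> float, higher = closer to the attack goal
    """
    pool = list(seed_prompts)
    for _ in range(generations):
        # Expand the pool by mutating randomly chosen parents.
        while len(pool) < population:
            parent = random.choice(pool)
            pool.append(mutate(parent))
        # Selection: keep only the top-scoring candidates for the next generation.
        pool.sort(key=fitness, reverse=True)
        pool = pool[:elite]
    return pool
```

Because the loop only ever queries the model through the fitness function, the attack needs no access to weights or gradients, which is what makes it black-box and model-agnostic.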

Examples: The attack generates adversarial prompts by applying a variety of mutation strategies to an initial template. The following are representative examples of the mutation operators used:

  • Character-level (Homoglyph Substitution): "bomb" is mutated to "b0mb"
  • Character-level (Deletion): "weapon" is mutated to "wepon"
  • Word-level (Synonym Replacement): "build a bomb" is mutated to "construct a bomb"
  • Sentence-level (Restructuring): "How to build a bomb?" is mutated to "The process of bomb building is..."
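The character- and word-level operators above are straightforward string transformations. A minimal sketch follows; the substitution and synonym tables are small illustrative subsets invented for this example, not the framework's actual dictionaries.

```python
import random

# Illustrative homoglyph table (a real attack would use a much larger set).
HOMOGLYPHS = {"o": "0", "i": "1", "e": "3", "a": "@"}

def homoglyph_substitution(word):
    """Character-level: swap letters for look-alikes ("bomb" -> "b0mb")."""
    return "".join(HOMOGLYPHS.get(c, c) for c in word)

def random_deletion(word, rng=random):
    """Character-level: drop one character ("weapon" -> e.g. "wepon")."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

# Illustrative synonym table.
SYNONYMS = {"build": "construct", "make": "create"}

def synonym_replacement(sentence):
    """Word-level: replace words with synonyms ("build a bomb" -> "construct a bomb")."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.split())
```

Operators like these preserve the malicious intent of a prompt while varying its surface form, which is what lets the evolutionary search slip past filters keyed to exact tokens.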

Impact: A successful exploit allows an attacker to bypass the safety alignment of an LLM, causing it to generate harmful content that it is designed to refuse. This includes, but is not limited to, generating instructions for illegal acts, producing hate speech, creating misinformation, and outputting other policy-violating content.

Affected Systems: The FORGEDAN framework was successfully tested against the following models, indicating their vulnerability:

  • Gemma-2-9B
  • Qwen2.5-7B
  • DeepSeek-V3 (API)
  • TranSpec-13B (proprietary)

Due to the black-box and model-agnostic nature of the attack, other aligned LLMs may also be vulnerable.

Mitigation Steps: The paper recommends a multi-layered defense-in-depth strategy:

  • Systematically incorporate adversarial red-teaming samples, such as those generated by this framework, into safety-tuning pipelines (e.g., Supervised Fine-Tuning and Reinforcement Learning from Human Feedback).
  • Design stronger runtime safety fences, such as a dual-classifier architecture that distinguishes behavioral compliance with an unsafe request from the semantic harmfulness of the generated content.
  • Enhance the robustness and coverage of system prompts to enable dynamic refusal behavior at any generation stage, rather than relying on static refusal templates.
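The dual-classifier idea, used both in the attack's success judgment and in the second mitigation bullet above, can be sketched as two independent checks whose conjunction decides the outcome. The keyword heuristics here are trivial stand-ins for real learned classifiers, and all function names are hypothetical.

```python
def is_compliant(response: str) -> bool:
    """Behavioral axis: did the model comply rather than refuse? (toy heuristic)"""
    refusal_markers = ("i cannot", "i can't", "i won't", "i'm sorry")
    return not any(marker in response.lower() for marker in refusal_markers)

def is_harmful(response: str) -> bool:
    """Semantic axis: is the content itself harmful? (toy keyword stand-in)"""
    flagged_terms = ("explosive", "detonator", "synthesize")
    return any(term in response.lower() for term in flagged_terms)

def jailbreak_succeeded(response: str) -> bool:
    """A jailbreak counts only if the model both complied AND produced harm."""
    return is_compliant(response) and is_harmful(response)
```

Judging the two axes separately avoids the classic false positive of earlier single-judge methods: a response that complies with an unsafe request but says nothing actually harmful (or vice versa) is correctly scored as a failed attack.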

© 2025 Promptfoo. All rights reserved.