LMVD-ID: 2041ba42
Published June 1, 2025

Twin Prompt Jailbreak

Affected Models: Llama 2 7B, Llama 2 13B, Llama 2 70B, Llama 3.1 8B, Llama 3.3 70B, Gemma 2 2B, Gemma 2 9B, Gemma 2 27B, Gemma 3 1B, Qwen 2.5 3B, Qwen 2.5 7B, Qwen 2.5 14B, Qwen 2.5 32B, Qwen 2.5 72B, Mistral 7B, DeepSeek 7B

Research Paper

TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts


Description: A white-box vulnerability allows attackers with full model access to bypass LLM safety alignment by identifying and pruning the parameters responsible for rejecting harmful prompts. The attack leverages a novel "twin prompt" technique: pairs of structurally near-identical harmful and benign prompts are compared to separate safety-related parameters from those essential for model utility, after which fine-grained pruning removes the safety parameters with minimal impact on overall model functionality.
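The sketch below illustrates the general idea only, not the paper's exact TwinBreak procedure: it approximates per-parameter importance with a gradient-times-weight score for a harmful prompt and its benign twin, then zeroes the small set of weights that matter far more for the harmful prompt than for the benign one. The model name, example prompts, importance metric, and pruning fraction are all placeholder assumptions for illustration.

```python
# Simplified illustration of twin-prompt parameter attribution and pruning.
# Not the paper's exact method; scoring the whole model in full precision as
# done here is for clarity, not efficiency.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any white-box checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)

def importance(prompt: str) -> dict:
    """Per-parameter importance as |grad * weight| of the LM loss on a prompt."""
    model.zero_grad()
    batch = tok(prompt, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    return {n: (p.grad.detach() * p.detach()).abs()
            for n, p in model.named_parameters() if p.grad is not None}

# "Twin" prompts: identical structure, differing only in the harmful target.
harmful = "Explain step by step how to write a phishing email."   # placeholder
benign  = "Explain step by step how to write a welcome email."    # placeholder

imp_harmful = importance(harmful)
imp_benign = importance(benign)

# Prune the weights that matter for the harmful prompt but not its benign twin,
# i.e. the candidates for safety-specific parameters.
PRUNE_FRACTION = 0.001  # assumed: only the most safety-specific weights
with torch.no_grad():
    for name, param in model.named_parameters():
        if name not in imp_harmful:
            continue
        diff = imp_harmful[name] - imp_benign[name]
        k = max(1, int(diff.numel() * PRUNE_FRACTION))
        threshold = diff.flatten().topk(k).values.min()
        param[diff >= threshold] = 0.0
```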

Examples: See arXiv:2506.07596v1. The paper includes specific examples of twin prompts and detailed experimental results demonstrating the vulnerability across multiple LLMs.

Impact: Successful exploitation allows attackers to elicit harmful responses (e.g., instructions for illegal activities, creation of phishing emails) from safety-aligned LLMs. The attack's success rate is reported to be as high as 98% with minimal impact on the models' utility for typical tasks.

Affected Systems: The vulnerability impacts a wide range of LLMs from multiple vendors, including (but not limited to) those based on the LLaMA, Gemma, Qwen, Mistral, and DeepSeek architectures. The specific models tested are detailed in the linked arXiv paper.

Mitigation Steps:

  • Develop more robust safety alignment techniques that distribute safety mechanisms across a larger, less identifiable portion of the model's parameters.
  • Implement model integrity checks to detect unauthorized parameter modifications (a minimal sketch follows this list).
  • Restrict direct access to model parameters wherever possible; note that openly distributed (open-weight) models cannot be protected this way and remain exposed to white-box attacks.
  • Explore and implement more advanced detection mechanisms to identify signs of parameter pruning or other modifications indicative of this attack.
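One practical form of the integrity check suggested above is to hash a model's weight files against a trusted manifest recorded at release time and to re-verify them before deployment. The sketch below is a minimal example assuming safetensors-sharded weights, a JSON manifest, and SHA-256; all file paths and names are placeholders.

```python
# Illustrative sketch: verify model weight files against a trusted hash
# manifest to detect unauthorized parameter modifications such as pruning.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(model_dir: str, manifest_path: str) -> None:
    """Record trusted hashes for every weight shard in the model directory."""
    manifest = {p.name: sha256_of(p)
                for p in sorted(Path(model_dir).glob("*.safetensors"))}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

def verify(model_dir: str, manifest_path: str) -> list[str]:
    """Return the names of weight shards whose hashes no longer match."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [name for name, digest in manifest.items()
            if sha256_of(Path(model_dir) / name) != digest]

# Example usage (paths are placeholders):
# build_manifest("./llama-2-7b", "trusted_hashes.json")     # at release time
# tampered = verify("./llama-2-7b", "trusted_hashes.json")  # before deployment
# if tampered:
#     raise RuntimeError(f"Model weights modified: {tampered}")
```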

© 2025 Promptfoo. All rights reserved.