LMVD-ID: cd6f0a9b
Published February 1, 2025

LLM Lower Layer Freeze Jailbreak

Affected Models: qwen2.5-7b-instruct, glm4, llama3.1, mistral, baichuan2, deepseek-r1-abliterated, qwen2.5, llama3.1-8b-instruct, baichuan2-7b-chat, glm-4-9b-chat-hf, mistral-8b-instruct-2410

Research Paper

Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content

View Paper

Description: A vulnerability exists in Large Language Models (LLMs) that allows efficient jailbreaking by fine-tuning only the lower layers of the model on a toxic dataset while keeping the remaining layers frozen. This "Freeze Training" method, as described in the research paper, concentrates fine-tuning on the layers identified as most sensitive to the generation of harmful content. The approach significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate.
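
Mechanically, the technique is ordinary selective layer freezing, applied to the bottom of the stack instead of the top. The sketch below is a minimal, hypothetical illustration using the Hugging Face transformers API: the attribute path model.model.layers follows common decoder layouts (including Qwen2-style models), and the layer count mirrors the paper's "Freeze-Front5-SFT" setting, but this is not the authors' actual training code.

```python
import torch
from transformers import AutoModelForCausalLM

# Minimal sketch of lower-layer-only fine-tuning ("Freeze-Front5-SFT" style).
# Assumption: a Hugging Face decoder model exposing its blocks as
# model.model.layers. This is NOT the paper's actual code.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"   # placeholder checkpoint
N_LOWER = 5                               # number of lowest blocks to tune

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Freeze every parameter, then re-enable gradients only for the lowest blocks.
for p in model.parameters():
    p.requires_grad = False
for block in model.model.layers[:N_LOWER]:
    for p in block.parameters():
        p.requires_grad = True

# Only the unfrozen lower blocks carry gradients and optimizer state, which
# is where the reported time and GPU-memory savings come from.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

From a defender's perspective, the notable point is that the attack requires no architectural changes: any standard SFT pipeline with per-parameter requires_grad control is sufficient.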

Examples: The paper demonstrates that fine-tuning only the first five layers ("Freeze-Front5-SFT") achieves an Attack Success Rate (ASR) of 84.19% and a Harm Score of 4.33 on the Qwen2.5-7B-Instruct model, using only 1.5 hours of training time and 169.2 GB of GPU memory.

Impact: Successful exploitation bypasses the safety mechanisms implemented in the LLM, allowing the generation of harmful content. The paper focuses specifically on jailbreak attacks, in which the model is made to produce outputs that violate ethical guidelines, legal requirements, or the model's intended use.

Affected Systems: LLMs whose weights are accessible for fine-tuning. Models tested in the paper include Qwen2.5-7B-Instruct, GLM4, Llama3.1, Mistral, and Baichuan2; however, all LLMs are likely to be affected.

Mitigation Steps:

  • When fine-tuning, distribute updates across a broad range of layers, balancing lower and higher layers; apply this practice to all training runs, not only safety alignment.
  • Develop metrics for identifying and mitigating the lower layers' sensitivity to harmful content, such as the "Comprehensive Sensitivity Score" (S_score) described in the paper; an illustrative probe is sketched after this list.
  • Evaluate safety alignment under both full-layer and lower-layer-only fine-tuning, and re-test the resulting models against current jailbreak methods.
  • During fine-tuning, penalize parameter updates that deviate from the aligned model's expected behavior; see the regularization sketch after this list.
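
The paper defines its own Comprehensive Sensitivity Score; as a loose, hypothetical proxy for the second bullet above, one can compare per-layer hidden-state shifts between matched benign and harmful prompts and check whether the largest shifts concentrate in the lower layers. The prompts, pooling, and distance metric here are illustrative choices, not the paper's S_score definition.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical per-layer sensitivity probe -- an illustrative proxy only,
# NOT the paper's Comprehensive Sensitivity Score (S_score).
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"   # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def pooled_layer_states(prompt: str) -> list[torch.Tensor]:
    """Mean-pool each layer's hidden states over the token dimension."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

benign = pooled_layer_states("Explain how household smoke detectors work.")
probe = pooled_layer_states("Explain how to disable a building's smoke detectors unnoticed.")

# If lower layers are more sensitive to harmful content, the largest
# activation shifts should appear at small layer indices.
for i, (b, p) in enumerate(zip(benign, probe)):
    print(f"layer {i:2d}  shift = {torch.norm(p - b).item():.3f}")
```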
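
For the final bullet, one generic realization (not prescribed by the paper) is an L2 anchor that penalizes trainable parameters for drifting away from the safety-aligned reference checkpoint during any subsequent fine-tuning:

```python
import torch

def anchored_loss(task_loss, model, ref_params, lam=0.01):
    """Task loss plus an L2 penalty tying weights to the aligned reference.

    Generic sketch: ref_params holds a frozen copy of the safety-aligned
    checkpoint's parameters; lam trades task fit against drift. The paper
    does not prescribe this exact penalty.
    """
    drift = torch.zeros((), device=task_loss.device)
    for name, p in model.named_parameters():
        if p.requires_grad:
            drift = drift + ((p - ref_params[name].to(p.device)) ** 2).sum()
    return task_loss + lam * drift

# Usage inside a training step (sketch):
#   ref_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   loss = anchored_loss(cross_entropy_loss, model, ref_params)
#   loss.backward()
```

Under such a penalty, updates concentrated in the lower layers pay a direct cost, which raises the price of freeze-training-style attacks for anyone fine-tuning through a controlled pipeline.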

© 2025 Promptfoo. All rights reserved.