LMVD-ID: 0b5c3e3f
Published January 1, 2024

Pruning Boosts LLM Safety

Affected Models: LLaMA-2 Chat, Vicuna 1.3, Mistral Instruct v0.2

Research Paper

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning

View Paper

Description: Large Language Models (LLMs) compressed with WANDA pruning exhibit sparsity-dependent safety behavior: moderate pruning (10-20% sparsity) can increase resistance to jailbreak attacks, while higher sparsity levels (above 20%) can erode that benefit and reduce resistance. The effect is not present in all LLMs, and its severity depends on the model's initial level of safety alignment.
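For context, WANDA scores each weight by the product of its magnitude and the norm of the corresponding input activation, then removes the lowest-scoring weights. The sketch below is a simplified, hypothetical NumPy illustration of that scoring rule (the paper applies it per layer with calibration data, which is omitted here):

```python
import numpy as np

def wanda_prune(weights, act_norms, sparsity):
    """Zero out the lowest-scoring weights per output row.

    weights:   (out_features, in_features) weight matrix
    act_norms: (in_features,) L2 norms of input activations from calibration data
    sparsity:  fraction of weights to remove in each row (e.g. 0.2 for 20%)
    """
    scores = np.abs(weights) * act_norms          # WANDA importance metric |W| * ||X||
    k = int(weights.shape[1] * sparsity)          # weights to drop per row
    pruned = weights.copy()
    if k > 0:
        # indices of the k lowest-scoring weights in each row (unordered)
        idx = np.argpartition(scores, k - 1, axis=1)[:, :k]
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

# Toy usage: prune a random 8x16 matrix at 20% sparsity
W = np.random.randn(8, 16)
norms = np.linalg.norm(np.random.randn(32, 16), axis=0)
W_pruned = wanda_prune(W, norms, 0.2)
```

Because the score folds in activation statistics, WANDA can remove different weights than plain magnitude pruning, which is one reason its safety side effects are not obvious in advance.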

Examples: See the paper "Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning" for specific examples and quantitative data demonstrating the vulnerability across different LLMs (LLaMA-2 Chat, Vicuna 1.3, Mistral Instruct v0.2) and various jailbreak techniques. The paper includes charts showing refusal rates at different pruning levels.

Impact: Successful jailbreak attacks can lead to the LLM generating harmful or inappropriate content, including but not limited to misinformation, hate speech, and instructions for illegal activities. The vulnerability's impact is model-dependent and is influenced by the initial level of safety training.

Affected Systems: Large Language Models that use WANDA pruning for model compression, particularly models with weaker initial safety training, in which the sparsity-dependent effect is more pronounced.

Mitigation Steps:

  • Evaluate the trade-off between model size reduction and jailbreak resistance before deploying pruned LLMs. A moderate pruning level (around 10-20%) may be a safer option.
  • Thoroughly test LLMs (pruned and unpruned) against a wide range of known jailbreak techniques to identify weaknesses and quantify risks, especially before deployment in safety-critical applications.
  • Consider using alternative model compression techniques or incorporating additional defenses to mitigate the effects of this vulnerability. Further research is needed to determine if all pruning methodologies have a similar effect.
  • Continuously monitor and update LLMs with improved defenses against adversarial attacks. Adversarial evaluation should be an integral component of LLM development.
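The testing step above amounts to measuring refusal rates on a jailbreak prompt suite at each candidate sparsity level. A minimal sketch, assuming a hypothetical `generate` callable that wraps your model and a simple keyword heuristic for detecting refusals (real evaluations should use a stronger classifier):

```python
# Substrings commonly found in refusal responses; a crude heuristic, for illustration only.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def refusal_rate(generate, jailbreak_prompts):
    """Fraction of jailbreak prompts the model refuses, per the keyword heuristic.

    generate:          callable mapping a prompt string to the model's response string
    jailbreak_prompts: list of adversarial prompt strings
    """
    refusals = sum(
        any(marker in generate(prompt).lower() for marker in REFUSAL_MARKERS)
        for prompt in jailbreak_prompts
    )
    return refusals / len(jailbreak_prompts)

# Intended usage: compare refusal rates across pruning levels before deployment, e.g.
# for sparsity, model_fn in pruned_models.items():   # pruned_models is hypothetical
#     print(sparsity, refusal_rate(model_fn, prompts))
```

A drop in refusal rate at a given sparsity level is the signal to reject that compression setting or add compensating defenses.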

© 2025 Promptfoo. All rights reserved.