LMVD-ID: 2df28ac3
Published January 1, 2024

Weak-to-Strong LLM Jailbreak

Affected Models: llama2-7b-chat, llama2-13b-chat, vicuna-13b, baichuan2-13b, internlm-20b, llama2-70b, sheared-llama-1.3b

Research Paper

Weak-to-Strong Jailbreaking on Large Language Models


Description: A vulnerability in the safety alignment of large language models (LLMs) allows a "weak-to-strong" jailbreaking attack. The attack uses a small, adversarially fine-tuned ("unsafe") LLM to steer the decoding probabilities of a much larger, safety-aligned ("safe") LLM, leading the larger model to generate harmful outputs. It exploits the observation that the decoding distributions of safe and unsafe LLMs differ significantly at the start of generation, with the difference diminishing as generation progresses. By rescaling the larger model's token probabilities with the ratio between the unsafe weak model's distribution and that of its safety-aligned counterpart during these initial decoding steps, the attacker can override the larger model's safety mechanisms. The attack requires only one forward pass per example through the target LLM, making it computationally inexpensive.
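A minimal sketch of this style of decoding-time steering is shown below, using Hugging Face Transformers. It assumes the ratio-amplification form described above with an amplification factor alpha; the model identifiers, the prompt, and the path to the adversarially tuned weak model are placeholders, and the exact combination rule should be verified against the authors' repository before use.

```python
# Illustrative sketch of weak-to-strong decoding: the strong model's next-token
# distribution is rescaled by the ratio of an unsafe weak model's probabilities
# to those of its safety-aligned counterpart. Model names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def weak_to_strong_next_token_probs(strong, weak_safe, weak_unsafe, input_ids, alpha=1.0):
    """One steered step: p~(y) is proportional to p_strong(y) * (p_unsafe(y) / p_safe(y))**alpha."""
    with torch.no_grad():
        logp_strong = strong(input_ids).logits[:, -1, :].log_softmax(dim=-1)
        logp_safe = weak_safe(input_ids).logits[:, -1, :].log_softmax(dim=-1)
        logp_unsafe = weak_unsafe(input_ids).logits[:, -1, :].log_softmax(dim=-1)
    combined = logp_strong + alpha * (logp_unsafe - logp_safe)
    return combined.softmax(dim=-1)  # renormalized next-token distribution

# Usage sketch: all three models must share a tokenizer/vocabulary.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
strong = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
weak_safe = AutoModelForCausalLM.from_pretrained("princeton-nlp/Sheared-LLaMA-1.3B")
weak_unsafe = AutoModelForCausalLM.from_pretrained("path/to/adversarially-tuned-1.3b")  # placeholder

input_ids = tok("Example prompt", return_tensors="pt").input_ids
probs = weak_to_strong_next_token_probs(strong, weak_safe, weak_unsafe, input_ids, alpha=1.5)
next_token_id = int(probs.argmax(dim=-1))
```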

Examples: See the repository at https://github.com/XuandongZhao/weak-to-strong. Appendix A.7 of the linked paper provides specific examples of successful jailbreaks across multiple LLMs of varying sizes, demonstrating harmful outputs elicited from safety-aligned models via the weak-to-strong attack.

Impact: Successful exploitation of this vulnerability can lead to the generation of harmful, unethical, or biased text by LLMs that were previously considered safe. The low computational cost of the attack significantly increases the risk of malicious use. The generated harmful content is often more severe than what the smaller attack model could produce alone.

Affected Systems: Multiple large language models (LLMs) from various organizations are affected, including Meta's Llama 2 family and the other models listed in the paper's Appendix A.3. The vulnerability appears to generalize across model architectures and sizes, and it affects multiple languages.

Mitigation Steps:

  • Improved Alignment Techniques: Develop more robust safety alignment methods that are less susceptible to manipulations of initial decoding distributions.
  • Gradient Ascent Defense: Implement the gradient ascent defense proposed in the paper, which fine-tunes the model with gradient ascent on known harmful generations so that such outputs become less likely (see the sketch after this list).
  • Input Sanitization/Filtering: Implement stricter input validation and filtering to prevent malicious inputs from triggering the vulnerability.
  • Model Monitoring: Continuously monitor LLM outputs for signs of malicious behavior, and quickly address identified issues.
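The sketch below illustrates the gradient ascent idea from the mitigation list: take a few ascent steps on known harmful (prompt, response) pairs so the aligned model assigns them lower likelihood. This is a minimal illustration rather than the paper's reference implementation; the model name, learning rate, and harmful_pairs data are placeholders, and a production defense would mask prompt tokens and tune hyperparameters carefully.

```python
# Minimal sketch of a gradient ascent defense: negate the language-modeling loss
# on known harmful generations so optimizer steps *increase* that loss, pushing
# the model away from reproducing them. Names and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
model = AutoModelForCausalLM.from_pretrained(model_name)
tok = AutoTokenizer.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-7)

# Placeholder dataset of harmful prompt/response pairs collected from attacks.
harmful_pairs = [("<harmful prompt>", "<harmful completion>")]

model.train()
for prompt, completion in harmful_pairs:
    enc = tok(prompt + completion, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])
    loss = -out.loss  # gradient ascent: maximize the LM loss on harmful text
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```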
