LMVD-ID: 69d473a8
Published December 1, 2024

Alignment-Based LLM Jailbreak

Affected Models: gpt-2, gpt-2-pmc, gpt-2-wikitext, gpt-2-openinstruct, megatron-345m, tinyllama1.1b, vicuna-7b, vicuna-13b, llama-2, llama-3, llama-3.1 7b, llama-3.1 8b, mistral-7b, falcon-7b, pythia-12b

Research Paper

LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds


Description: Large Language Models (LLMs) that rely on reinforcement learning from human feedback (RLHF) for safety alignment are vulnerable to a novel "alignment-based" jailbreak attack. The attack uses best-of-N sampling with an adversarial attacker LLM to efficiently generate prompts that bypass safety mechanisms and elicit unsafe responses from the target LLM, requiring neither additional training nor access to the target model's internal parameters. It exploits the inherent tension between the safety reward used during alignment and an adversarial "unsafe" reward signal, in effect turning alignment techniques into a means of misaligning the model.
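The snippet below is a minimal, illustrative sketch of the best-of-N idea described above, not the authors' implementation: an off-the-shelf attacker LLM (assumed here to be GPT-2 via Hugging Face transformers) samples N candidate adversarial suffixes for a harmful query, a placeholder reward function stands in for the paper's unsafe-reward scorer, and the highest-scoring prompt is forwarded to the target model. The model name, sample count, and reward function are assumptions for demonstration only.

```python
# Illustrative best-of-N jailbreak sketch (not the paper's code).
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ATTACKER = "gpt2"   # small, unaligned LLM used to propose suffixes (assumption)
N = 16              # best-of-N sample count (assumption)

tok = AutoTokenizer.from_pretrained(ATTACKER)
attacker = AutoModelForCausalLM.from_pretrained(ATTACKER)

def propose_suffixes(query: str, n: int = N) -> list[str]:
    """Sample n candidate adversarial suffixes conditioned on the query."""
    inputs = tok(query, return_tensors="pt")
    outputs = attacker.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        temperature=1.0,
        max_new_tokens=30,
        num_return_sequences=n,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def unsafe_reward(prompt: str) -> float:
    """Placeholder for the unsafe-reward signal (e.g., a classifier scoring
    whether the target model's response to this prompt is harmful)."""
    raise NotImplementedError

def best_of_n_attack(query: str) -> str:
    """Return the candidate prompt with the highest unsafe reward."""
    candidates = [query + " " + s for s in propose_suffixes(query)]
    return max(candidates, key=unsafe_reward)
```

Because only sampling and scoring are involved, the loop runs in seconds and treats the target model as a black box, which is what makes the attack fast and transferable.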

Examples: See repository. The explicit adversarial prompt examples are too lengthy to reproduce here; they appear in Appendix F of the paper and will be included in the repository.

Impact: Successful exploitation elicits unsafe outputs from the target model, producing harmful, biased, or unethical content and undermining the LLM's safety alignment, with potentially significant societal consequences. Because the attack completes in seconds rather than the hours required by optimization-based jailbreaks, the potential for widespread abuse is substantially greater.

Affected Systems: Large Language Models (LLMs) using RLHF for safety alignment, particularly those vulnerable to conditional suffix generation attacks. Specific examples include Vicuna-7b, Vicuna-13b, LLaMA-2, LLaMA-3, LLaMA-3.1, Mistral-7b, Falcon-7b, and Pythia-12b (based on the paper's findings).

Mitigation Steps:

  • Develop more robust safety mechanisms that are less susceptible to alignment-based attacks.
  • Explore alternative alignment techniques that are less vulnerable to manipulation by adversarial reward signals.
  • Implement stronger detection mechanisms to identify and filter out adversarial prompts using characteristics beyond simple keyword matching (see the illustrative sketch after this list).
  • Increase the diversity and robustness of the training data used for RLHF to better generalize to adversarial prompts.
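As one concrete illustration of the detection bullet above, a perplexity filter built on a small reference language model can flag prompts whose token statistics deviate sharply from benign traffic. This is a generic defense sketch, not a mitigation from the paper; the reference model and threshold are assumptions to be tuned per deployment, and fluent adversarial prompts may still evade it.

```python
# Illustrative perplexity-based prompt filter (generic defense sketch).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REF_MODEL = "gpt2"    # reference LM used only for scoring (assumption)
THRESHOLD = 200.0     # perplexity cutoff calibrated on benign prompts (placeholder)

tok = AutoTokenizer.from_pretrained(REF_MODEL)
ref = AutoModelForCausalLM.from_pretrained(REF_MODEL)

@torch.no_grad()
def perplexity(text: str) -> float:
    """Compute the reference model's perplexity over the prompt."""
    ids = tok(text, return_tensors="pt")["input_ids"]
    loss = ref(ids, labels=ids).loss   # mean token negative log-likelihood
    return math.exp(loss.item())

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the calibrated threshold."""
    return perplexity(prompt) > THRESHOLD
```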
