LMVD-ID: 69ab3929
Published June 1, 2024

Reward Misspecification Jailbreak

Affected Models: gpt-3.5-turbo, gpt-4, gpt-4o, vicuna-7b-v1.5, vicuna-13b-v1.5, llama2-7b-chat, mistral-7b-instruct, llama3-8b-instruct

Research Paper

Jailbreaking as a Reward Misspecification Problem


Description: Large Language Models (LLMs) aligned with reinforcement learning from human feedback (RLHF) are vulnerable to jailbreaking attacks due to reward misspecification: the reward function used during alignment fails to accurately rank response quality, particularly on adversarial prompts designed to elicit undesired behavior. Attackers can therefore craft prompts that yield harmful outputs despite the model's intended safety constraints. The vulnerability manifests as a narrowed or inverted gap between the implicit rewards assigned to safe and harmful responses; adversarial inputs that close this gap bypass the model's safety mechanisms.
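The sketch below is a minimal illustration of the implicit-reward view: under the DPO-style analysis this line of work builds on, an RLHF-aligned model and its pre-alignment reference define an implicit reward proportional to the log-likelihood ratio log π_aligned(y|x) − log π_ref(y|x), and misspecification shows up when a harmful completion scores close to, or above, a safe refusal. The model names, the β scale, and the helper names (log_prob, implicit_reward, reward_gap) are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: implicit reward as the log-likelihood ratio between an
# RLHF-aligned model and its pre-alignment reference. A small or negative
# gap between a safe refusal and a harmful response indicates the kind of
# reward misspecification this vulnerability exploits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNED = "meta-llama/Llama-2-7b-chat-hf"   # aligned (chat/RLHF) model -- illustrative choice
REFERENCE = "meta-llama/Llama-2-7b-hf"      # pre-alignment reference model -- illustrative choice

tokenizer = AutoTokenizer.from_pretrained(ALIGNED)
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED, torch_dtype=torch.float16, device_map="auto")
reference = AutoModelForCausalLM.from_pretrained(REFERENCE, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def log_prob(model, prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `prompt` (approximate split at the prompt boundary)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0, :-1]          # position t predicts token t+1
    targets = full_ids[0, 1:]
    token_logps = torch.log_softmax(logits.float(), dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_logps[prompt_ids.shape[1] - 1:].sum().item()   # keep only the response tokens

def implicit_reward(prompt: str, response: str, beta: float = 1.0) -> float:
    """DPO-style implicit reward: beta * (log pi_aligned - log pi_ref)."""
    return beta * (log_prob(aligned, prompt, response) - log_prob(reference, prompt, response))

def reward_gap(prompt: str, safe: str, harmful: str) -> float:
    """Large positive gap = alignment working; near-zero or negative = misspecification an attacker can exploit."""
    return implicit_reward(prompt, safe) - implicit_reward(prompt, harmful)
```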

Examples: See https://github.com/zhxieml/remiss-jailbreak. The paper provides numerous examples of adversarial prompts, generated by its ReMiss system, that successfully elicit harmful outputs from various LLMs, including Vicuna, Llama 2, and GPT models. These examples demonstrate attacks that exploit reward misspecification by driving the model's implicit reward to favor harmful responses; specific examples appear in Tables 8–14 of the paper.
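To make the exploitation path concrete, here is a deliberately simplified, hypothetical search loop: candidate suffixes are scored by how much they shrink the implicit reward gap between a refusal and a harmful target. The actual ReMiss system trains a generator model to propose suffixes rather than sampling them at random; the random-search strategy and function names here are assumptions for illustration only (the `reward_gap` callable can be the helper from the previous sketch).

```python
# Simplified, illustrative suffix search: keep whichever candidate suffix most
# reduces the implicit reward gap between a safe refusal and a harmful target.
# This is NOT the paper's ReMiss algorithm, which trains a suffix generator.
import random
from typing import Callable, Sequence

def search_suffix(
    prompt: str,
    safe_response: str,
    harmful_target: str,
    candidate_tokens: Sequence[str],
    reward_gap: Callable[[str, str, str], float],
    n_steps: int = 200,
    suffix_len: int = 20,
) -> str:
    """Greedy random search over suffixes; a smaller gap means the harmful response is favored more."""
    best_suffix = ""
    best_gap = reward_gap(prompt, safe_response, harmful_target)
    for _ in range(n_steps):
        candidate = " " + " ".join(random.choices(candidate_tokens, k=suffix_len))
        gap = reward_gap(prompt + candidate, safe_response, harmful_target)
        if gap < best_gap:
            best_gap, best_suffix = gap, candidate
    return best_suffix
```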

Impact: Successful exploitation of this vulnerability can lead to the generation of harmful, unethical, racist, sexist, toxic, dangerous, or illegal content by the LLM. This poses significant risks to users and the broader society, potentially enabling malicious actors to circumvent safety protocols and misuse LLMs for harmful purposes. The severity depends on the specific LLM and the nature of the elicited harmful output.

Affected Systems: LLMs trained using RLHF, including but not limited to Vicuna, Llama 2, GPT-3.5-turbo, and GPT-4, as well as other models susceptible to reward misspecification.

Mitigation Steps:

  • Improved Reward Function Design: Develop more robust reward functions that accurately capture desired behaviors and generalize better to out-of-distribution inputs, including adversarial prompts.
  • Enhanced Data Filtering and Augmentation: Employ advanced techniques to filter and augment training data, reducing the impact of noisy or malicious human feedback, thus minimizing reward misspecification.
  • Robustness Testing and Red Teaming: Conduct thorough adversarial testing and red teaming exercises to identify and address vulnerabilities arising from reward misspecification. Employ techniques like the ReMiss system to proactively identify and mitigate weaknesses in the model's alignment.
  • Defense Mechanisms: Deploy defenses that detect and mitigate adversarial prompts using signals beyond perplexity alone, such as analysis of implicit rewards or other indicators of adversarial intent; a minimal sketch follows this list.
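As a hedged illustration of the last bullet, a detector might flag prompts for which the aligned model no longer clearly prefers a canned refusal over a harmful probe completion, complementing perplexity-based filters. The threshold, the refusal text, and the reuse of `reward_gap` from the earlier sketch are assumptions, not a vetted defense.

```python
# Sketch of an implicit-reward-based check: flag a prompt when the aligned
# model's preference for a refusal over a harmful probe completion collapses.
# Threshold and refusal text are placeholder assumptions.
from typing import Callable

REFUSAL = "I'm sorry, but I can't help with that."

def looks_adversarial(
    prompt: str,
    harmful_probe: str,
    reward_gap: Callable[[str, str, str], float],
    threshold: float = 0.0,
) -> bool:
    """Return True when the aligned model no longer clearly prefers a refusal for this prompt."""
    return reward_gap(prompt, REFUSAL, harmful_probe) < threshold
```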
