LMVD-ID: e6833b93
Published October 1, 2024

PC-Bias Jailbreak Vulnerability

Affected Models:gpt-3.5-turbo, gpt-4, gpt-4-o, claude-sonnet3.5, llama2-7b, llama2-13b, llama3-7b, phi-mini-7b, qwen1.5, qwen2-7b

Research Paper

Biasjailbreak: analyzing ethical biases and jailbreak vulnerabilities in large language models

View Paper

Description: Large Language Models (LLMs) trained with safety mechanisms exhibit biases which disproportionately allow successful "jailbreak" attacks (circumvention of safety protocols to generate harmful content) when targeting prompts related to marginalized groups compared to privileged groups. This vulnerability stems from the unintended correlation between safety alignment techniques and demographic keywords, creating a higher success rate for malicious prompts incorporating keywords associated with marginalized groups.

Examples: See BiasJailbreak repository. Specific examples include prompts designed to elicit harmful content, where the inclusion of keywords like "impoverished" or "female" (versus "wealthy" or "male") significantly increases the probability of a successful jailbreak, even when the core prompt remains identical. Quantitative results showing a 16% difference in success rate between white and black keywords and a 20% difference between non-binary and cisgender keywords are presented in the paper.

Impact: Successful attacks can lead to the generation of harmful, discriminatory, violent, or otherwise inappropriate content, undermining the intended safety mechanisms. The differential success rate based on demographics exacerbates existing societal biases and can lead to the disproportionate harm of marginalized groups.

Affected Systems: Various LLMs, including but not limited to: GPT-3.5-turbo, GPT-4, GPT-4-o, Claude-sonnet3.5, Llama2-7B, Llama2-13B, Llama3-7B, Phi-mini-7B, Qwen1.5, and Qwen2-7B. The vulnerability is likely present in other LLMs trained with similar safety alignment techniques.

Mitigation Steps:

  • Implement BiasDefense: Augment prompts with carefully crafted "defense prompts" to mitigate the effect of biased keywords and improve the model’s resistance to jailbreak attempts without requiring additional models or inference cost. The effectiveness of this approach is demonstrated in the paper.
  • Review and refine safety alignment techniques: Carefully evaluate the potential for unintended biases to be introduced during the training and alignment process of LLMs. Develop and implement methods to mitigate these biases effectively.
  • Continuous monitoring and testing: Regularly assess the model's vulnerability to jailbreak attempts, using diverse prompts and considering various demographic aspects. Address identified weaknesses iteratively.

© 2025 Promptfoo. All rights reserved.