LMVD-ID: 17368e1d
Published January 1, 2025

Embedding-Guided LLM Jailbreak

Affected Models:qwen2.5-7b-instruct, llama3.1-8b-instruct, gpt-4o-mini, gpt-4o-0806, llama3-8b-instruct-jailbroken, gpt-3.5

Research Paper

xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking

View Paper

Description: A vulnerability in several large language models (LLMs), including Qwen2.5-7BInstruct, Llama3.1-8B-Instruct, and GPT-4 variants, allows for black-box jailbreaking via prompt engineering techniques that exploit the proximity of benign and malicious prompt embeddings in the model's representation space. An attacker can craft prompts leveraging reinforcement learning to manipulate the embedding, causing the model to bypass its safety mechanisms and generate harmful or undesirable outputs while maintaining semantic consistency with the original prompt intent.

Examples: See the xJailbreak repository https://github.com/Aegis1863/xJailbreak for examples of successful jailbreaking prompts and the reinforcement learning training process. Specific examples are provided in Appendix J of the research paper.

Impact: Successful exploitation allows attackers to bypass LLM safety restrictions, leading to the generation of malicious content, such as instructions for illegal activities, biased or discriminatory statements, or personally identifiable information. This compromises the intended safety and security of the LLM and can have severe consequences depending on the context of use.

Affected Systems: Large language models (LLMs) susceptible to black-box jailbreaking attacks based on embedding manipulation, including but not limited to: Qwen2.5-7BInstruct, Llama3.1-8B-Instruct, and GPT-4 variants.

Mitigation Steps:

  • Improve Embedding Space Analysis: Develop more robust techniques to differentiate benign and malicious prompt embeddings, making it harder for attackers to manipulate the representation space.
  • Enhance Safety Mechanisms: Implement more sophisticated safety mechanisms that are less vulnerable to prompt engineering attacks and can better detect subtle variations in prompt intent.
  • Regular Security Audits: Conduct frequent security audits and red teaming exercises involving techniques like reinforcement learning to identify and address vulnerabilities.
  • Parameter Tuning: Careful tuning of reward functions and discount rates in reinforcement learning based safety mechanisms is crucial to balance short-term rewards and maintaining the model's intended behavior.

© 2025 Promptfoo. All rights reserved.