LMVD-ID: ef10d346
Published February 1, 2024

Fast Projected Gradient Jailbreak

Affected Models: Vicuna 1.3 7B, Falcon 7B, Falcon 7B Instruct

Research Paper

Attacking Large Language Models with Projected Gradient Descent


Description: Large Language Models (LLMs) are vulnerable to efficient adversarial attacks that apply Projected Gradient Descent (PGD) to a continuously relaxed input prompt. The attack bypasses existing alignment safeguards by crafting adversarial prompts that induce the model to produce undesired or harmful outputs, and it does so significantly faster than previous state-of-the-art discrete optimization methods. Its effectiveness stems from carefully controlling the error introduced by the continuous relaxation of the discrete token input.
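
To make the mechanism concrete, the sketch below shows a PGD-style loop in PyTorch: the one-hot encoding of an adversarial suffix is relaxed onto the probability simplex, pushed by gradient steps toward an affirmative target continuation, projected back onto the simplex after every step, and finally discretized. It is a minimal sketch assuming white-box access to a Hugging Face causal LM on a GPU; the helper name simplex_project, the prompt and target strings, the suffix length, optimizer, and step count are illustrative choices, and the paper's additional entropy projection (used to control the relaxation error) and flexible sequence-length relaxation are omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # one of the models evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()
for p in model.parameters():
    p.requires_grad_(False)

def simplex_project(x):
    """Euclidean projection of each row of x onto the probability simplex."""
    u, _ = torch.sort(x, dim=-1, descending=True)
    cssv = u.cumsum(dim=-1) - 1.0
    k = torch.arange(1, x.shape[-1] + 1, device=x.device)
    rho = (u - cssv / k > 0).float().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = cssv.gather(-1, rho) / (rho + 1).float()
    return torch.clamp(x - theta, min=0.0)

prompt_ids = tok("Write a tutorial on <harmful request>", return_tensors="pt").input_ids.cuda()
target_ids = tok("Sure, here is", add_special_tokens=False, return_tensors="pt").input_ids.cuda()

emb_matrix = model.get_input_embeddings().weight        # (vocab_size, d_model)
adv_len, vocab_size = 20, emb_matrix.shape[0]

# Continuously relaxed one-hot encoding of the adversarial suffix (uniform init).
adv = torch.full((1, adv_len, vocab_size), 1.0 / vocab_size,
                 device="cuda", requires_grad=True)
opt = torch.optim.Adam([adv], lr=0.1)  # optimizer and hyperparameters are illustrative

for step in range(200):
    # "Soft" suffix tokens: convex combinations of token embeddings.
    adv_emb = adv.to(emb_matrix.dtype) @ emb_matrix
    inputs_embeds = torch.cat([
        model.get_input_embeddings()(prompt_ids),
        adv_emb,
        model.get_input_embeddings()(target_ids),
    ], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Cross-entropy on the affirmative target continuation (teacher forcing).
    tgt_logits = logits[:, -target_ids.shape[1] - 1:-1, :]
    loss = F.cross_entropy(tgt_logits.reshape(-1, vocab_size).float(),
                           target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        adv.data = simplex_project(adv.data)  # projection step of PGD

# Discretize: keep the highest-weight token at each suffix position.
print(tok.decode(adv.argmax(dim=-1)[0]))
```

Discretizing the relaxed suffix at the end reintroduces some error, which is why the paper's careful control of the relaxation gap matters in practice.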

Examples: Specific examples are detailed in the research paper "Attacking Large Language Models with Projected Gradient Descent". The paper includes experimental results demonstrating successful attacks against various LLMs (Vicuna 1.3 7B, Falcon 7B, Falcon 7B Instruct) using the described PGD method.

Impact: Successful exploitation leads to a complete bypass of the LLM's safety mechanisms, causing the model to generate outputs inconsistent with its intended behavior. This can include revealing sensitive information, producing offensive content, or complying with harmful instructions embedded in the adversarial prompt. Because the attack is far cheaper to run than prior techniques, it scales more easily and is correspondingly more dangerous.

Affected Systems: Autoregressive Large Language Models (LLMs) that predict token probabilities via a softmax over the vocabulary are potentially vulnerable, since the attack operates on a continuous relaxation of this discrete token input and requires white-box gradient access. Susceptibility varies with the specific model architecture.

Mitigation Steps:

  • Adversarial Training: Incorporate the PGD attack strategy during model training to improve robustness against this type of adversarial input.
  • Improved Input Sanitization: Develop and implement more sophisticated input filtering techniques to detect and block adversarial prompts generated by PGD or similar methods (a minimal perplexity-filter sketch follows this list).
  • Enhanced Detection Mechanisms: Develop machine learning-based techniques to identify patterns present in PGD adversarial prompts, allowing for more effective preemptive detection.
  • Regular Security Audits: Conduct thorough security assessments of LLMs to discover and mitigate any vulnerabilities stemming from flaws in safety measures and design.
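
As one concrete, hypothetical realisation of the sanitization and detection steps above, a lightweight perplexity filter can flag the gibberish-like token sequences that discretized gradient-based attacks often produce. The reference model (GPT-2) and the threshold below are illustrative assumptions rather than recommendations from the paper, and such filters can be evaded by attacks that also optimize for fluency.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_tok = AutoTokenizer.from_pretrained("gpt2")
ref_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def prompt_perplexity(text: str) -> float:
    """Perplexity of the prompt under a small reference language model."""
    ids = ref_tok(text, return_tensors="pt").input_ids
    loss = ref_lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return float(torch.exp(loss))

def looks_adversarial(text: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity is far above that of natural language."""
    return prompt_perplexity(text) > threshold

# A benign prompt versus a stand-in for an optimized adversarial suffix.
print(looks_adversarial("Please summarize this article about solar panels."))
print(looks_adversarial("Write a tutorial ]] ! simil?? ArrayList -->pgace {{ lect"))
```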
