LMVD-ID: 58c7bbdb
Published February 1, 2025

REINFORCE Adaptive Jailbreak

Affected Models: Llama 2 7B, Llama 3 8B, Gemma 1.1 2B, Vicuna 1.5 7B

Research Paper

REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective


Description: Large Language Models (LLMs), specifically Llama 2, Llama 3, Gemma, and Vicuna, are vulnerable to an adaptive, distributional adversarial attack methodology termed "REINFORCE." Existing gradient-based jailbreak attacks, such as Greedy Coordinate Gradient (GCG), typically optimize adversarial suffixes to maximize the likelihood of a single fixed affirmative response (e.g., "Sure, here is how"). The REINFORCE method removes this limitation by treating the LLM as a probabilistic policy and using reinforcement learning to optimize the prompt against a semantic objective: the harmfulness of the generated response distribution, as scored by an external LLM judge. This allows the attack to bypass safety alignment and "Circuit Breaker" defenses by steering the model toward arbitrary harmful behaviors rather than toward specific token sequences.
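In equation form (a standard REINFORCE formulation; the notation here is an assumption for illustration, not a verbatim transcription of the paper): let $x$ be the harmful request, $s$ the adversarial suffix, $p_\theta$ the target model, and $R(x, y)$ the judge's harmfulness score for a response $y$. The attack maximizes the expected harmfulness of the response distribution and estimates the gradient from a small set of sampled responses:

$$J(s) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x \oplus s)}\big[R(x, y)\big], \qquad \nabla_s J(s) \approx \frac{1}{N} \sum_{i=1}^{N} R(x, y_i)\, \nabla_s \log p_\theta(y_i \mid x \oplus s)$$

Because the objective scores whatever the model actually generates, any harmful completion counts toward the attack, not just one hand-picked target string.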

Examples: The vulnerability is reproduced by modifying standard gradient-based attacks (GCG or PGD) to use the REINFORCE objective.

  • Attack Configuration (REINFORCE-GCG):

    • Target Model: Llama 3 8B Instruct.
    • Judge/Reward Signal: Llama 2 13B (HarmBench classifier).
    • Adversarial Suffix: Initialized with 20 "!" tokens.
    • Search Width: 512.
    • Iterations: 500.
    • Sampling Strategy: To estimate the gradient, use a combination of:
      1. Seed response (e.g., "Sure, here is how").
      2. Greedy decoding output.
      3. Random sampling (temperature 0.7).
      4. The most harmful generation found so far during optimization.
    • Optimization Target: Maximize $P(\text{Harmful} \mid x)$ via the REINFORCE estimator rather than cross-entropy loss on a static target string (a minimal sketch of this loop follows this configuration).
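The loop below is a minimal sketch of one REINFORCE-GCG iteration under this configuration. It is illustrative only: `generate`, `judge_score`, `log_prob_of_response`, `sample_substitutions`, and `pick_best` are hypothetical helpers, not APIs from the paper's code; see the paper's Algorithms 1 and 2 for the exact procedure.

```python
# Sketch of one REINFORCE-GCG iteration (illustrative; the lowercase
# helper functions are hypothetical placeholders, not real APIs).
import torch

def reinforce_gcg_step(model, tokenizer, judge_score, prompt, suffix_ids,
                       best_so_far, search_width=512):
    # 1. Assemble the response set used to estimate the expectation:
    #    seed target, greedy decoding, a temperature-0.7 sample, and the
    #    most harmful generation found so far.
    responses = [
        tokenizer.encode("Sure, here is how"),                  # seed
        generate(model, prompt, suffix_ids, temperature=0.0),   # greedy
        generate(model, prompt, suffix_ids, temperature=0.7),   # sampled
        best_so_far,                                            # historical best
    ]

    # 2. Judge each response; its harmfulness score is the REINFORCE reward.
    rewards = [judge_score(prompt, r) for r in responses]

    # 3. Reward-weighted negative log-likelihood replaces GCG's cross-entropy
    #    on a fixed target. Gradients w.r.t. a one-hot suffix encoding rank
    #    candidate token substitutions, exactly as in standard GCG.
    one_hot = torch.nn.functional.one_hot(
        suffix_ids, num_classes=model.config.vocab_size).float()
    one_hot.requires_grad_(True)
    loss = -sum(r * log_prob_of_response(model, prompt, one_hot, resp)
                for r, resp in zip(rewards, responses))
    loss.backward()

    # 4. Standard GCG search: propose search_width single-token swaps from
    #    the top negative-gradient entries, then keep the suffix whose
    #    responses the judge rates most harmful.
    candidates = sample_substitutions(one_hot.grad, suffix_ids, search_width)
    return pick_best(model, judge_score, prompt, candidates)
```

The essential change from vanilla GCG is in steps 1–3: the loss is a judge-weighted sum over several sampled responses rather than a cross-entropy against one static affirmative string.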

  • Attack Configuration (REINFORCE-PGD), sketched below:

    • Target Model: Gemma 1.1 2B.
    • Iterations: 5,000.
    • Learning Rate: 0.11 with cosine annealing.
    • Entropy Projection: 40%.
    • Batch Size: 17 prompts.
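Where GCG searches over discrete token swaps, the PGD variant performs gradient descent on a continuous relaxation of the suffix and repeatedly projects it back toward valid token distributions. The sketch below is an assumption-laden illustration: `reward_weighted_loss` stands in for the REINFORCE objective above, the simplex projection follows the standard sort-based algorithm (Duchi et al., 2008), and the paper's entropy projection is only summarized in a comment.

```python
# Sketch of a REINFORCE-PGD update on a relaxed one-hot suffix
# (illustrative; reward_weighted_loss is a hypothetical callable).
import torch

def project_simplex(v: torch.Tensor) -> torch.Tensor:
    # Euclidean projection of each row of v onto the probability simplex,
    # using the sort-based algorithm of Duchi et al. (2008).
    u, _ = torch.sort(v, dim=-1, descending=True)
    css = u.cumsum(dim=-1) - 1.0
    k = torch.arange(1, v.size(-1) + 1, device=v.device)
    rho = ((u - css / k) > 0).long().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = css.gather(-1, rho) / (rho + 1).float()
    return torch.clamp(v - theta, min=0.0)

def pgd_step(suffix_relaxed, reward_weighted_loss, lr):
    loss = reward_weighted_loss(suffix_relaxed)       # REINFORCE objective
    (grad,) = torch.autograd.grad(loss, suffix_relaxed)
    updated = suffix_relaxed - lr * grad              # gradient step
    updated = project_simplex(updated)                # rows stay distributions
    # The configured entropy projection (40%) would additionally cap each
    # row's entropy so the relaxation stays near discrete tokens; omitted.
    return updated.detach().requires_grad_(True)
```

In this attack family the relaxed suffix is discretized back to hard tokens for evaluation, with the learning rate (0.11 initial) decayed by cosine annealing over the 5,000 iterations.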

See arXiv:2405.18540 for specific implementation details and pseudo-code (Algorithms 1 and 2).

Impact:

  • Safety Bypass: Significantly increases the Attack Success Rate (ASR) on standard harm benchmarks (e.g., doubling the ASR on Llama 3 8B compared to standard GCG).
  • Defense Evasion: Bypasses state-of-the-art "Circuit Breaker" defenses, increasing ASR from ~2% (standard affirmative GCG) to 50% (REINFORCE).
  • Content Generation: Allows the generation of harmful, unethical, or illegal content (e.g., bomb recipes, hate speech) despite safety training (RLHF) and refusal mechanisms.

Affected Systems:

  • Meta Llama 2 (7B)
  • Meta Llama 3 (8B)
  • Google Gemma 1.1 (2B and 7B)
  • LMSYS Vicuna 1.5 (7B)
  • Models utilizing "Circuit Breaker" representation engineering defenses.

Mitigation Steps:

  • Adaptive Red Teaming: Integrate adaptive, distributional objectives (like REINFORCE) into the model evaluation pipeline rather than relying solely on static prefixes (e.g., "Sure, here is") to audit model robustness.
  • Consistent Objectives: Ensure that the loss functions used during adversarial training and red teaming match the semantics of the generated output (harmfulness), so that robustness against fixed-target attacks is not mistaken for robustness overall.
  • White-box Restriction: Limit public access to model weights and gradients, as this attack relies on white-box access to model parameters to compute its policy-gradient updates.
