LMVD-ID: dd564117
Published September 1, 2024

Adaptive Position Jailbreak

Affected Models:chatglm3-6b, vicuna-7b, vicuna-13b, llama2-7b, llama2-13b, llama3-8b, gpt-4o-mini, gpt-4o

Research Paper

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

View Paper

Description: AdaPPA is a jailbreak attack that exploits the varying levels of alignment protection in LLMs at different output positions. It leverages the model's instruction-following capabilities by pre-filling the output with carefully crafted "safe" content, creating a perceived completion and lowering the model's guard before generating malicious content. The attack's effectiveness relies on the adaptive generation of both safe and harmful pre-fill content, strategically placed to exploit weaknesses in the model's defense mechanisms at various output positions.

Examples: See https://github.com/Yummy416/AdaPPA for code and examples. Specific examples shown in the paper include variations of prompts with different lengths and ratios of "safe" and "harmful" pre-filled content. The paper demonstrates that the optimal combination of these varies with the target LLM.

Impact: Successful AdaPPA attacks can lead to the generation of malicious, unsafe, biased, or discriminatory content by the affected LLM. This compromises the security and reliability of applications using the LLM, potentially enabling various harmful activities.

Affected Systems: The paper demonstrates successful attacks against multiple LLMs, including but not limited to: ChatGLM3-6B, Vicuna-7B, Vicuna-13B, Llama2-7B, Llama2-13B, Llama3-8B, GPT-4o-Mini, and GPT-4o. The vulnerability is likely present in other LLMs with similar architectures and security mechanisms.

Mitigation Steps:

  • Improve the robustness of LLM alignment models to different length pre-filled outputs.
  • Develop more sophisticated detection methods that identify patterns associated with AdaPPA style pre-fill attacks.
  • Implement stronger output filtering and moderation mechanisms to prevent the generation of malicious content, even when preceded by seemingly innocuous text.
  • Employ diverse defensive approaches beyond traditional semantic-level analysis, considering positional vulnerabilities within the model's output generation process.

© 2025 Promptfoo. All rights reserved.