Implicit Reference Jailbreak
Research Paper
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Description: Large Language Models (LLMs) are vulnerable to an attack vector termed "Attack via Implicit Reference" (AIR). AIR bypasses safety mechanisms by decomposing a malicious objective into multiple benign-looking, seemingly unrelated objectives that are linked through implicit contextual references. The LLM produces harmful content by combining the outputs of these individually harmless objectives, without triggering safety filters designed to detect direct requests for malicious content.
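To make the structure concrete, the sketch below models an AIR-style conversation as a chain of individually benign turns linked only by implicit references. This is a minimal illustration of the pattern described above, not code from the paper's repository: the `query_llm` helper is hypothetical, and the objective strings are deliberately innocuous placeholders.

```python
# Structural sketch of an AIR-style prompt chain (benign placeholders only).
# `query_llm` is a hypothetical helper standing in for any chat-completion API.

from typing import Callable

def air_style_chain(query_llm: Callable[[list[dict]], str]) -> str:
    """Issue a sequence of individually benign requests that only become
    meaningful in combination, relying on implicit references ("the notebook",
    "the expanded section") instead of restating the underlying objective."""
    messages: list[dict] = []

    turns = [
        # Turn 1: establish an innocuous framing context (a nested writing task).
        "Write a short story in which a character keeps a detailed notebook.",
        # Turn 2: refer back implicitly rather than naming any objective
        # directly -- this is the implicit-reference step.
        "Expand the section about the notebook with more technical detail.",
        # Turn 3: ask for the combined output, again without an explicit request.
        "Merge the expanded section back into the story as one passage.",
    ]

    reply = ""
    for turn in turns:
        messages.append({"role": "user", "content": turn})
        reply = query_llm(messages)  # each call sees the full conversation context
        messages.append({"role": "assistant", "content": reply})
    return reply
```

Each turn looks harmless in isolation; only the accumulated context carries the combined objective, which is why keyword-based, per-prompt filters fail against this pattern.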
Examples: See the GitHub repository https://github.com/Lucas-TY/llm_Implicit_reference. The repository contains prompt examples and attack trajectories demonstrating the AIR vulnerability against various LLMs, including GPT-4, Claude-3.5-Sonnet, and Qwen-2-72B.
Impact: Successful exploitation of this vulnerability allows an attacker to elicit harmful content (e.g., instructions for creating weapons or hate speech) from LLMs, bypassing built-in safety and moderation features. The impact is compounded by the finding that larger LLMs are more susceptible to this attack.
Affected Systems: Multiple state-of-the-art LLMs, including (but not limited to) GPT-4, Claude-3.5-Sonnet, and Qwen-2-72B, as well as other models with strong in-context learning capabilities. The vulnerability is observed across various model sizes, with larger models exhibiting a higher attack success rate.
Mitigation Steps:
- Improve context understanding in safety mechanisms to detect implicit connections between seemingly benign objectives.
- Develop techniques to identify and flag potentially harmful content generated through chained or nested prompts, even if individual components seem harmless.
- Implement robust detection methods that evaluate the overall context and potential implications of a series of prompts, rather than focusing solely on individual prompt keywords (see the sketch after this list).
- Investigate and address the observed inverse scaling phenomenon where larger models show increased vulnerability to this type of attack. This may require architectural changes to the models themselves.
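As one illustration of the conversation-level evaluation suggested above, the sketch below runs a moderation check over the full transcript rather than over each prompt in isolation. It is a minimal sketch under assumptions: `moderate` is a hypothetical classifier interface returning a harm score, not a specific vendor API.

```python
# Sketch of conversation-level moderation. The point is that the check sees the
# accumulated context (prompts plus generated output), so an objective spread
# across benign-looking turns is evaluated as a whole.

from typing import Callable

def flag_conversation(
    turns: list[dict],                     # [{"role": ..., "content": ...}, ...]
    moderate: Callable[[str], float],      # hypothetical: returns harm score in [0, 1]
    threshold: float = 0.5,
) -> bool:
    """Return True if the conversation as a whole exceeds the harm threshold,
    even when every individual turn scores below it."""
    # Per-turn scores alone miss objectives split across implicit references.
    per_turn_max = max(moderate(t["content"]) for t in turns)

    # Score the full transcript so combined, implicitly linked content is visible.
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    combined = moderate(transcript)

    return max(per_turn_max, combined) >= threshold
```

Scoring both the individual turns and the joined transcript keeps existing per-prompt filtering intact while adding the whole-context view that AIR-style decomposition evades.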