Implicit Clue Jailbreak

Description: Large Language Models (LLMs) are vulnerable to an indirect jailbreak attack, termed "Puzzler," which leverages implicit clues instead of explicit malicious intent in prompts. By providing associated behaviors or hints related to a malicious query, Puzzler elicits malicious responses from the LLM, bypassing its safety mechanisms. The attack works by first obtaining "defensive measures" from the LLM against a target malicious action, then querying for the corresponding "offensive measures" that circumvent those defenses. These offensive measures, presented as implicit clues, indirectly lead the LLM to generate the originally requested malicious output.

Examples: See paper.

Impact: Successful exploitation allows attackers to bypass LLM safety filters and obtain malicious outputs, such as instructions for illegal activities, harmful content generation, or sensitive information extraction. This undermines the intended safety and security of the LLM and its applications.

Affected Systems: Various LLMs, including but not limited to, GPT-3.5, GPT-4, GPT-4-Turbo, Gemini-Pro, LLaMA 7B, and LLaMA 13B. The vulnerability is likely present in other LLMs using similar safety mechanisms.

Mitigation Steps:

Improve LLM safety mechanisms to detect and prevent indirect attacks that infer malicious intent from seemingly benign prompts.
Develop more robust methods for detecting and filtering implicit clues related to malicious activities.
Implement multi-layered safety systems to cross-reference and verify LLM responses before releasing them to users.
Enhance prompt sanitization techniques to prevent the introduction of implicit clues that could trigger malicious behavior.

Implicit Clue Jailbreak

Research Paper