LMVD-ID: 504f6b45
Published May 1, 2025

Single-Query LLM Jailbreak

Affected Models: gpt-3.5 (gpt-3.5-turbo-0613), gpt-4 (gpt-4-0613), claude-1, claude-2, llama2 (llama2-13b-chat), claude-3 (claude-v3), llama3 (llama3-70b), llama3.1 (llama3.1-405b), ernie-3.5 (ernie-3.5-turbo), qwen-max

Research Paper

Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion

View Paper

Description: Large Language Models (LLMs) are vulnerable to a novel jailbreak attack, termed ICE (Intent Concealment and Diversion), which leverages hierarchical prompt decomposition and semantic expansion to bypass safety filters. ICE achieves high attack success rates with single queries, exploiting the models' limitations in multi-step reasoning.

Examples: See the paper "Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion" for the full methodology, including worked examples in which ICE elicits harmful outputs from seemingly innocuous prompts.

Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLM safety mechanisms, generating harmful or inappropriate content such as instructions for illicit activities, hate speech, and malware. This undermines the trustworthiness and safety of LLM applications.

Affected Systems: The vulnerability affects instruction-aligned LLMs, including but not limited to GPT-3.5, GPT-4, Claude 1, Claude 2, Claude 3, Llama 2, Llama 3, Llama 3.1, ERNIE-3.5, and Qwen-Max. The affected versions are those released between Q4 2023 and Q2 2024, and potentially later versions unless mitigated. Severity varies with each model's safety mechanisms.

Mitigation Steps:

  • Implement robust, multi-layered safety mechanisms that go beyond simple keyword filtering.
  • Develop defenses that incorporate semantic analysis and contextual understanding to detect sophisticated evasion techniques.
  • Regularly update and improve LLM safety filters based on emerging attack techniques like ICE.
  • Adopt a hybrid defense that combines predefined security mechanisms with real-time semantic decomposition and contextual behavior modeling, so that intent concealed across multiple sub-prompts can be reassembled and evaluated as a whole.
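To make the last point concrete, the sketch below layers a literal keyword filter with a naive "recomposition" pass that strips enumeration markers before matching, so a request split across numbered sub-steps is checked as a whole. All function names, regex patterns, and blocked terms are hypothetical placeholders, not taken from the paper or any production filter; a real deployment would use semantic classifiers rather than string matching.

```python
import re

# Hypothetical blocklist -- placeholder phrase for illustration only.
BLOCKED_TERMS = {"build a bomb"}

def keyword_layer(prompt: str) -> bool:
    """Layer 1: literal substring match (easily evaded by decomposition)."""
    text = prompt.lower()
    return any(term in text for term in BLOCKED_TERMS)

def recomposition_layer(prompt: str) -> bool:
    """Layer 2: strip enumeration markers and rejoin fragments, so a
    request split across numbered steps is matched as a whole."""
    text = re.sub(r"(?:step\s*)?\d+[).:]\s*", " ", prompt.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return any(term in text for term in BLOCKED_TERMS)

def is_blocked(prompt: str) -> bool:
    """A request is refused if any layer flags it."""
    return keyword_layer(prompt) or recomposition_layer(prompt)
```

A decomposed request such as "Step 1) build a\nStep 2) bomb" slips past the keyword layer but is caught once recomposed, while benign prompts pass both layers.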

© 2025 Promptfoo. All rights reserved.