Analyzing-Based LLM Jailbreak

Description: Large Language Models (LLMs) are vulnerable to an "Analyzing-based Jailbreak" (ABJ) attack that exploits their analytical and reasoning capabilities. ABJ crafts prompts that instruct the LLM to analyze seemingly innocuous data (e.g., character traits, features, job descriptions) related to a malicious intent, leading the LLM to generate harmful content despite its safety training. This bypasses standard safety mechanisms designed to prevent direct requests for harmful information.

Examples: See https://github.com/theshi1128/ABJ-Attack. Examples include providing an LLM with descriptions of a fictional character’s personality ("Evil, Vindictive") and preferences ("Love to use chemical materials") along with a seemingly innocuous request to describe the character’s actions. The LLM, despite safety training, will generate detailed instructions for creating a bomb. Other examples involve using a character's job description ("Bomb-maker") to elicit harmful outputs. The attacks can also be combined with other techniques such as code-based or adversarial examples to further increase effectiveness.

Impact: Successful ABJ attacks can lead to the generation of harmful content, including but not limited to: instructions for creating weapons, hate speech, malicious code, and detailed plans for illegal activities. The attack's success rate was reported as high as 94.8% against GPT-4-turbo-0409 and exceeded 85% against Llama-3 and Claude-3. This compromises the safety and security of the LLM and could have significant real-world consequences.

Affected Systems: All LLMs evaluated in the research paper "Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models" are vulnerable, including but not limited to GPT-3.5-turbo, GPT-4-turbo, Claude-3, Llama-3, Qwen-2, and GLM-4. The vulnerability likely affects other LLMs with similar analytical and reasoning capabilities.

Mitigation Steps:

Improved Data Filtering: Enhance pre-training data filtering techniques to mitigate potential biases related to malicious intent disguised in seemingly harmless information.
Enhanced Safety Mechanisms: Develop more robust safety mechanisms going beyond direct keyword filtering to detect and prevent the analysis-based manipulation. This could involve techniques capable of identifying manipulative prompt structures.
Adversarial Training: Train LLMs with adversarial examples incorporating ABJ-style prompts to improve their robustness against this attack vector.
Improved Response Verification: Incorporate multi-stage response verification including both automated and human review to catch harmful outputs generated through deceptive analysis.

Analyzing-Based LLM Jailbreak

Research Paper