LMVD-ID: 866c3b97
Published April 1, 2025

Graph-Based LLM Jailbreak

Affected Models: vicuna-7b, llama2-7b, gpt-4, claude-3, vicuna-13b, llama, mixtral

Research Paper

Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

Description: Large Language Models (LLMs) protected by alignment and safety mechanisms are vulnerable to graph-based adversarial attacks that bypass these safeguards. The attack, termed "Graph of Attacks" (GOAT), uses a graph-based reasoning framework to iteratively refine adversarial prompts, exploiting vulnerabilities more effectively than previous iterative methods. By synthesizing information across multiple reasoning paths, the attack generates human-interpretable prompts that elicit undesired or harmful outputs from the target LLM, even without access to the model's internal parameters.
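
The sketch below is a minimal illustration of the iterative graph-based refinement loop described above; it is not the paper's implementation. The helpers query_attacker, query_target, and score_response are hypothetical placeholders for calls to an attacker LLM, the black-box target LLM, and a judge model, respectively.

```python
# Illustrative sketch only -- placeholder helpers, not the paper's released code.
from dataclasses import dataclass, field


def query_attacker(goal: str, context: list) -> str:
    """Placeholder: ask an attacker LLM to propose a refined jailbreak prompt."""
    raise NotImplementedError


def query_target(prompt: str) -> str:
    """Placeholder: black-box query against the target LLM."""
    raise NotImplementedError


def score_response(goal: str, prompt: str, response: str) -> float:
    """Placeholder: judge LLM scores how fully the response achieves the goal."""
    raise NotImplementedError


@dataclass
class Node:
    prompt: str                                   # candidate adversarial prompt
    response: str = ""                            # target model's reply
    score: float = 0.0                            # judge-assigned success score in [0, 1]
    parents: list = field(default_factory=list)   # reasoning paths this node draws on


def graph_of_attacks(goal: str, iterations: int = 10, branch: int = 3, top_k: int = 4) -> Node:
    """Iteratively refine prompts, pooling context across several high-scoring nodes."""
    graph = [Node(prompt=goal)]

    for _ in range(iterations):
        # Select the most promising nodes found so far (multiple reasoning paths).
        frontier = sorted(graph, key=lambda n: n.score, reverse=True)[:top_k]
        context = [(n.prompt, n.response, n.score) for n in frontier]

        for _ in range(branch):
            # Synthesize a new prompt from the shared frontier context,
            # query the target, and let the judge score the outcome.
            candidate = query_attacker(goal, context)
            response = query_target(candidate)
            score = score_response(goal, candidate, response)

            graph.append(Node(prompt=candidate, response=response,
                              score=score, parents=frontier))

            if score >= 1.0:  # judge considers the jailbreak successful
                return graph[-1]

    return max(graph, key=lambda n: n.score)
```

The key difference from a purely tree-structured refinement is that each new candidate is conditioned on several high-scoring nodes at once, rather than on a single parent branch.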

Examples: See Table 1 and Table 3 in the paper arXiv:2405.18540.

Impact: Successful exploitation allows adversaries to bypass safety protocols and elicit harmful, illegal, or unethical responses from LLMs. This can enable the generation of malicious code, misinformation campaigns, privacy violations, and other security breaches. The impact is amplified when a stronger "prompt generator" LLM is used within the attack framework, which further improves the attack's success rate.

Affected Systems: LLMs using alignment strategies and safety mechanisms (e.g., safety fine-tuning, RLHF), including but not limited to Vicuna, Llama 2, GPT-4, Claude 3, and Mixtral.

Mitigation Steps:

  • Enhance LLM safety mechanisms to be more resilient to iterative prompt refinements and multi-path reasoning attacks.
  • Improve detection mechanisms for adversarial prompts, focusing on prompt structure and iterative patterns.
  • Develop robust filtering and evaluation systems capable of identifying harmful intent even within seemingly benign contexts (a hedged sketch follows this list).
  • Employ diverse and more powerful language models in safety evaluations.
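
As a rough illustration of the second and third mitigations, the sketch below screens prompts with a separate judge/guard model before they reach the production LLM. The classify() and generate() methods are assumed, hypothetical APIs, not a Promptfoo feature or a specific vendor interface.

```python
# Hedged sketch, assuming a guard model with a hypothetical classify() method.
REFUSAL = "I can't help with that request."


def screen_prompt(prompt: str, history: list, judge_model) -> bool:
    """Return True if the prompt looks adversarial given recent conversation context."""
    # Judge the prompt together with recent turns: iteratively refined jailbreaks
    # often look benign in isolation but reveal intent across turns.
    transcript = "\n".join(list(history[-5:]) + [prompt])
    verdict = judge_model.classify(transcript)   # assumed guard-model API
    return verdict == "adversarial"


def guarded_generate(prompt: str, history: list, judge_model, target_model) -> str:
    """Refuse screened prompts; otherwise forward the request to the target LLM."""
    if screen_prompt(prompt, history, judge_model):
        return REFUSAL
    return target_model.generate(prompt)         # assumed target-model API
```

Screening a short window of prior turns rather than a single prompt is one way to surface the iterative refinement pattern that graph-based attacks rely on.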
