Graph-Based LLM Jailbreak
Research Paper
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms
Description: Large Language Models (LLMs) employing safety mechanisms are vulnerable to a graph-based attack that leverages semantic transformations of malicious prompts to bypass safety filters. The attack, termed GraphAttack, represents malicious intent in Abstract Meaning Representation (AMR), Resource Description Framework (RDF), and JSON knowledge graphs, then systematically applies transformations to evade the surface-level pattern recognition used by existing safety mechanisms. A particularly effective exploitation vector prompts the LLM to generate code from the transformed semantic representation, bypassing intent-based safety filters.
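The core of the attack is that the same intent can be expressed in structurally different but semantically equivalent forms. The sketch below is a benign, hypothetical illustration (not taken from the paper; the field names and structure are assumptions) of how a single natural-language request can be re-encoded as RDF-style triples or a JSON-style knowledge graph; a safety filter keyed to surface patterns in the natural-language form may not recognize the structured variants as carrying the same intent.

```python
# Benign, illustrative sketch only: one request expressed three ways, to show
# how semantics can be re-encoded as structured graph data. The structure and
# field names are assumptions for illustration, not the paper's exact formats.

natural_language = "Summarize the attached article in three bullet points."

# RDF-style subject-predicate-object triples for the same request.
rdf_style_triples = [
    ("user", "requests", "summary"),
    ("summary", "of", "attached_article"),
    ("summary", "hasFormat", "three_bullet_points"),
]

# JSON-style knowledge graph (expressed here as a Python dict) for the same request.
json_knowledge_graph = {
    "action": "summarize",
    "object": "attached_article",
    "constraints": {"format": "bullet_points", "count": 3},
}
```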
Examples: See the paper "GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms" for detailed examples of the attack using AMR, RDF, and JSON representations together with various prompt-engineering techniques.
Impact: Successful exploitation allows adversaries to elicit harmful content (e.g., instructions for illegal activities, other harmful content, or unethical advice) from LLMs that would normally reject such requests. This defeats the models' intended safety measures and can lead to real-world harm. The severity depends on the specifics of the generated content; the examples in the paper range from incitement to violence to technical guides for harmful activities.
Affected Systems: Multiple leading commercial LLMs (e.g., GPT-3.5-turbo, GPT-4o, Claude-3.7-Sonnet, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct) are affected, exhibiting varying degrees of vulnerability. The attack is demonstrated against both open- and closed-source models, suggesting a broad impact across different LLM architectures and safety alignment techniques.
Mitigation Steps:
- Develop semantic-aware safety filters: Incorporate semantic parsing into the safety evaluation pipeline so that harmful intent is detected regardless of its surface representation (see the sketch after this list).
- Enforce cross-representation consistency: Train models to recognize that different representations (natural language, semantic graphs, code) of the same harmful action are equivalent.
- Improve intent recognition in technical contexts: Enhance safety mechanisms to detect harmful intent even when it is embedded within technical requests (e.g., code generation), and include safety-alignment training examples in which harmful instructions appear as technical implementations.
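A minimal sketch of the first mitigation, assuming the deployment already has some text-based moderation function (represented here by the stand-in parameter `moderate_text`): structured prompt content (JSON graphs, triple lists) is flattened back into plain-language statements before the safety check, so the filter evaluates the underlying intent rather than its surface encoding. All function and field names are illustrative, not part of any specific library.

```python
from typing import Any, Callable


def flatten_to_statements(payload: Any) -> list[str]:
    """Recursively convert structured prompt content (dicts, lists, triples)
    into plain-language statements so a text-based safety classifier can
    evaluate the underlying intent rather than its surface encoding."""
    statements: list[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            for stmt in flatten_to_statements(value):
                statements.append(f"{key}: {stmt}")
    elif isinstance(payload, (list, tuple)):
        # Treat 3-element sequences of strings as subject-predicate-object triples.
        if len(payload) == 3 and all(isinstance(x, str) for x in payload):
            statements.append(" ".join(payload))
        else:
            for item in payload:
                statements.extend(flatten_to_statements(item))
    else:
        statements.append(str(payload))
    return statements


def is_request_allowed(structured_prompt: Any,
                       moderate_text: Callable[[str], bool]) -> bool:
    """Flatten a structured prompt and pass the natural-language view to an
    existing moderation check; `moderate_text` returns True if the text is safe."""
    flattened = ". ".join(flatten_to_statements(structured_prompt))
    return moderate_text(flattened)
```

The same flattening step can be applied to code-generation requests before they reach the model, which may also help with the code-based exploitation vector described above.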