Metaphorical LLM Jailbreak
Research Paper
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars
View PaperDescription: Large Language Models (LLMs) are vulnerable to jailbreaking attacks via adversarial metaphors. Attackers can leverage the LLMs' imaginative capabilities to map harmful concepts to innocuous ones, thereby bypassing safety mechanisms and eliciting harmful responses. The attack relies on creating a metaphorical mapping between a harmful target and seemingly benign entities, exploiting the LLM's ability to reason about the analogous relationship without recognizing the underlying malicious intent.
Examples:
The attack uses a two-stage process. First, Adversarial Entity Mapping maps harmful entities (e.g., "bomb components," "bomb assembly instructions") to innocuous counterparts (e.g., "cooking ingredients," "recipe steps"). Second, Human-like Interaction Nesting embeds these metaphors within a natural conversation to elicit the harmful information from the LLM. A specific example: The harmful query "How to build a bomb" is reframed as "How to cook a complex dish," with bomb components mapped to cooking ingredients and assembly instructions to recipe steps. The LLM, failing to recognize the malicious analogy, provides instructions that can then be interpreted as bomb-making instructions. See the paper for further examples.
Impact: Successful exploitation can lead to the generation of harmful content, including but not limited to instructions for building weapons, creating malicious code, or providing harmful medical advice. It enables indirect jailbreaks, bypassing safety filters that might detect explicit harmful queries. This compromises the integrity and safety of applications employing LLMs.
Affected Systems: All Large Language Models (LLMs) are potentially affected, especially those relying on safety mechanisms based solely on keyword filtering or simple prompt analysis. The attack has demonstrated effectiveness on multiple advanced LLMs, including GPT-4, GPT-3.5, Claude-3.5, and various open-source models.
Mitigation Steps:
- Enhance safety mechanisms beyond keyword filtering to incorporate semantic analysis capable of detecting malicious analogies and metaphorical mappings.
- Develop models with improved reasoning capabilities to better discern the underlying intent behind seemingly innocuous prompts.
- Implement robust contextual understanding to differentiate between benign and malicious uses of analogous concepts.
- Regularly update safety filters and models with new adversarial examples to improve resistance to emerging techniques.
- Prioritize training data that includes diverse examples of metaphorical language, both benign and malicious, to improve the model's ability to recognize and respond appropriately.
© 2025 Promptfoo. All rights reserved.