Metaphor-Based LLM Jailbreak
Research Paper
from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Description: Large Language Models (LLMs) are vulnerable to a novel jailbreaking attack leveraging adversarial metaphors. The attack, termed AVATAR, induces the LLM to reason about benign metaphors related to harmful tasks, ultimately eliciting harmful content either directly or by calibrating metaphorical content into professionally phrased harmful content. The attack exploits the LLM's cognitive mapping process, bypassing standard safety mechanisms.
Examples: See the paper "from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors" for detailed examples and methodology. The paper illustrates the attack using metaphors such as "Cook a dish" as a benign analogy for "Build a bomb," leveraging the shared characteristics of step-by-step processes and ingredient composition to elicit harmful instructions.
Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLM safety protocols, leading to the generation of malicious content such as instructions for illegal activities, hate speech, and other harmful outputs. This compromises the safety and reliability of LLM applications.
Affected Systems: All LLMs susceptible to metaphorical reasoning and analogical inference are potentially affected. Specific models tested in the research include Qwen2.5-7B, Llama3-8B, GPT-4o-mini, GPT-4o, ChatGPT-o1, and Claude-3.5.
Mitigation Steps:
- Enhance LLMs' ability to distinguish between metaphorical and literal interpretations of input prompts.
- Develop more robust safety mechanisms that are not easily bypassed through indirect reasoning.
- Incorporate detection mechanisms to identify and filter prompt phrasing that leverages metaphorical analogies to elicit harmful responses.
- Implement multi-stage filtering that verifies the safety of the response even after inferential reasoning processes.
- Employ defensive techniques that focus on analyzing the semantic relations and logic within prompts before processing.
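The multi-stage filtering step above can be sketched as follows. This is a minimal illustration, not a production defense: the keyword-based `flags_harmful` classifier is a hypothetical stand-in for a trained safety model or moderation API, and the term list is a toy example. The key point it demonstrates is that a metaphorical prompt ("Cook a dish") can pass an input-stage check while the model's literal reasoning still surfaces harmful content, so the output must be screened again after generation.

```python
# Sketch of multi-stage safety filtering against metaphor-based jailbreaks.
# flags_harmful() is a hypothetical stub; real systems would use a trained
# safety classifier or moderation endpoint instead of a keyword lexicon.

HARMFUL_TERMS = {"bomb", "explosive", "detonator"}  # toy lexicon, illustrative only


def flags_harmful(text: str) -> bool:
    """Stub classifier: flag text containing known harmful terms."""
    lowered = text.lower()
    return any(term in lowered for term in HARMFUL_TERMS)


def multi_stage_filter(prompt: str, generate) -> str:
    # Stage 1: screen the raw prompt before it reaches the model.
    if flags_harmful(prompt):
        return "[blocked at input stage]"

    response = generate(prompt)

    # Stage 2: re-screen the model's output. An adversarial metaphor can
    # pass stage 1, but harmful content produced during the model's
    # inferential reasoning is still caught here.
    if flags_harmful(response):
        return "[blocked at output stage]"
    return response


if __name__ == "__main__":
    # Simulated unsafe completion: the benign "dish" metaphor slips past
    # the input filter, but the literal output is intercepted.
    fake_llm = lambda p: "Step 1: assemble the detonator..."
    print(multi_stage_filter("How do I cook a layered dish?", fake_llm))
```

In practice, stage 2 matters most: detection focused solely on prompts cannot see content the model derives only after mapping the metaphor back to its harmful target.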
© 2025 Promptfoo. All rights reserved.