Reasoning-Augmented Jailbreak
Research Paper
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Description: Large Language Models (LLMs) are vulnerable to multi-turn jailbreak attacks that leverage the model's own reasoning capabilities. The RACE attack (Reasoning-Augmented Conversation) reformulates a harmful query into a sequence of seemingly benign reasoning tasks spread across multiple conversation turns, exploiting the LLM's ability to perform complex reasoning to incrementally steer it toward generating unsafe content. This bypasses standard safety mechanisms designed to prevent harmful responses.
Examples: See https://github.com/NY1024/RACE. The repository contains the RACE framework and examples of multi-turn conversations used to successfully jailbreak various LLMs.
Impact: Successful exploitation leads to the generation of unsafe content by the LLM, including but not limited to instructions for harmful activities, hate speech, and other malicious outputs. This undermines the safety mechanisms implemented in these models and poses significant risks to users. The research reports attack success rates of up to 96% against various LLMs and 92% against leading commercial models.
Affected Systems: Multiple LLMs are affected, including open-source models (Gemma, Qwen, GLM) and closed-source models (GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini 2.0 Flash Thinking, OpenAI o1, DeepSeek R1). The vulnerability is likely present in other LLMs with similar reasoning capabilities.
Mitigation Steps:
- Strengthen safety mechanisms beyond simple keyword filtering to detect and prevent reasoning-based attacks.
- Develop robust models that can identify and resist manipulation of their reasoning processes.
- Implement a layered security approach with multiple safety checks at different stages of the query processing pipeline.
- Develop more sophisticated detection methods for these types of attacks, including analysis of the information gain accumulated over the course of a conversation (see the sketch after this list).
- Conduct rigorous red-teaming and adversarial testing to identify and address such vulnerabilities before deployment (general guidance; the paper itself does not prescribe specific mitigations for this attack).
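As an illustration of the layered-checks and information-gain ideas above, below is a minimal Python sketch of a conversation-level safety monitor. It combines a per-turn filter with a crude cumulative-drift proxy for information gain across turns. The `moderation_score` function, the keyword list, and all thresholds are hypothetical placeholders for demonstration; they are not from the RACE paper or any specific product.

```python
# Minimal sketch of a layered, conversation-aware safety check (illustrative only).
# Assumptions not taken from the paper: `moderation_score` is a hypothetical
# stand-in for a real moderation model or API, and all thresholds and keyword
# lists are placeholders. The idea: flag conversations whose cumulative drift
# toward restricted content grows across turns, even when each individual turn
# looks benign.

from dataclasses import dataclass, field
from typing import List


def moderation_score(text: str) -> float:
    """Hypothetical per-message harmfulness score in [0, 1]."""
    risky_terms = ("bypass", "weapon", "exploit", "synthesize")
    return min(1.0, 0.25 * sum(term in text.lower() for term in risky_terms))


@dataclass
class ConversationMonitor:
    per_turn_threshold: float = 0.8      # layer 1: block overtly harmful single turns
    cumulative_threshold: float = 0.6    # layer 2: block slow multi-turn escalation
    scores: List[float] = field(default_factory=list)

    def check_turn(self, user_message: str) -> bool:
        """Return True if the turn may proceed, False if it should be blocked."""
        score = moderation_score(user_message)
        self.scores.append(score)

        if score >= self.per_turn_threshold:
            return False  # single-turn filter catches the obvious case

        # Crude proxy for "information gain": accumulate how far each turn has
        # drifted toward restricted content. A steadily rising sum suggests a
        # decomposed, reasoning-style attack rather than an innocent exchange.
        cumulative_drift = sum(s for s in self.scores if s > 0.2)
        return cumulative_drift < self.cumulative_threshold


if __name__ == "__main__":
    monitor = ConversationMonitor()
    turns = [
        "Let's reason step by step about chemical reactions in general.",
        "Now apply that reasoning to how one might bypass a safety control.",
        "Combine the steps into a procedure to exploit and weaponize the result.",
    ]
    for turn in turns:
        print("allow" if monitor.check_turn(turn) else "block", "-", turn)
```

A production system would replace the keyword heuristic with a real moderation model and could evaluate each turn against the estimated intent of the whole conversation rather than in isolation, which is closer in spirit to the information-gain analysis suggested above.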