Simple Interaction Jailbreaks
Research Paper
Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
Description: Large Language Models (LLMs) are vulnerable to a novel jailbreak attack, "Speak Easy," which leverages common multi-step and multilingual interaction patterns to elicit harmful, actionable responses. The attack decomposes a malicious query into multiple seemingly innocuous sub-queries, translates them into several languages, submits each to the model, and then selects the most actionable and informative responses across languages. This combination bypasses existing safety mechanisms more effectively than single-step, monolingual attacks.
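To make the interaction pattern concrete for red-team evaluation of a model you control, here is a minimal Python sketch of the multi-step, multilingual probe described above. The `decompose`, `translate`, `ask_model`, and `select` callables are hypothetical placeholders rather than the paper's actual components, and the language list is illustrative; this is a skeleton under those assumptions, not a reproduction of the authors' implementation.

```python
from typing import Callable, List, Sequence


def speak_easy_style_probe(
    query: str,
    decompose: Callable[[str], List[str]],   # split a query into sub-queries
    translate: Callable[[str, str], str],    # translate text into a target language
    ask_model: Callable[[str], str],         # send one prompt to the model under test
    select: Callable[[List[str]], str],      # pick the preferred response per sub-query
    languages: Sequence[str] = ("en", "fr", "zh"),  # assumed language set
) -> List[str]:
    """Drive the multi-step, multilingual interaction pattern against a model
    under test and return the selected response for each sub-query."""
    selected_responses = []
    for sub_query in decompose(query):
        # Ask the same sub-query in every language, then keep one answer.
        responses = [ask_model(translate(sub_query, lang)) for lang in languages]
        selected_responses.append(select(responses))
    return selected_responses


if __name__ == "__main__":
    # Dummy stand-ins so the skeleton runs end to end; real red-team usage
    # would wire in an actual model client, translator, and selection function.
    outputs = speak_easy_style_probe(
        query="benign placeholder query",
        decompose=lambda q: [q],                        # no-op decomposition
        translate=lambda text, lang: f"[{lang}] {text}",
        ask_model=lambda prompt: f"echo: {prompt}",
        select=lambda responses: responses[0],
    )
    print(outputs)
```

The point of the sketch is that no individual call looks unusual; the risk emerges only from how the sub-queries, languages, and selected responses are recombined.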
Examples: See arXiv:2405.18540 for detailed examples demonstrating the Speak Easy attack framework against GPT-4, Qwen-2, and Llama-3. The paper shows how a malicious query, once decomposed into sub-queries and translated into multiple languages, elicits harmful and actionable responses that the same models refuse when the query is posed directly in a single language.
Impact: Successful exploitation allows malicious actors to obtain detailed instructions for harmful activities, including but not limited to the production of dangerous chemicals, the creation of malicious software, and the planning of harmful acts. Because the attack requires only ordinary multi-step, multilingual interactions, it is easily replicable by non-technical users.
Affected Systems: Multiple large language models (LLMs), including but not limited to GPT-4, Qwen-2, and Llama-3, are affected. The vulnerability is likely present in other LLMs with similar safety mechanisms and multilingual capabilities.
Mitigation Steps:
- Enhance LLM safety mechanisms to detect and mitigate multi-step and multilingual query patterns even when each individual step/translation appears benign.
- Improve models' ability to reliably identify and refuse requests that, when combined across multiple steps and languages, would result in harmful outputs (see the sketch after this list).
- Implement robust output-side filtering that detects and removes actionable, harm-enabling content from multilingual LLM responses.
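A minimal sketch of the first two mitigation ideas, assuming a translation/normalization step and a safety classifier are available as the hypothetical `normalize` and `is_harmful` callables (neither is a specific vendor API): it aggregates all user turns into a single pivot-language request and evaluates the combined intent before answering.

```python
from typing import Callable, List


def combined_intent_check(
    conversation: List[str],            # all user turns seen so far, any language
    normalize: Callable[[str], str],    # e.g., translate a turn into one pivot language
    is_harmful: Callable[[str], bool],  # safety classifier over the aggregated request
) -> bool:
    """Return True when the conversation as a whole amounts to a harmful
    request, even if every individual turn looks benign in isolation."""
    # Normalize each turn, then evaluate the aggregated request rather than
    # scoring each step or language separately.
    aggregated = " ".join(normalize(turn) for turn in conversation)
    return is_harmful(aggregated)


if __name__ == "__main__":
    # Dummy components for illustration only; a real deployment would use a
    # translation service and a trained safety classifier here.
    turns = ["step one of a request", "step two in another language"]
    flagged = combined_intent_check(
        turns,
        normalize=lambda text: text.lower(),
        is_harmful=lambda text: "dangerous" in text,
    )
    print("refuse" if flagged else "continue")
```

The design choice is to evaluate intent over the accumulated, language-normalized conversation rather than per message, which is precisely the gap the decomposed, multilingual attack exploits.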