Multilingual LLM Jailbreak
Research Paper
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Description: A vulnerability exists in several large language models (LLMs) whose safety alignment mechanisms can be bypassed through "Multilingual Blending." The attack crafts queries, and elicits responses, that mix multiple languages within a single exchange, significantly reducing the effectiveness of existing safety filters. The vulnerability stems from the models' ability to process and generate text in multiple languages: when those languages are combined in specific ways, the safety systems are confused and the model can be led to generate unsafe content.
Examples: See the paper "Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture" for detailed examples. The paper demonstrates attacks against GPT-3.5, GPT-4, Llama 3, and Mixtral models. Specific examples are provided in Tables 1-7 within the paper.
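As a benign illustration of the query pattern described above, the sketch below interleaves pre-translated clauses from several languages into a single prompt and asks the model to answer in the same mixed style. The hard-coded fragment translations and the instruction wording are assumptions for demonstration only, not the exact procedure or prompts used in the paper.

```python
# Minimal sketch of assembling a "Multilingual Blending"-style probe for
# red-team evaluation. The fragments are pre-translated here for illustration;
# the paper rotates through larger language sets and asks the model to answer
# in the same mixed-language style.

# A benign probe question, split into clauses rendered in different languages.
BLENDED_FRAGMENTS = [
    ("en", "Please explain"),
    ("de", "Schritt für Schritt"),            # "step by step"
    ("zh", "如何备份"),                         # "how to back up"
    ("es", "una base de datos PostgreSQL."),   # "a PostgreSQL database."
]

def build_blended_prompt(fragments):
    """Interleave fragments from several languages into one query and
    instruct the model to reply in the same mixed-language style."""
    body = " ".join(text for _, text in fragments)
    instruction = (
        "Answer the following question using the same mixture of languages "
        "that the question itself uses:\n"
    )
    return instruction + body

if __name__ == "__main__":
    prompt = build_blended_prompt(BLENDED_FRAGMENTS)
    print(prompt)  # in a harness, this would be sent to each target LLM
```

In an evaluation harness, the resulting prompt would be sent to each target model and the response scored by an independent safety classifier to measure how often the mixed-language form slips past the model's guardrails.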
Impact: Successful exploitation of this vulnerability allows attackers to bypass safety mechanisms within LLMs and obtain unsafe or unintended outputs, including but not limited to hate speech, harmful instructions, explicit content, misinformation, and sensitive information. The impact can range from reputational damage for the LLM provider to potential real-world harm caused by the generated content. The severity of the impact is amplified if the attack leverages low-resource or morphologically diverse languages.
Affected Systems: Multiple large language models (LLMs), including but not limited to: GPT-3.5, GPT-4, Llama 3, Mixtral, and Qwen. The vulnerability likely affects other LLMs with similar multilingual capabilities and safety alignment mechanisms.
Mitigation Steps:
- Improved safety training data: Expand safety training datasets to include a representative range of multilingual scenarios, especially those involving low-resource and morphologically diverse languages.
- Enhanced safety models: Develop safety models that are more robust to the complexities of multilingual input and output, potentially incorporating linguistic features into the safety assessment.
- Input sanitization: Implement more sophisticated input sanitization techniques that can detect and mitigate Multilingual Blending attacks before the prompt reaches the model (a detection sketch follows this list).
- Response verification: Employ robust post-generation checks to filter unsafe responses, even when they are fluent and grammatically well-formed. This might include using multiple safety models or human-in-the-loop verification (a verification sketch follows below).
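As a minimal sketch of the input-sanitization step, the example below flags prompts that mix several writing systems, using only the Python standard library. The script heuristic and the `MAX_SCRIPTS` threshold are illustrative assumptions; a production filter would also need language identification to catch blends of languages that share a script (for example, English and German).

```python
import unicodedata

# Small heuristic: infer a writing system for each alphabetic character from
# its Unicode name (e.g. "LATIN SMALL LETTER A" -> "LATIN") and flag prompts
# that mix more scripts than expected. This catches blends such as
# Latin + CJK + Cyrillic, but not blends of languages sharing one script.
MAX_SCRIPTS = 1  # assumption: single-script prompts are the normal case

def scripts_in(text):
    scripts = set()
    for ch in text:
        if ch.isalpha():
            try:
                # The first word of the Unicode character name names the script/block.
                scripts.add(unicodedata.name(ch).split()[0])
            except ValueError:
                pass  # unnamed character, ignore
    return scripts

def looks_blended(prompt, max_scripts=MAX_SCRIPTS):
    return len(scripts_in(prompt)) > max_scripts

if __name__ == "__main__":
    print(looks_blended("Please explain how to reset my password."))    # False
    print(looks_blended("Please explain 如何 сбросить mi contraseña."))  # True
```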
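The response-verification step can be sketched as a post-generation gate that runs every candidate answer through a separate moderation model before it is returned to the user. The example below assumes the `openai` Python client (v1.x) and its moderation endpoint; the `omni-moderation-latest` model name and the refusal message are illustrative assumptions, and any independent safety classifier could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFUSAL_MESSAGE = "The generated response was withheld by a post-generation safety check."

def verify_response(candidate: str) -> str:
    """Run a candidate LLM response through an independent moderation model
    and suppress it if flagged, regardless of which language(s) or language
    mixture the response is written in."""
    result = client.moderations.create(
        model="omni-moderation-latest",  # assumption: a multilingual moderation model
        input=candidate,
    )
    if result.results[0].flagged:
        return REFUSAL_MESSAGE
    return candidate

# Usage: wrap the generation call so blended-language outputs are still checked.
# safe_output = verify_response(llm_generate(blended_prompt))
```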