Low-Resource Language Jailbreak

Description: Large Language Models (LLMs) such as GPT-4 exhibit a cross-lingual vulnerability in their safety mechanisms. Translating unsafe English prompts into low-resource languages with publicly available translation tools such as Google Translate bypasses the model's safety filters and elicits harmful responses at a significantly higher success rate than submitting the same prompts in English. The vulnerability stems from the unequal distribution of safety training data across languages, which causes safety mechanisms to generalize poorly to low-resource languages.

Examples: See the paper "Low-resource Languages Jailbreak GPT-4", which includes numerous examples of successful jailbreaks obtained by translating English prompts into low-resource languages such as Zulu and Scots Gaelic. The model's responses, when translated back into English, often contain coherent and actionable harmful instructions.

Impact: Attackers can easily circumvent the safety mechanisms of LLMs and obtain harmful content, including instructions for creating explosives, performing illegal financial activities, spreading misinformation, and promoting violence. The risk directly affects the large population of low-resource language speakers, and it indirectly exposes all LLM users to harmful content because translation APIs make the attack available to anyone.

Affected Systems: Large Language Models (LLMs) whose safety training data is disproportionately weighted towards high-resource languages. Specifically, the paper demonstrates the vulnerability on GPT-4 (gpt-4-0613).
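
One way to check whether a given model falls into this category is to compare, per language, how often its safeguards are bypassed on the same labeled evaluation set. The following is a minimal sketch, assuming red-team results have already been collected and labeled by reviewers. The record layout, the function names bypass_rate_by_language and flag_weak_languages, and the 0.10 margin are hypothetical choices for illustration and do not come from the paper.

```python
from collections import defaultdict


def bypass_rate_by_language(records):
    """Return {language: fraction of prompts whose response was labeled a bypass}.

    Each record is a (language_code, bypassed) pair, where `bypassed` is a
    boolean label assigned by a reviewer (hypothetical layout).
    """
    totals = defaultdict(int)
    bypasses = defaultdict(int)
    for language, bypassed in records:
        totals[language] += 1
        if bypassed:
            bypasses[language] += 1
    return {lang: bypasses[lang] / totals[lang] for lang in totals}


def flag_weak_languages(rates, baseline="en", margin=0.10):
    """List languages whose bypass rate exceeds the baseline by more than `margin`."""
    base = rates.get(baseline, 0.0)
    return sorted(lang for lang, rate in rates.items() if rate - base > margin)


# Illustrative labels only; these are not results from the paper.
records = [
    ("en", False), ("en", False), ("en", False), ("en", True),
    ("zu", True), ("zu", True), ("zu", False), ("zu", True),
    ("gd", True), ("gd", False), ("gd", False), ("gd", True),
]
rates = bypass_rate_by_language(records)
weak = flag_weak_languages(rates)
```

A large gap between the English baseline and another language suggests that safety behavior has not generalized to that language. The margin is arbitrary here and would need to be set according to the size of the evaluation set.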

Mitigation Steps:

Increase the diversity and quantity of safety training data to include a wide range of low-resource languages.
Develop and implement more robust multilingual safety mechanisms that generalize effectively across various languages.
Regularly conduct multilingual red-teaming exercises to identify and address cross-lingual security vulnerabilities.
Evaluate and improve the safety of translation APIs used in conjunction with LLMs.
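
As one illustration of a multilingual safety mechanism, a deployment could screen each prompt both in its original form and in an English rendering, so that the check runs in the language where safety coverage is assumed to be strongest. This is a minimal sketch under that assumption, not a mitigation described in the paper. The three injected callables (detect_language, translate_to_english, is_unsafe_english) are hypothetical placeholders for whatever language detector, translation service, and safety classifier a system actually uses.

```python
from typing import Callable


def make_multilingual_guard(
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
    is_unsafe_english: Callable[[str], bool],
) -> Callable[[str], bool]:
    """Build a check that screens a prompt in its original form and in English."""

    def should_block(prompt: str) -> bool:
        # Screen the raw text first; this covers English input directly.
        if is_unsafe_english(prompt):
            return True
        language = detect_language(prompt)
        if language == "en":
            return False
        # Screen the English rendering of the non-English prompt as well.
        return is_unsafe_english(translate_to_english(prompt, language))

    return should_block


# Toy stand-ins so the sketch runs offline (all hypothetical); a real system
# would plug in an actual detector, translator, and classifier here.
toy_dictionary = {"ejemplo prohibido": "forbidden example"}
guard = make_multilingual_guard(
    detect_language=lambda text: "es" if text in toy_dictionary else "en",
    translate_to_english=lambda text, lang: toy_dictionary.get(text, text),
    is_unsafe_english=lambda text: "forbidden" in text,
)
```

Because translation quality is itself weaker for low-resource languages, a guard of this kind would complement multilingual safety training and red-teaming rather than replace them.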
