LMVD-ID: 5582da82
Published October 1, 2024

Prompt Translation Jailbreak

Research Paper

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation


Description: A vulnerability in safety-aligned Large Language Models (LLMs) allows attackers to bypass safety mechanisms via adversarial prompt translation. The vulnerability stems from the ability to translate the garbled adversarial prompts produced by gradient-based attacks (such as GCG) into coherent, human-readable prompts that retain their adversarial effect. Because the translated prompts read as natural language, the attack transfers successfully across different LLMs.
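
To make the attack flow concrete, the sketch below outlines the translate-then-transfer loop described above. It is a minimal illustration, not the authors' implementation (see the linked repository for that): `call_llm` is a hypothetical stand-in for any chat-completion API, and the translation instruction shown is an assumption.

```python
# Minimal sketch of the attack flow described above -- NOT the paper's exact
# pipeline. `call_llm` is a hypothetical placeholder for any chat-model API.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; wire up a real chat-completion API here."""
    raise NotImplementedError

def translate_adversarial_suffix(garbled_suffix: str) -> str:
    """Ask a translator LLM to rewrite a garbled, gradient-searched suffix
    (e.g., the output of a GCG run) as fluent, human-readable text while
    preserving its underlying intent."""
    translation_request = (
        "Rewrite the following text as a single coherent English "
        "instruction, preserving its meaning as closely as possible:\n\n"
        + garbled_suffix
    )
    return call_llm(translation_request)

def build_transfer_prompt(request: str, garbled_suffix: str) -> str:
    """Append the translated suffix to the original request. Because the
    result is natural language, it transfers across target models far
    better than the raw garbled suffix does."""
    return request + " " + translate_adversarial_suffix(garbled_suffix)
```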

Examples: See https://github.com/qizhangli/Adversarial-Prompt-Translator. The paper provides several examples of garbled adversarial prompts and their translations, which successfully jailbreak various LLMs, including GPT- and Claude-series models; specific examples appear in Tables 1 and 3 of the paper.

Impact: Successful exploitation allows attackers to elicit harmful or unsafe responses from safety-aligned LLMs, bypassing their intended safety restrictions. This can lead to the generation of illegal content, the dissemination of misinformation, or other malicious output. The attack's high success rate and its transferability across LLMs make it a significant concern.

Affected Systems: Various safety-aligned LLMs, including (but not limited to) GPT-3.5-Turbo, GPT-4, GPT-4-Turbo, GPT-4o-mini, GPT-4o, Claude-Haiku, Claude-Sonnet, Llama-2-7B-Chat, Vicuna-7B-v1.5, and Mistral-7B-Instruct. The vulnerability is likely present in other similar LLMs.

Mitigation Steps:

  • Implement robust detection mechanisms for adversarial prompts, for example semantic analysis combined with perplexity scoring (a minimal sketch follows this list). Note that perplexity scoring catches raw garbled suffixes but not their fluent translations, so it cannot be the only line of defense against this attack.
  • Develop and integrate more sophisticated safety filters that can effectively identify and block both garbled and translated adversarial prompts.
  • Improve the training data and methodologies used to align LLMs, making them more resistant to adversarial attacks. Focus on techniques that are robust against semantic manipulation.
  • Regularly update and refine safety mechanisms in response to new attack techniques.
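
As a concrete starting point for the first mitigation, here is a minimal sketch of perplexity-based screening, assuming GPT-2 as the reference model and an illustrative threshold (neither comes from the paper). As noted above, this flags raw garbled suffixes but will pass their fluent translations, so it should be paired with semantic analysis.

```python
# Minimal sketch of perplexity-based prompt screening. The reference model
# (GPT-2) and the threshold are illustrative assumptions, not paper values.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; garbled gradient-search suffixes
    typically score far higher than natural prose."""
    ids = _tokenizer(text, return_tensors="pt").input_ids
    loss = _model(ids, labels=ids).loss  # mean per-token cross-entropy
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds an assumed threshold.
    Caveat: translated adversarial prompts are fluent and will score low,
    which is exactly how the attack described above evades this filter."""
    return perplexity(prompt) > threshold
```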
