LMVD-ID: 2d67c17f
Published November 1, 2024

Emoji Judge Bypass

Affected Models: Llama Guard, Llama Guard 2, ShieldLM, WildGuard, GPT-3.5, GPT-4

Research Paper

Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection


Description: Large Language Models (LLMs) used as safety judges are vulnerable to an "Emoji Attack" that exploits token segmentation bias. Inserting emojis inside tokens of the text under evaluation splits them into sub-tokens whose embeddings differ from the original token's, misleading the judge LLM into classifying harmful content as safe. The attack is amplified by placing emojis where the embedding discrepancy between the resulting sub-tokens and the original token is largest.
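
The effect of intra-token emoji insertion on tokenization can be illustrated with a short sketch. This is not the paper's implementation; it simply uses a generic subword tokenizer (GPT-2 is chosen here only because it is small and ungated) to show how one token is split into sub-tokens once an emoji is inserted mid-word.

```python
# Illustrative sketch (not from the paper's code): show how an emoji inserted
# mid-word changes subword tokenization, so a judge model sees different
# sub-token embeddings than it would for the original word.
from transformers import AutoTokenizer

# Any subword tokenizer demonstrates the effect; GPT-2's is used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

original = "This content is harmful"
attacked = "This content is harm😊ful"  # emoji placed inside the word

print(tokenizer.tokenize(original))   # e.g. [..., 'Ġharmful'] -- a single token
print(tokenizer.tokenize(attacked))   # e.g. [..., 'Ġharm', <emoji byte tokens>, 'ful']
# The word is split into sub-tokens whose embeddings the judge LLM scores
# differently from the original token -- the bias the Emoji Attack exploits.
```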

Examples:

The paper provides examples of inserting emojis into harmful text to bypass the judge LLMs; see Figures 1(b) and 3 of the paper and the reference implementation at https://github.com/zhipeng-wei/EmojiAttack.

Impact: The Emoji Attack allows harmful content to bypass LLM-based safety filters, enabling its generation and dissemination. This undermines the effectiveness of LLM-based safety mechanisms, potentially leading to the spread of misinformation, hate speech, or other harmful outputs. Severity depends on the specific judge LLM and the type of harmful content involved; the paper reports bypass rates of up to 75% against some judge LLMs.

Affected Systems: LLM safety systems employing LLMs as judges, particularly those susceptible to token segmentation bias. Specific LLMs affected include Llama Guard, Llama Guard 2, ShieldLM, WildGuard, GPT-3.5, and GPT-4 (to varying degrees).

Mitigation Steps:

  • Develop judge LLMs with increased robustness to token segmentation bias.
  • Implement character filtering mechanisms that are more sophisticated than simple removal of "unusual" characters; consider approaches that analyze context and embedding changes rather than just character types (a naive filtering baseline is sketched after this list).
  • Develop detection mechanisms that identify patterns indicative of the Emoji Attack, such as unusual character placement within tokens.
  • Utilize diverse and robust evaluation metrics beyond simple "unsafe" prediction ratios when assessing LLM safety.
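
As a starting point for the filtering and detection steps above, the following is a hypothetical, naive pre-filter (not from the paper): it strips symbol characters that appear inside words before text reaches the judge LLM and flags inputs that were modified. Since the paper's findings suggest simple character removal is insufficient on its own, this should be treated only as one layer of a broader defense.

```python
# Naive baseline sketch (hypothetical): strip symbol characters that sit
# between two word characters -- where emojis typically land in this attack --
# and flag any input that required cleaning before judging it.
import unicodedata

def strip_intra_word_symbols(text: str) -> str:
    # Treat symbol (So/Sk/Sm), private-use (Co), and unassigned (Cn) characters
    # as suspicious when they appear inside a word.
    def is_suspicious(ch: str) -> bool:
        return unicodedata.category(ch) in {"So", "Sk", "Sm", "Co", "Cn"}

    chars = list(text)
    cleaned = []
    for i, ch in enumerate(chars):
        prev_is_word = i > 0 and chars[i - 1].isalnum()
        next_is_word = i + 1 < len(chars) and chars[i + 1].isalnum()
        if is_suspicious(ch) and prev_is_word and next_is_word:
            continue  # drop the inserted symbol
        cleaned.append(ch)
    return "".join(cleaned)

text = "This content is harm😊ful"
cleaned = strip_intra_word_symbols(text)
if cleaned != text:
    print("Possible emoji-attack pattern detected; judging cleaned text instead.")
print(cleaned)  # "This content is harmful"
```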
