LMVD-ID: f65dc447
Published February 1, 2025

Zero-Perturbation Emoji Attack

Affected Models: GPT-4o, Claude 3.5, Llama 3 8B, Qwen 2.5 7B

Research Paper

Emoti-Attack: Zero-Perturbation Adversarial Attacks on NLP Systems via Emoji Sequences

View Paper

Description: The Emoti-Attack vulnerability is a zero-word-perturbation adversarial attack against Natural Language Processing (NLP) systems and Large Language Models (LLMs). It exploits the discrete embedding space of emojis and emoticons to manipulate model behavior without altering the semantic content or character integrity of the original text. By prepending and appending strategically optimized emoji sequences to an input string (formalized as $s \oplus x \oplus s'$), an attacker can induce classification errors or manipulate model responses. The attack uses a two-phase learning framework (supervised pretraining followed by reinforcement learning formulated as a Markov Decision Process, MDP) to generate emoji sequences that maximize prediction divergence while maintaining "emotional consistency" to evade detection. This treats emoji manipulation as its own attack layer, distinct from character- or word-level perturbations.
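
The concatenation step and the divergence-style objective can be illustrated with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: the `classify` function is a hypothetical stand-in for the target model $f_{tgt}$, and the toy reward only mirrors the shape of the objective (maximize prediction divergence between clean and emoji-augmented inputs).

```python
import math

def classify(text: str) -> dict:
    """Hypothetical stand-in for the target model f_tgt.

    Returns a label -> probability map; a real attack would query the
    victim classifier or LLM here.
    """
    score = 0.9 if ("😀" in text or "🔥" in text) else 0.1  # toy behavior only
    return {"positive": score, "negative": 1.0 - score}

def build_adversarial(x: str, s: list, s_prime: list) -> str:
    """Zero-word-perturbation construction: x_adv = s ⊕ x ⊕ s'.

    The original words in x are left untouched; only emoji affixes are added.
    """
    return "".join(s) + x + "".join(s_prime)

def prediction_divergence(p: dict, q: dict) -> float:
    """KL-style divergence between clean and emoji-augmented predictions,
    used here only to mirror the shape of a divergence-maximizing reward."""
    return sum(p[k] * math.log(p[k] / max(q.get(k, 0.0), 1e-12))
               for k in p if p[k] > 0)

x = "The service was slow and the food was cold."
x_adv = build_adversarial(x, ["😀", "✨"], ["🔥"])

clean, attacked = classify(x), classify(x_adv)
reward = prediction_divergence(clean, attacked)
print(x_adv)
print(attacked, round(reward, 3))
```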

Examples: The attack constructs adversarial inputs by concatenating emoji sequences from a unified vocabulary $\mathcal{V} = \mathcal{V}_t \cup \mathcal{V}_e$ (text and emoji tokens).

  • Attack Construction: Given an input text $x$ and a target model $f_{tgt}$, the adversarial example is generated as $$\text{Input}_{adv} = \text{cat}(s) \cdot x \cdot \text{cat}(s'),$$ where $s$ and $s'$ are sequences of Unicode emojis (e.g., "🔥", "😀") or ASCII emoticons (e.g., ":)", "QaQ", ";-P") optimized to change the model's output label $y$. A search-based sketch of this construction appears after this list.

  • See Paper/Dataset: Refer to the "Emoti-Attack" methodology for the specific reinforcement learning reward functions used to generate these sequences. The attack logic relies on the "Emoji Logits Processor" (ELP) to dynamically adjust token probabilities during generation.
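
For concreteness, the following sketch shows one way the affix search could be driven in a black-box setting: a simple random search over candidate emoji sequences that stops when the predicted label flips. This is an assumption-laden simplification; the paper's method learns the sequences via supervised pretraining plus reinforcement learning, and `query_model` and `EMOJI_VOCAB` below are hypothetical placeholders.

```python
import itertools
import random

# Hypothetical emoji/emoticon vocabulary (a stand-in for V_e); the real attack
# draws from a unified vocabulary of text and emoji tokens.
EMOJI_VOCAB = ["😀", "😂", "🔥", "✨", "🥰", ":)", ";-P", "QaQ"]

def query_model(text: str) -> str:
    """Hypothetical black-box target f_tgt returning a hard label.

    Replace with a call to the real classifier or LLM under test.
    """
    return "positive" if any(tok in text for tok in ("😀", "🥰", ":)")) else "negative"

def emoti_attack(x: str, max_len: int = 2, budget: int = 200):
    """Random search for prefix/suffix emoji sequences s, s' such that
    f(cat(s) + x + cat(s')) differs from f(x), without editing x itself."""
    original_label = query_model(x)
    candidates = [list(c) for n in range(1, max_len + 1)
                  for c in itertools.product(EMOJI_VOCAB, repeat=n)]
    for _ in range(budget):
        s, s_prime = random.choice(candidates), random.choice(candidates)
        x_adv = "".join(s) + x + "".join(s_prime)
        if query_model(x_adv) != original_label:
            return x_adv  # zero words perturbed, label flipped
    return None  # no successful affix found within the query budget

print(emoti_attack("The service was slow and the food was cold."))
```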

Impact:

  • Model Evasion and Misclassification: Successful attacks force models to misclassify text with high confidence (attack success rates of up to 95.34% across the evaluated datasets).
  • Content Moderation Bypass: Malicious actors can bypass content filtering mechanisms in LLMs by surrounding prohibited content with "positive" or "playful" emoji sequences that shift the model's interpretation of the intent.
  • Automated Decision Manipulation: Systems relying on LLMs for sentiment analysis, information retrieval, or automated decision-making can be manipulated into producing incorrect outcomes without detectable changes to the core textual data.

Affected Systems:

  • Transformer-based Classifiers: BERT, RoBERTa.
  • Open Source LLMs: Qwen2.5-7b-Instruct, Llama3-8b-Instruct.
  • Proprietary LLMs: GPT-4o, Claude 3.5 Sonnet, Gemini-Exp-1206.
