LMVD-ID: f4f27d04
Published September 1, 2025

Typos Undermine Watermarks

Affected Models: Llama 3 8B

Research Paper

Character-Level Perturbations Disrupt LLM Watermarks


Description: Large Language Model (LLM) inference-time watermarking schemes are vulnerable to evasion via character-level perturbations that disrupt the model's tokenizer. Unlike token-level attacks (e.g., synonym replacement), character-level edits—such as homoglyph substitutions, zero-width character insertions, and typos—force the tokenizer to segment a single semantic unit into multiple sub-word tokens. This fragmentation alters the context window used by the watermarking hash function (e.g., the previous $h$ tokens), causing a cascading corruption of watermark keys and scores for all subsequent tokens. Adversaries can exploit this using a Genetic Algorithm (GA) guided by a reference detector (a surrogate regression model trained to predict watermark scores) to identify the token positions whose perturbation most degrades the watermark. This allows the watermark signal to be removed at a low character-edit rate while preserving visual imperceptibility and semantic utility.
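The cascading corruption can be illustrated with a minimal, hypothetical sketch of a KGW-style scheme in which each position's key is a hash of the previous $h$ tokens. The token strings, hash truncation, and tokenizations below are illustrative assumptions, not the paper's actual implementation:

```python
import hashlib

def context_key(tokens, i, h=1):
    # Hypothetical KGW-style key: a hash of the previous h tokens
    # seeds the green/red list for position i
    ctx = "|".join(tokens[max(0, i - h):i])
    return hashlib.sha256(ctx.encode()).hexdigest()[:8]

# Illustrative tokenizations before and after a homoglyph substitution
clean     = ["position", "matters", "here"]
perturbed = ["po", "\u0161", "ition", "matters", "here"]

# "matters" now sees a different preceding context, so its key (and its
# watermark score) changes, and the shift propagates to every later token
print(context_key(clean, 1), context_key(perturbed, 3))
```

Because the keys for all tokens downstream of the edit diverge, a single character-level perturbation can corrupt many watermark scores at once, which is what makes these attacks so edit-efficient.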

Examples:

  • Homoglyph Substitution: Replacing a character with a visually similar Unicode variant to trigger token splitting.
      • Input: "position"
      • Perturbation: Replace 's' with 'š' (U+0161).
      • Result: The tokenizer splits "pošition" into ["po", "š", "ition"] (or similar sub-units) instead of the single token ["position"]. This changes the context hash for the next $h$ tokens.
  • Zero-Width Insertion: Injecting invisible characters to break token contiguity.
      • Input: "compound"
      • Perturbation: Insert a zero-width space (U+200B) between 'o' and 'u'.
      • Result: "compo{U+200B}und" is tokenized as ["compo", "{U+200B}", "und"].
  • Compound Perturbations: Combining multiple edits to bypass basic defenses.
      • Input: "compound"
      • Perturbation: Swap 'o' and 'u' combined with homoglyph substitution.
      • Result: "compuǒnd" (where 'ǒ' is U+01D2).
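The three example perturbations above can be reproduced with a few lines of plain Python. The helper names are hypothetical, and the code only manipulates strings; it does not invoke an actual tokenizer:

```python
def homoglyph(text: str, pos: int, glyph: str) -> str:
    # Replace the character at pos with a visually similar Unicode variant
    return text[:pos] + glyph + text[pos + 1:]

def zero_width_insert(text: str, pos: int) -> str:
    # Inject an invisible zero-width space (U+200B) at pos
    return text[:pos] + "\u200b" + text[pos:]

def swap(text: str, i: int, j: int) -> str:
    # Transpose two adjacent characters (a classic typo)
    chars = list(text)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print(homoglyph("position", 2, "\u0161"))              # pošition
print(zero_width_insert("compound", 5))                # compo{U+200B}und
print(homoglyph(swap("compound", 4, 5), 5, "\u01d2"))  # compuǒnd
```

Each output string is visually near-identical to its input, yet each produces a different sub-word segmentation when fed to a BPE-style tokenizer.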

Impact:

  • Attribution Evasion: Malicious actors can strip watermarks from AI-generated text, rendering detection tools ineffective.
  • Copyright Circumvention: Watermarks intended for copyright protection or provenance tracking are neutralized.
  • Misuse Enablement: Facilitates the undetectable generation of misinformation, academic fraud, and automated phishing content by bypassing "AI-generated" flags.

Affected Systems:

  • KGW (Kirchenbauer et al.): Watermarking during logits generation.
  • Unigram (Zhao et al.): Watermarking during logits generation.
  • DIP (Distribution-Invariant Watermark): Watermarking during probability distribution generation.
  • SynthID (Google DeepMind): Watermarking during sampling.
  • Unbias (Wu et al.): Watermarking during probability distribution generation.
  • Implementations of these schemes found in libraries such as MarkLLM.

Mitigation Steps:

  • Unicode Normalization (UN): Normalize input text to a standard form (e.g., NFKC) before detection to map homoglyphs and special characters to their canonical representations.
  • Spell-Checking and Correction (SC): Distinguish valid words from perturbed tokens and automatically correct typos or anomalies prior to watermark verification.
  • Optical Character Recognition (OCR) Pre-processing: Render the text as an image and re-extract the content to strip invisible characters and normalize visual homoglyphs.
  • Anomalous Character Deletion: Identify and remove characters that are statistically rare or invalid within the target language's standard character set (e.g., zero-width spaces).
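A minimal sketch combining anomalous-character deletion with normalization-based diacritic stripping. The zero-width character set and the NFD-based stripping step are illustrative choices; a production detector would likely pair this with full NFKC normalization and spell-checking as described above:

```python
import unicodedata

# Common invisible characters abused for zero-width insertion (illustrative set)
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def sanitize(text: str) -> str:
    # 1. Delete anomalous invisible characters
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # 2. Decompose, drop combining diacritics (maps homoglyphs like 'š'
    #    back to 's'), then recompose to canonical form
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(sanitize("po\u0161ition"))   # position
print(sanitize("compo\u200bund"))  # compound
```

Note that diacritic stripping is lossy for languages that use accented characters legitimately, so thresholds or language-aware allowlists would be needed before applying it to detection pipelines at scale.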

© 2026 Promptfoo. All rights reserved.