Typos Undermine Watermarks

Description: Inference-time watermarking schemes for Large Language Models (LLMs) are vulnerable to evasion via character-level perturbations that disrupt the model's tokenizer. Unlike token-level attacks (e.g., synonym replacement), character-level edits such as homoglyph substitutions, zero-width character insertions, and typos force the tokenizer to split a single semantic unit into multiple sub-word tokens. This fragmentation alters the context window used by the watermark's hashing function (e.g., the previous $h$ tokens), which corrupts the watermark keys and scores of subsequent tokens in a cascade. An adversary can exploit this with a Genetic Algorithm (GA) guided by a reference detector (a surrogate regression model trained to predict watermark scores) to identify and perturb the most effective token positions. The attack removes the watermark signal at a low character editing rate while preserving visual imperceptibility and semantic utility.
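
The cascade described above can be illustrated with a toy sketch. It assumes a scheme in which the key used to score each token is a hash of the preceding $h$ tokens; the SHA-256 key function, the example sentence, and the post-perturbation segmentation are all hypothetical stand-ins, not the hashing or tokenization of any particular scheme.

```python
import hashlib

def context_key(prev_tokens):
    """Toy stand-in for a watermark key: hash of the preceding h tokens."""
    joined = "\x1f".join(prev_tokens).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()

def keys_for(tokens, h):
    """Map each position i >= h to the key derived from tokens[i-h:i]."""
    return {i: context_key(tokens[i - h:i]) for i in range(h, len(tokens))}

h = 2
clean = ["the", "position", "of", "the", "star"]
# Hypothetical segmentation after replacing 's' with 'š' (U+0161).
perturbed = ["the", "po", "š", "ition", "of", "the", "star"]

clean_keys = keys_for(clean, h)
pert_keys = keys_for(perturbed, h)

# The unedited suffix ("of", "the", "star") is shifted by the extra fragments.
offset = len(perturbed) - len(clean)
changed = [clean[i] for i in range(h, len(clean))
           if clean_keys[i] != pert_keys[i + offset]]
print(changed)  # unedited tokens whose watermark key no longer matches
```

In this sketch a single character edit changes the key for the next $h$ unedited tokens, and the fragments of the edited word are themselves scored as new tokens. The final token keeps its key because its two-token context is untouched.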

Examples:

Homoglyph Substitution: Replacing a character with a visually similar Unicode variant to trigger token splitting.
Input: "position"
Perturbation: Replace 's' with 'š' (U+0161).
Result: The tokenizer splits "pošition" into ["po", "š", "ition"] (or similar sub-units) instead of the single token ["position"]. This changes the context hash for the next $h$ tokens.
Zero-Width Insertion: Injecting invisible characters to break token contiguity.
Input: "compound"
Perturbation: Insert a zero-width space (U+200B) between 'o' and 'u'.
Result: "compo{U+200B}und" is tokenized as ["compo", "{U+200B}", "und"].
Compound Perturbations: Combining multiple edits to bypass basic defenses.
Input: "compound"
Perturbation: Swap the adjacent 'o' and 'u', then replace the 'o' with the homoglyph 'ǒ' (U+01D2).
Result: "compuǒnd", which contains both a transposition typo and a non-ASCII character.
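
The three perturbations above can be reproduced at the string level with a minimal sketch. The helper names are hypothetical, and the code shows only the character edits; how a given tokenizer then segments the result depends on its vocabulary.

```python
import unicodedata

ZWSP = "\u200b"  # zero-width space

def substitute(word, index, replacement):
    """Replace the character at `index` with a look-alike."""
    return word[:index] + replacement + word[index + 1:]

def insert_zero_width(word, index):
    """Insert a zero-width space before the character at `index`."""
    return word[:index] + ZWSP + word[index:]

def swap_adjacent(word, index):
    """Swap the characters at `index` and `index + 1`."""
    chars = list(word)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)

# Homoglyph substitution: 's' -> 'š' (U+0161).
a = substitute("position", 2, "\u0161")
# Zero-width insertion between 'o' and 'u'.
b = insert_zero_width("compound", 5)
# Compound perturbation: swap 'o'/'u', then 'o' -> 'ǒ' (U+01D2).
c = substitute(swap_adjacent("compound", 4), 5, "\u01d2")

for text in (a, b, c):
    # Show code points so the invisible and look-alike characters are visible.
    print([f"U+{ord(ch):04X}" for ch in text if ord(ch) > 127], repr(text))
print(unicodedata.name("\u0161"), "|", unicodedata.name("\u01d2"))
```

Printing the code points matters because `b` renders identically to "compound" in most fonts even though it is one character longer.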

Impact:

Attribution Evasion: Malicious actors can strip watermarks from AI-generated text, rendering detection tools ineffective.
Copyright Circumvention: Watermarks intended for copyright protection or provenance tracking are neutralized.
Misuse Enablement: Facilitates the undetectable generation of misinformation, academic fraud, and automated phishing content by bypassing "AI-generated" flags.

Affected Systems:

KGW (Kirchenbauer et al.): Watermarking during logits generation.
Unigram (Zhao et al.): Watermarking during logits generation.
DIP (Distribution-Invariant Watermark): Watermarking during probability distribution generation.
SynthID (Google DeepMind): Watermarking during sampling.
Unbias (Wu et al.): Watermarking during probability distribution generation.
Implementations of these schemes found in libraries such as MarkLLM.

Mitigation Steps:

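Unicode Normalization (UN): Normalize text to a standard form (e.g., NFKC) before detection to map compatibility variants and special characters to their canonical representations. NFKC leaves precomposed accented letters such as 'š' and zero-width characters such as U+200B unchanged, so it does not by itself undo the homoglyph and zero-width examples above.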
Spell-Checking and Correction (SC): Distinguish valid words from perturbed tokens and automatically correct typos or anomalies prior to watermark verification.
Optical Character Recognition (OCR) Pre-processing: Render the text as an image and re-extract the content to strip invisible characters and normalize visual homoglyphs.
Anomalous Character Deletion: Identify and remove characters that are statistically rare or invalid within the target language's standard character set (e.g., zero-width spaces).
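
A detector-side pre-processing pass combining normalization and anomalous character deletion might look like the sketch below. The function names are hypothetical, and two choices are assumptions rather than part of the listed mitigations: "anomalous" is approximated as Unicode category Cf (format characters), and accented letters are folded by decomposing and dropping combining marks. Both would damage text that uses such characters legitimately (e.g., "café" or emoji joiners), and neither handles cross-script look-alikes such as Cyrillic letters.

```python
import unicodedata

def strip_format_chars(text):
    """Drop Unicode format characters (category Cf), e.g. U+200B."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def fold_diacritics(text):
    """Decompose, drop combining marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def preprocess(text):
    """Assumed pipeline: NFKC, then format-char removal, then folding."""
    text = unicodedata.normalize("NFKC", text)
    text = strip_format_chars(text)
    return fold_diacritics(text)

for sample in ("po\u0161ition", "compo\u200bund", "compu\u01d2nd"):
    print(repr(sample), "->", repr(preprocess(sample)))
```

The third sample comes back as "compuond": character-level cleanup removes the homoglyph but not the transposition, which is where a spell-checking step would still be needed.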
