Typos Undermine Watermarks
Research Paper
Character-Level Perturbations Disrupt LLM Watermarks
Description: Inference-time watermarking schemes for Large Language Models (LLMs) are vulnerable to evasion via character-level perturbations that disrupt tokenization. Unlike token-level attacks (e.g., synonym replacement), character-level edits such as homoglyph substitutions, zero-width character insertions, and typos force the tokenizer to split a single word into multiple sub-word tokens. This fragmentation alters the context window fed to the watermark's hashing function (e.g., the previous $h$ tokens), so the watermark keys and scores of subsequent tokens are corrupted in a cascade. An adversary can exploit this with a Genetic Algorithm (GA) guided by a reference detector (a surrogate regression model trained to predict watermark scores) to identify and perturb the most effective token positions. The watermark signal can then be removed at a low character editing rate while preserving visual imperceptibility and semantic utility.
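The cascade described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `watermark_key` is a hypothetical stand-in for a scheme's context hash (not the hash used by any of the affected systems), the secret is a placeholder, and the token lists are written by hand rather than produced by a real tokenizer.

```python
import hashlib

def watermark_key(context, secret="demo-secret"):
    """Toy context hash: derive a key from the previous h tokens (hypothetical)."""
    payload = secret + "\x1f" + "\x1f".join(context)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]

def keys_from(tokens, start, h):
    """Keys used to score tokens[start:], each derived from its h preceding tokens."""
    return [watermark_key(tokens[max(0, t - h):t]) for t in range(start, len(tokens))]

h = 2
# Hand-written tokenizations (assumed): one token before the edit, three after.
original = ["The", "position", "of", "the", "robot", "is", "fixed"]
perturbed = ["The", "po", "\u0161", "ition", "of", "the", "robot", "is", "fixed"]

# The untouched suffix "of the robot is fixed" starts at index 2 and 4 respectively.
orig_keys = keys_from(original, 2, h)
pert_keys = keys_from(perturbed, 4, h)

# True where an unmodified token is now scored under a different key.
changed = [a != b for a, b in zip(orig_keys, pert_keys)]
print(changed)
```

In this toy setup, one edited character changes the key for the next `h` unmodified tokens. The fragments of the split word are also scored under contexts that never occurred at generation time.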
Examples:
- Homoglyph Substitution: Replacing a character with a visually similar Unicode variant to trigger token splitting.
  - Input: "position"
  - Perturbation: Replace 's' with 'š' (U+0161).
  - Result: The tokenizer splits "pošition" into ["po", "š", "ition"] (or similar sub-units) instead of the single token ["position"]. This changes the context hash for the next $h$ tokens.
- Zero-Width Insertion: Injecting invisible characters to break token contiguity.
  - Input: "compound"
  - Perturbation: Insert a zero-width space (U+200B) between 'o' and 'u'.
  - Result: "compo{U+200B}und" is tokenized as ["compo", "{U+200B}", "und"].
- Compound Perturbations: Combining multiple edits to bypass basic defenses.
  - Input: "compound"
  - Perturbation: Swap 'o' and 'u' combined with homoglyph substitution.
  - Result: "compuǒnd" (where 'ǒ' is U+01D2).
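The three perturbations above can be reproduced with plain string operations. This is a minimal sketch: the helper names (`substitute`, `insert_zero_width`, `swap_adjacent`) are hypothetical, and the code only builds the perturbed strings. How a given tokenizer then splits them depends on its vocabulary and is not shown.

```python
ZERO_WIDTH_SPACE = "\u200b"

def substitute(word, index, replacement):
    """Replace the character at `index` (e.g. with a homoglyph)."""
    return word[:index] + replacement + word[index + 1:]

def insert_zero_width(word, index):
    """Insert an invisible zero-width space before position `index`."""
    return word[:index] + ZERO_WIDTH_SPACE + word[index:]

def swap_adjacent(word, index):
    """Swap the characters at `index` and `index + 1` (a typo-style edit)."""
    chars = list(word)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)

# Homoglyph substitution: 's' -> U+0161
homoglyph = substitute("position", 2, "\u0161")
# Zero-width insertion between 'o' and 'u'
zero_width = insert_zero_width("compound", 5)
# Compound perturbation: swap 'o'/'u', then replace 'o' with U+01D2
compound = substitute(swap_adjacent("compound", 4), 5, "\u01d2")

for text in (homoglyph, zero_width, compound):
    # Print code points so the invisible and look-alike characters are explicit.
    print(" ".join(f"U+{ord(ch):04X}" for ch in text))
```

Printing code points rather than the strings themselves makes the edits visible. The zero-width variant renders identically to "compound" in most fonts but is one character longer.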
Impact:
- Attribution Evasion: Malicious actors can strip watermarks from AI-generated text, rendering detection tools ineffective.
- Copyright Circumvention: Watermarks intended for copyright protection or provenance tracking are neutralized.
- Misuse Enablement: Facilitates the undetectable generation of misinformation, academic fraud, and automated phishing content by bypassing "AI-generated" flags.
Affected Systems:
- KGW (Kirchenbauer et al.): Watermarking during logits generation.
- Unigram (Zhao et al.): Watermarking during logits generation.
- DIP (Distribution-Invariant Watermark): Watermarking during probability distribution generation.
- SynthID (Google DeepMind): Watermarking during sampling.
- Unbias (Wu et al.): Watermarking during probability distribution generation.
- Implementations of these schemes found in libraries such as MarkLLM.
Mitigation Steps:
- Unicode Normalization (UN): Normalize input text to a standard form (e.g., NFKC) before detection to map homoglyphs and special characters to their canonical representations.
- Spell-Checking and Correction (SC): Distinguish valid words from perturbed tokens and automatically correct typos or anomalies before watermark verification.
- Optical Character Recognition (OCR) Pre-processing: Render the text as an image and re-extract the content to strip invisible characters and normalize visual homoglyphs.
- Anomalous Character Deletion: Identify and remove characters that are statistically rare or invalid within the target language's standard character set (e.g., zero-width spaces).
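Two of these steps, Unicode normalization and anomalous character deletion, can be sketched with Python's standard `unicodedata` module. The function names are hypothetical, and the choice to delete every format-category (`Cf`) character is an assumption. A production filter would need a language-aware allow-list, because some scripts use such characters legitimately. Note that in this sketch NFKC folds compatibility variants (for example fullwidth letters) but leaves precomposed letters such as 'š' unchanged, so a separate and more aggressive diacritic-stripping pass is shown as an optional step.

```python
import unicodedata

def normalize_for_detection(text):
    """NFKC-normalize, then drop invisible format characters (category Cf)."""
    text = unicodedata.normalize("NFKC", text)
    # U+200B (zero-width space) has category Cf and is removed here.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def strip_diacritics(text):
    """Optional, lossy: decompose and drop combining marks (category Mn).

    Assumption: acceptable only when the target language does not rely on
    accented letters, since legitimate accents are removed as well.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(normalize_for_detection("compo\u200bund"))      # zero-width space removed
print(normalize_for_detection("\uff50osition"))       # fullwidth 'p' folded to ASCII
print(strip_diacritics("po\u0161ition"))              # caron removed from 's'
```

Neither pass reverses the character swap in "compuǒnd". After diacritic stripping it becomes "compuond", which still needs the spell-checking step to recover "compound".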
© 2026 Promptfoo. All rights reserved.