Color-Aware Watermark Bypass

Description: Self Color Testing-based Substitution (SCTS) is a color-aware attack that bypasses LLM watermarks designed to identify AI-generated text. SCTS exploits the watermarked LLM's compliance with instructions to infer the "color" of tokens (whether each token falls on the green or red list). The attacker then substitutes green, watermark-favored tokens with red ones, so the rewritten text evades watermark detection. The attack is particularly effective against watermarks that perturb logits to bias token selection.
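
The green/red mechanism described above can be illustrated with a minimal sketch of a logit-perturbation-style detector. It is not the implementation of the UMD or Unigram schemes: the key, the value of GAMMA, the SHA-256 based coloring, and the helper build_sequence are all assumptions made for illustration. The detector counts green tokens and reports how far that count lies above what chance would give.

```python
import hashlib
import math

KEY = b"demo-key"   # hypothetical secret key, not taken from the paper
GAMMA = 0.5         # assumed fraction of the vocabulary that is green

def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly color `token`, seeded by the key and the previous token."""
    data = KEY + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def z_score(tokens: list[int]) -> float:
    """Standardized excess of green tokens over the GAMMA * T expected by chance."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed its color
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))

def build_sequence(length: int, want_green: bool, vocab_size: int = 1000) -> list[int]:
    """Toy generator: always emit the lowest token id that has the requested color."""
    tokens = [0]
    for _ in range(length):
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab_size) if is_green(prev, t) == want_green))
    return tokens
```

In this toy setup, build_sequence(100, True) stands in for heavily watermarked text and scores z = 10, while build_sequence(100, False) stands in for text whose tokens have all been swapped to red and scores z = -10. An attacker who knows each token's color can therefore push the score below any detection threshold.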

Examples: The paper provides example prompts for the Self Color Testing (SCT) phase, which determines token colors. These prompts rely on deterministic decoding (temperature=0) so that the model produces predictable outputs whose token frequencies vary, revealing the color assignment. The paper also provides examples of the substitution phase, showing how green tokens are replaced with red tokens based on the SCT output. See arXiv:2405.18540 for the detailed examples and algorithms.
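
The frequency-based inference in the SCT phase can be sketched as follows. This is an illustrative reading, not the paper's algorithm: it assumes the model was asked to emit two candidate words in a separator-delimited string, and the function name, separator, and z_threshold value are hypothetical. If one candidate appears far more often than the other, it is taken to be the greener of the two.

```python
import math
from collections import Counter

def infer_relative_color(generated: str, cand_a: str, cand_b: str,
                         sep: str = ";", z_threshold: float = 3.0) -> str:
    """Compare how often two candidates occur in the model's output."""
    counts = Counter(piece.strip() for piece in generated.split(sep))
    n_a, n_b = counts[cand_a], counts[cand_b]
    total = n_a + n_b
    if total == 0:
        return "inconclusive"  # neither candidate was produced
    # Under equal preference, (n_a - n_b) has mean 0 and std sqrt(total).
    z = (n_a - n_b) / math.sqrt(total)
    if z > z_threshold:
        return "a_greener"
    if z < -z_threshold:
        return "b_greener"
    return "same_color"
```

For example, an output with 80 occurrences of one candidate and 20 of the other gives z = 6 and is classified as a color difference, while a 50/50 split is classified as the same color.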

Impact: Successful attacks using SCTS render LLMs' watermarking mechanisms ineffective, enabling the proliferation of undetectable AI-generated content. This impacts applications relying on watermarks for provenance tracking, academic integrity verification, and detection of misinformation campaigns.

Affected Systems: Large language models (LLMs) whose watermarks rely on logit perturbation and whose token colors can be inferred through prompting. The paper demonstrates successful attacks against Vicuna-7b-v1.5-16k and Llama-2-7b-chat-hf under both the UMD and Unigram watermarking schemes.

Mitigation Steps:

Improved Watermarking: Develop watermarking schemes that resist color-inference attacks, for example by making a token's color significantly harder to identify through prompting.
Enhanced Detection: Make watermark detection resilient to token manipulation and substitution.
Robustness Testing: Rigorously test watermarking schemes against various attack vectors, including color-aware approaches, before deploying them in real-world applications.
Obfuscation: Obfuscate the watermarking logic to hinder reverse engineering and attack development.
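
As a starting point for robustness testing, the sketch below estimates how many green-to-red substitutions a color-aware attacker needs before a z-score detector stops firing. It assumes a standard green-count z-test, that a text is flagged only when z exceeds z_threshold, and that each substitution turns exactly one green token red without changing any other token's color. That last assumption holds for a context-free scheme and is only an approximation when colors depend on preceding tokens. The function name and default values are hypothetical.

```python
import math

def substitutions_to_evade(green_count: int, total: int,
                           gamma: float = 0.5, z_threshold: float = 4.0) -> int:
    """Minimum number of green tokens to turn red so that z <= z_threshold."""
    # Largest green count that still satisfies
    # (green - gamma * total) / sqrt(total * gamma * (1 - gamma)) <= z_threshold
    max_green = gamma * total + z_threshold * math.sqrt(total * gamma * (1 - gamma))
    return max(0, green_count - math.floor(max_green))

# A 200-token text with 180 green tokens needs 52 substitutions to evade
# detection at threshold 4; one with 120 green tokens is never flagged.
print(substitutions_to_evade(180, 200), substitutions_to_evade(120, 200))
```

A scheme under test can be compared against this budget: if an attack evades detection with roughly this many targeted substitutions, the watermark offers little margin against color-aware adversaries.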
