RAFT: Realistic LLM Detector Evasion
Research Paper
RAFT: Realistic Attacks to Fool Text Detectors
Description: Large Language Model (LLM) detectors are vulnerable to a realistic adversarial attack ("RAFT") that substitutes words in machine-generated text to evade detection. The attack uses an auxiliary LLM to propose replacement words and selects substitutions by their impact on the target detector's score, while maintaining grammatical correctness and semantic coherence. This allows the attacker to reduce the probability of detection by up to 99% while preserving text quality, making the altered text indistinguishable from human-written text to human evaluators.
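The substitution loop described above can be sketched as a greedy search over candidate replacements. This is a minimal illustration, not the paper's implementation: `detector_score` and `candidate_substitutions` are toy stand-ins for the real target detector and the auxiliary LLM's proposals.

```python
def detector_score(text: str) -> float:
    # Toy stand-in for a real detector: pretend longer average word
    # length looks "more machine-generated". A real attack would query
    # the target detector (or a proxy) here.
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def candidate_substitutions(word: str) -> list[str]:
    # Toy stand-in for the auxiliary LLM's replacement proposals.
    synonyms = {
        "utilize": ["use"],
        "demonstrate": ["show"],
        "approximately": ["about"],
    }
    return synonyms.get(word, [])

def raft_style_attack(text: str, budget: int = 3) -> str:
    """Greedily apply up to `budget` single-word substitutions,
    each time picking the replacement that most lowers the score."""
    words = text.split()
    for _ in range(budget):
        base = detector_score(" ".join(words))
        best = None  # (score_drop, index, replacement)
        for i, w in enumerate(words):
            for cand in candidate_substitutions(w):
                trial = words[:i] + [cand] + words[i + 1:]
                drop = base - detector_score(" ".join(trial))
                if drop > 0 and (best is None or drop > best[0]):
                    best = (drop, i, cand)
        if best is None:
            break  # no substitution lowers the score further
        _, i, cand = best
        words[i] = cand
    return " ".join(words)
```

In the real attack, the candidate set comes from an LLM constrained to grammatical, meaning-preserving replacements, which is what keeps the output fluent to human readers.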
Examples: See the RAFT paper (https://arxiv.org/abs/2405.18540) for examples of adversarial text generated by the attack and its impact on various detectors' scores. The paper includes adversarial modifications of GPT-3.5-turbo-generated text that reduce detection scores across multiple target detectors.
Impact: Successful attacks significantly reduce the effectiveness of LLM detectors, allowing malicious actors to bypass detection mechanisms for machine-generated content such as disinformation, phishing attempts, and academic dishonesty. The attack's realistic nature makes detection by humans extremely difficult.
Affected Systems: All LLM detectors tested in the RAFT paper, and potentially any LLM detector relying on statistical properties of generated text. This includes, but is not limited to, Log Likelihood, Log Rank, DetectGPT, Fast-DetectGPT, Ghostbuster, and Raidar.
Mitigation Steps:
- Develop more robust LLM detectors that are less susceptible to adversarial attacks based on word-level substitutions that maintain grammatical correctness.
- Incorporate adversarial training techniques using examples generated by attacks like RAFT to improve the robustness of existing detectors.
- Implement multi-layered detection mechanisms that combine different detection methods to reduce the impact of individual vulnerabilities.
- Develop techniques to detect and flag text that exhibits statistically unusual word choices, even if the text is grammatically correct and semantically coherent.
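The multi-layered detection step above can be sketched as a simple score ensemble. The averaging rule, threshold, and detector interfaces below are illustrative assumptions, not a specific published method:

```python
from statistics import mean
from typing import Callable

def ensemble_detect(text: str,
                    detectors: list[Callable[[str], float]],
                    threshold: float = 0.5) -> bool:
    """Flag text as machine-generated when the mean score across
    detectors exceeds the threshold. Each detector is assumed to
    return a probability in [0, 1]; combining several makes a
    word-substitution attack have to fool all of them at once."""
    scores = [d(text) for d in detectors]
    return mean(scores) > threshold

# Usage with stand-in detectors (real ones would be e.g. a
# log-likelihood scorer and a perturbation-based scorer):
flagged = ensemble_detect(
    "sample text",
    [lambda t: 0.9, lambda t: 0.2, lambda t: 0.7],
)
```

Averaging is the simplest combination rule; a stricter deployment might flag text when any single detector exceeds its own calibrated threshold, trading more false positives for robustness.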
© 2025 Promptfoo. All rights reserved.