LMVD-ID: 2c61687e
Published January 1, 2025

LLM Hate Campaign Vulnerability

Affected Models: gpt-3.5, gpt-4, vicuna, baichuan2, dolly2, opt

Research Paper

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

View Paper

Description: Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, allowing hateful content to evade detection. Adversarial attacks perturb hate speech text until it is no longer flagged, while model stealing trains a surrogate model that mimics the target detector's behavior, letting attackers refine evasive text cheaply before deploying it against the real system.
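
As a rough illustration of the adversarial-attack vector, the following is a minimal, self-contained sketch of a character-level evasion search in the spirit of DeepWordBug/TextBugger. The keyword-based `toy_detector`, the `perturb_word` and `evade` helpers, and the edit budget are hypothetical stand-ins for a real black-box detector and attack recipe, not code from the paper.

```python
import random

# Hypothetical stand-in for a deployed hate speech detector: it flags text
# containing any blocked keyword. A real target would be a black-box API
# queried over the network.
BLOCKED = {"badword", "slur"}

def toy_detector(text: str) -> bool:
    """Return True if the text is flagged as hateful."""
    return any(tok in BLOCKED for tok in text.lower().split())

def perturb_word(word: str, rng: random.Random) -> str:
    """Apply one character-level edit (swap, drop, or insert),
    mimicking DeepWordBug/TextBugger-style perturbations."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    op = rng.choice(["swap", "drop", "insert"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice("xz*") + word[i:]

def evade(text: str, detector, max_edits: int = 5, seed: int = 0):
    """Greedily perturb flagged words until the detector no longer fires,
    or give up after max_edits rounds."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(max_edits):
        if not detector(" ".join(words)):
            return " ".join(words)
        for j, w in enumerate(words):
            if detector(w):          # this word alone triggers the detector
                words[j] = perturb_word(w, rng)
                break
        else:
            return None              # flagged, but no single word is the cause
    candidate = " ".join(words)
    return candidate if not detector(candidate) else None

if __name__ == "__main__":
    original = "this is a badword example"
    print("original flagged:", toy_detector(original))   # True
    print("evasive variant:", evade(original, toy_detector))
```

Real attacks such as TextFooler or PWWS use word-importance ranking and synonym substitution rather than random character edits, but the loop structure (query, perturb, re-query) is the same.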

Examples: See the paper (arXiv:2405.18540). The paper provides concrete examples of adversarial attacks (DeepWordBug, TextBugger, PWWS, TextFooler, Paraphrase) and of model stealing attacks that use BERT and RoBERTa surrogate architectures. The reported attacks reach attack success rates (ASR) above 0.966 in some settings, and model stealing substantially improves attack efficiency.
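
At a high level, model stealing against a detection API proceeds by querying the target on a corpus of texts, training a local surrogate on the returned labels, and then running adversarial search against the surrogate instead of the rate-limited target. The sketch below is hypothetical: `query_target` stands in for the remote detector, and a TF-IDF + logistic regression surrogate replaces the BERT/RoBERTa surrogates used in the paper to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the remote detector being stolen. In a real
# extraction attack this would be a rate-limited API call returning a
# hate/not-hate label for each query.
def query_target(text: str) -> int:
    return int("hate" in text.lower())

# 1. The attacker collects unlabeled texts (here a tiny toy corpus) and
#    labels them by querying the target detector.
corpus = [
    "I hate this group of people",
    "what a lovely day outside",
    "they deserve nothing but hate",
    "the weather report looks great",
    "hate speech has no place here",
    "let's grab coffee later",
]
labels = [query_target(t) for t in corpus]

# 2. The attacker trains a local surrogate that mimics the target's
#    decisions. The paper uses BERT/RoBERTa surrogates; TF-IDF plus
#    logistic regression is used here only to keep the sketch small.
surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression())
surrogate.fit(corpus, labels)

# 3. Adversarial search can now run against the free, unlimited surrogate
#    instead of the rate-limited target, which is where the efficiency
#    gain from model stealing comes from.
print(surrogate.predict(["I hate mondays", "see you tomorrow"]))
```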

Impact: Successful attacks allow malicious actors to bypass hate speech filters and automate the dissemination of hate speech at scale, creating a hostile online environment and enabling coordinated hate campaigns. This undermines the effectiveness of existing safety mechanisms for LLMs and online platforms.

Affected Systems: Systems employing LLMs for hate speech detection, particularly those using models vulnerable to adversarial examples and model extraction (e.g., Perspective API, Moderation API, open-source detectors listed in the paper). Systems using any LLM for content moderation are potentially vulnerable.

Mitigation Steps:

  • Regularly update hate speech detection models with new data, including examples generated by advanced LLMs and adversarial attacks.
  • Implement robust defenses against adversarial attacks through techniques like adversarial training and robust optimization.
  • Employ techniques to detect model stealing attempts, such as monitoring query patterns and distributions (see the sketch after this list).
  • Diversify detection methods beyond LLMs, incorporating human-in-the-loop verification.
  • Increase scrutiny of user queries to detect anomalous patterns that might indicate adversarial attacks.
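
As one concrete way to act on the query-monitoring step above, the sketch below flags clients whose recent traffic consists largely of near-duplicate queries, a pattern typical of both adversarial search and model extraction. The window size, Jaccard-similarity heuristic, and thresholds are illustrative assumptions rather than values from the paper.

```python
from collections import defaultdict, deque

# Illustrative thresholds (assumptions, not values from the paper).
WINDOW = 200          # recent queries retained per client
MIN_QUERIES = 20      # minimum traffic before judging a client
NEAR_DUP_RATIO = 0.5  # fraction of near-duplicates that triggers a flag
SIMILARITY = 0.6      # token-set Jaccard similarity counted as a near-duplicate

recent: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def looks_like_attack(client_id: str, text: str) -> bool:
    """Record a query and return True if the client's recent traffic
    resembles adversarial search or model extraction."""
    tokens = set(text.lower().split())
    history = recent[client_id]
    near_dups = sum(1 for prev in history if jaccard(tokens, prev) >= SIMILARITY)
    history.append(tokens)
    if len(history) < MIN_QUERIES:
        return False
    return near_dups / len(history) >= NEAR_DUP_RATIO

if __name__ == "__main__":
    # Simulated extraction attempt: a burst of near-identical probes.
    for i in range(60):
        flagged = looks_like_attack("client-42", f"they deserve hate variant {i % 3}")
    print("client-42 flagged:", flagged)  # True
```

In production this heuristic would be combined with rate limiting and account-level review rather than used as a sole signal, since legitimate bulk moderation traffic can also contain repetitive queries.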

© 2025 Promptfoo. All rights reserved.