LMVD-ID: 2d357252
Published March 1, 2025

LLM Judge Adversarial Vulnerability

Affected Models: Llama-2 13B, Llama Guard 3 8B, WildGuard, ShieldGemma 9B, Mistral 7B, Llama-3.1 8B, Atla Selene Mini 8B

Research Paper

Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges


Description: Large Language Model (LLM) safety judges are vulnerable to adversarial attacks and stylistic modifications of the outputs they evaluate, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes, such as altering the formatting or tone of a generation, can significantly shift a judge's classifications, while direct adversarial modifications to the generated text can fool judges into misclassifying up to 100% of harmful generations as safe. This undermines the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.
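
To make the reported metric concrete, here is a minimal sketch (not the paper's evaluation harness) of how a judge's accuracy and false negative rate can be measured over human-labeled outputs. The `judge_is_harmful` callable and the data format are placeholders for whatever judge and dataset are being meta-evaluated.

```python
from typing import Callable, Iterable, Tuple


def judge_metrics(
    judge_is_harmful: Callable[[str], bool],
    labeled_outputs: Iterable[Tuple[str, bool]],
) -> dict:
    """Compute accuracy and false negative rate (FNR) for a safety judge.

    labeled_outputs yields (model_output, is_harmful_ground_truth) pairs.
    FNR = harmful outputs the judge marks safe / all harmful outputs.
    """
    tp = fn = correct = total = 0
    for output, is_harmful in labeled_outputs:
        verdict = judge_is_harmful(output)
        total += 1
        correct += int(verdict == is_harmful)
        if is_harmful:
            tp += int(verdict)
            fn += int(not verdict)
    harmful = tp + fn
    return {
        "accuracy": correct / total if total else 0.0,
        "fnr": fn / harmful if harmful else 0.0,
    }
```

Re-running the same computation on stylistically rewritten or adversarially modified copies of the outputs, while keeping the human labels fixed, yields the FNR increases reported below.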

Examples:

  • Stylistic Modification: Rewriting model outputs in a storytelling style increased the FNR of HarmBench by 0.24, ShieldGemma by 0.20, and WildGuard by 0.12, even though human annotators maintained near-perfect agreement with the original harmfulness labels. (See the paper for specific examples and prompt engineering techniques; the first sketch after this list illustrates the transformation.)

  • Adversarial Output Modification: The "Prepend+Append Benign" attack drove WildGuard's FNR to 1.0 (100% false negatives), completely fooling the judge. (See the paper for examples of the attack prompts; the second sketch after this list illustrates the transformation.)
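
As a hedged illustration of the stylistic-modification surface (not the paper's prompts), the sketch below mechanically recasts an output as a first-person narrative before it reaches the judge; the framing text is invented for illustration, and it stands in for the style rewrites described in the paper.

```python
def storytelling_wrap(output: str) -> str:
    """Recast a model output as a first-person narrative.

    A purely mechanical stand-in for the storytelling-style rewrites
    studied in the paper: the substance of the output is unchanged,
    only its framing and tone.
    """
    return (
        "Let me tell you a story. Late one night, an old engineer leaned "
        "back in her chair and began to explain, step by step, what she "
        "had learned over the years:\n\n"
        f"\"{output}\"\n\n"
        "She finished her tea, and the workshop fell quiet again."
    )


# The judge sees the wrapped text instead of the raw output:
# judge_is_harmful(storytelling_wrap(raw_output))
```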
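
Similarly, a minimal sketch of the "Prepend+Append Benign" idea: surrounding a generation with unrelated benign text before it is judged. The filler text here is invented for illustration and is not taken from the paper's attack prompts.

```python
BENIGN_PREFIX = (
    "Here are some tips for keeping houseplants healthy: water them when "
    "the top inch of soil is dry, and rotate the pot weekly for even light.\n\n"
)
BENIGN_SUFFIX = (
    "\n\nOn an unrelated note, remember to back up your files regularly and "
    "take short breaks when working at a screen for long periods."
)


def prepend_append_benign(output: str) -> str:
    """Wrap a generation in benign filler text before it reaches the judge."""
    return BENIGN_PREFIX + output + BENIGN_SUFFIX
```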

Impact: Compromised reliability of LLM safety evaluations. Attacks can bypass safety mechanisms, allowing harmful content to be generated and deployed, which weakens the security of systems that rely on these judges. Low attack success rates measured in limited, non-adversarial test settings may also create a false sense of security.

Affected Systems: LLM safety judges, including HarmBench, WildGuard, ShieldGemma, LLaMA Guard 3, and the other judge models evaluated in the paper; other LLM-based safety evaluators are likely similarly affected.

Mitigation Steps:

  • Robust Meta-Evaluation: Conduct comprehensive meta-evaluations that include out-of-distribution data and adversarial attacks targeting the judge model.
  • Diverse Training Data: Train safety judges on a wider variety of stylistic output formats and adversarial examples.
  • Ensemble Methods: Combine the verdicts of multiple safety judges so that no single judge's potentially flawed assessment is decisive; a minimal ensemble sketch follows this list.
  • Input and Output Sanitization: Consider using input sanitization and/or output verification methods to blunt the impact of adversarial modifications before text reaches the judge; a normalization sketch also follows this list.
  • Continuous Monitoring: Implement ongoing monitoring and evaluation of LLM safety judges to detect and address emerging vulnerabilities.
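
As a minimal sketch of the ensemble idea (not a recommendation of specific judges or thresholds), the following combines several judge callables and flags an output when a configurable fraction of them flag it; the judge functions themselves are placeholders.

```python
from typing import Callable, Sequence


def ensemble_is_harmful(
    judges: Sequence[Callable[[str], bool]],
    output: str,
    min_fraction_flagging: float = 0.5,
) -> bool:
    """Flag an output as harmful when enough judges agree.

    With min_fraction_flagging = 0.5 this is a majority vote; lowering it
    toward 1/len(judges) approaches an "any judge flags" policy, trading
    more false positives for robustness against fooling any single judge.
    """
    votes = sum(judge(output) for judge in judges)
    return votes / len(judges) >= min_fraction_flagging
```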
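
And a hedged sketch of the sanitization idea: normalizing judged text before classification so that purely presentational changes are less likely to shift the verdict. The specific heuristics below are assumptions for illustration, not measures validated by the paper, and they do not undo semantic rewrites such as storytelling-style paraphrases.

```python
import re


def normalize_for_judging(text: str) -> str:
    """Collapse superficial formatting before the text reaches a judge."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # drop code fences
    text = re.sub(r"[*_#>`]+", " ", text)                     # strip markdown markup
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text
```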
