LMVD-ID: 761a8b38
Published June 1, 2025

LLM Judge Subversion

Affected Models: GPT-4o, Llama 3.1 8B, Llama 3.3 70B, Mistral 7B, DeepSeek-R1, Qwen 2.5 7B

Research Paper

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

View Paper

Description: Alibaba Cloud PAI-Judge and PAI-Judge-Plus are vulnerable to a composite adversarial attack that exploits attention mechanism limitations in Large Language Models (LLMs). An authenticated attacker can manipulate automated evaluation outcomes by appending a long, irrelevant text suffix (approximately 1000 to 2000+ characters) to a response containing adversarial perturbations. This "long-suffix" strategy overwhelms the judge model's context window, causing the attention mechanism to degrade and fail to focus on the core adversarial content or quality flaws. Consequently, the system assigns significantly inflated scores to low-quality or malicious submissions, bypassing internal defenses such as prompt filtering and output sanitization.

Examples: To reproduce this attack against the PAI-Judge platform:

  1. Preparation: Craft a response (r_m) that contains significant quality flaws (warranting a near-zero score) but includes semantic manipulations optimized with automated attack methods (e.g., GCG, a gradient-based attack, or PAIR, an iterative LLM-driven attack) to exploit the judge's decision boundaries.
  2. Suffix Injection: Append a long string of irrelevant content to r_m. The length required to trigger the vulnerability depends on the platform version:
  • Standard PAI-Judge: append ≈1000 characters of filler text.
  • PAI-Judge-Plus: append >2000 characters of filler text.
  3. Payload Structure:
[Optimized adversarial response containing flaws] + [1500 characters of repetitive filler text, e.g., irrelevant facts or distracting narratives]
  4. Submission: Submit the composite payload to the PAI-Judge public API for evaluation.
  5. Observation: The system returns an inflated score (e.g., rising from a baseline of 1.5 to >6.0 on a 10-point scale) despite the response containing verifiable errors.
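The payload-construction steps above can be sketched as follows. This is a minimal illustration of the composite-payload structure only; the function name, filler text, and length parameter are illustrative assumptions, not part of the PAI-Judge API.

```python
def build_composite_payload(adversarial_response: str,
                            suffix_len: int = 1500) -> str:
    """Append ~suffix_len characters of irrelevant filler text to an
    adversarial response, mirroring the "long-suffix" strategy described
    above. The filler sentence is an arbitrary placeholder."""
    filler = "The quick brown fox jumps over the lazy dog. "
    # Repeat the filler until it covers suffix_len, then hard-trim.
    repeats = suffix_len // len(filler) + 1
    suffix = (filler * repeats)[:suffix_len]
    return adversarial_response + "\n\n" + suffix

# Example: a flawed response followed by 1500 characters of distraction.
payload = build_composite_payload(
    "Response containing deliberate factual errors...", suffix_len=1500)
```

The resulting string would then be submitted to the judge endpoint as the response under evaluation (step 4 above).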

Impact:

  • Integrity Violation: Automated benchmarks, academic grading systems, and leaderboards relying on PAI-Judge can be manipulated to rank low-quality submissions higher than legitimate high-quality ones.
  • Bypass of Moderation: Content moderation filters relying on LLM-as-a-Judge can be evaded, allowing malicious or policy-violating content to pass as "safe" or "high quality."
  • Resource Wastage: Downstream processes relying on accurate quality filtering may be flooded with low-quality content that the judge incorrectly rates as high quality, wasting compute and human review time.

Affected Systems:

  • Alibaba Cloud PAI-Judge (Standard Version)
  • Alibaba Cloud PAI-Judge-Plus
  • General LLM-as-a-Judge systems lacking long-context robustness mechanisms.

Mitigation Steps:

  • Retokenization: Implement input retokenization prior to evaluation to disrupt adversarial token patterns and suffix structures.
  • LLM-based Detection: Deploy a separate, naive LLM-based detector to pre-screen submissions for adversarial anomalies or unusual filler content before passing them to the judge model.
  • Input Sanitization: Enforce strict length limits or truncate excessive suffixes that do not contribute to the semantic meaning of the response.
  • Prompt Optimization: Use coordinate-ascent search over judge prompt templates to identify and deploy prompt components that are less sensitive to attention degradation.
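Two of the mitigations above can be sketched as simple pre-screening functions. This is an illustrative sketch, not a production defense: the length limit and repetition threshold are assumed values that would need tuning per deployment.

```python
MAX_RESPONSE_CHARS = 1000  # illustrative cap; attacks above used ~1000+ char suffixes

def truncate_suffix(response: str, limit: int = MAX_RESPONSE_CHARS) -> str:
    """Input sanitization: hard-cap response length before it reaches
    the judge model, discarding any excessive suffix."""
    return response if len(response) <= limit else response[:limit]

def repetition_ratio(response: str) -> float:
    """Crude filler heuristic for pre-screening: fraction of duplicate
    whitespace-separated tokens. Long repetitive suffixes score high."""
    tokens = response.split()
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)
```

A gatekeeper could reject or flag submissions whose `repetition_ratio` exceeds a tuned threshold (e.g., 0.8) before forwarding them to the judge; truncation alone does not detect adversarial content, it only limits how much filler the attention mechanism must absorb.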

© 2026 Promptfoo. All rights reserved.