LLM Judge Prompt Injection
Research Paper
Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
Description: Large Language Models (LLMs) used to evaluate text quality (LLM-as-a-Judge architectures) are vulnerable to prompt-injection attacks. Maliciously crafted suffixes appended to input text can manipulate the LLM's judgment, causing it to favor a predetermined response even when another response is objectively superior. Two attack vectors are identified: the Comparative Undermining Attack (CUA), which directly targets the final decision, and the Justification Manipulation Attack (JMA), which alters the model's generated reasoning.
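To make the injection point concrete, here is a minimal sketch of a pairwise judge prompt with an adversarial suffix appended to one candidate. The template wording, function name, and placeholder suffix are illustrative assumptions, not the paper's actual prompts:

```python
# Minimal sketch of an LLM-as-a-Judge comparison prompt and the injection
# point exploited by CUA/JMA. All strings are illustrative placeholders.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the
question below. First state your verdict ('A' or 'B'), then justify it.

Question: {question}
Response A: {response_a}
Response B: {response_b}
Verdict:"""

# Placeholder for a GCG-optimized token sequence; real suffixes are
# machine-searched token strings rather than hand-written text.
ADV_SUFFIX = " <optimized adversarial tokens>"

def build_judge_prompt(question: str, response_a: str, response_b: str,
                       inject: bool = False) -> str:
    """Assemble the judge prompt; when inject=True, the attacker-controlled
    candidate B carries the appended adversarial suffix."""
    if inject:
        response_b += ADV_SUFFIX
    return JUDGE_TEMPLATE.format(question=question,
                                 response_a=response_a,
                                 response_b=response_b)
```

In these terms, CUA optimizes the suffix so the token emitted after "Verdict:" flips to the attacker's choice, while JMA instead steers the justification text that follows the verdict.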
Examples: See arXiv:2405.18540 (note: this identifier may be a placeholder; verify it against the published paper). The paper details adversarial suffixes generated with the Greedy Coordinate Gradient (GCG) optimization method, with successful attacks reaching over 30% Attack Success Rate (ASR) for CUA and 15-17% for JMA.
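For orientation, below is a minimal sketch of a single GCG gradient step under common assumptions (a Hugging Face causal LM and a cross-entropy loss on the attacker's target verdict); it is not the paper's implementation, and the model, prompt, and target strings are placeholders. GCG relaxes each suffix token to a one-hot vector over the vocabulary, differentiates the target loss with respect to those vectors, and treats the most negative gradient coordinates as candidate token swaps:

```python
# Hedged sketch of one Greedy Coordinate Gradient (GCG) step, not the
# paper's code. Model choice, prompt, and target token are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # one of the judge models named below
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

prompt = "You are a judge. Answer 'A' or 'B'.\nA: good answer\nB: bad answer\nVerdict:"
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
suffix_ids = tok(" ! ! ! ! ! ! ! !", add_special_tokens=False,
                 return_tensors="pt").input_ids[0]   # initial suffix
target_ids = tok(" B", add_special_tokens=False,
                 return_tensors="pt").input_ids[0]   # attacker's desired verdict

embed = model.get_input_embeddings()

def suffix_token_gradients():
    # One-hot relaxation of the suffix so the loss is differentiable
    # with respect to each (position, vocabulary entry) choice.
    one_hot = torch.zeros(len(suffix_ids), embed.num_embeddings)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    seq = torch.cat([embed(prompt_ids),          # fixed judge prompt
                     one_hot @ embed.weight,     # differentiable suffix
                     embed(target_ids)], dim=0)  # forced verdict tokens
    logits = model(inputs_embeds=seq.unsqueeze(0)).logits[0]

    # Logits at position i predict token i+1, hence the shift by one.
    start = len(prompt_ids) + len(suffix_ids)
    loss = F.cross_entropy(logits[start - 1 : start - 1 + len(target_ids)],
                           target_ids)
    loss.backward()
    return one_hot.grad                          # (suffix_len, vocab_size)

grad = suffix_token_gradients()
# Greedy coordinate step: per position, the most negative gradient entries
# are the most promising single-token substitutions to evaluate next.
candidates = (-grad).topk(k=8, dim=1).indices
```

A full GCG loop repeats this step many times, evaluating a batch of sampled candidate swaps with forward passes and keeping the substitution that most reduces the loss.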
Impact: Compromised evaluation leads to inaccurate assessments of generated-text quality. This affects applications that rely on LLMs for evaluation, such as Reinforcement Learning from Human Feedback (RLHF), automated validation in crowdsourcing, and intelligent search and retrieval systems. Attackers can manipulate the judge to promote inferior content or suppress superior content.
Affected Systems: Systems employing open-source instruction-tuned LLMs (such as Qwen2.5-3B-Instruct and Falcon3-3B-Instruct) in LLM-as-a-Judge architectures, or similar models vulnerable to prompt injection.
Mitigation Steps:
- Develop and implement robust defense mechanisms against prompt injection attacks. The paper suggests exploring techniques like goal prioritization and attention tracking.
- Utilize adversarial training techniques to improve the robustness of LLM-as-a-Judge models to prompt injection.
- Employ input sanitization and filtering techniques to identify and remove potentially malicious suffixes (a sketch of one such filter follows this list).
- Regularly audit and update LLM models to address vulnerabilities as they are discovered.
- Implement verification methods to independently validate the evaluations produced by the LLM-as-a-Judge system.
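As a concrete instance of the sanitization bullet above, the sketch below screens inputs with a perplexity filter, a known heuristic against GCG-style suffixes, which tend to be highly disfluent token strings. The scorer model, window size, and threshold are illustrative assumptions, and this is a heuristic screen, not a complete defense:

```python
# Hedged sketch of a perplexity-based screen applied to candidate responses
# before they reach the judge. Scorer model and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

scorer_name = "gpt2"  # any small causal LM can serve as a fluency scorer
tok = AutoTokenizer.from_pretrained(scorer_name)
scorer = AutoModelForCausalLM.from_pretrained(scorer_name)
scorer.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels=input_ids, HF causal LMs return the mean next-token
    # cross-entropy; exponentiating gives perplexity.
    loss = scorer(ids, labels=ids).loss
    return torch.exp(loss).item()

def passes_screen(text: str, threshold: float = 200.0) -> bool:
    """Reject inputs whose tail is implausibly disfluent; GCG-style
    suffixes usually score far above ordinary prose."""
    # Score the final window separately so a fluent body cannot mask
    # a gibberish suffix appended at the end.
    tail = " ".join(text.split()[-32:])
    return perplexity(text) < threshold and perplexity(tail) < threshold
```

For the verification bullet, one cheap independent check is to re-run each comparison with the candidate order swapped and flag verdicts that are not position-consistent.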