LMVD-ID: 8b9c140e
Published February 1, 2024

Universal LLM Score Inflation

Affected Models:flant5-3b, llama2-7b, mistral-7b, chatgpt, gpt3.5

Research Paper

Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

View Paper

Description: Large Language Models (LLMs) used for zero-shot text assessment are vulnerable to universal adversarial attacks. Concatenating short phrases ("universal adversarial phrases") to assessed text can artificially inflate the predicted scores, regardless of the actual quality of the text. This vulnerability is particularly pronounced in LLMs performing absolute scoring, as opposed to comparative assessment.

Examples:

The paper demonstrates attacks using phrases like "amazing insightful brilliant" and others; see the paper for specifics. These phrases, when appended to any text, cause significant score inflation across various LLMs. The effectiveness varies depending on the LLM and task (absolute vs. comparative scoring).

Impact:

  • Compromise the reliability of LLM-based assessment systems in high-stakes scenarios (e.g., academic grading, model benchmarking).
  • Enable malicious actors to manipulate evaluation metrics or obtain undeserved high scores.
  • Potential for academic dishonesty and system subversion.

Affected Systems:

LLMs used for zero-shot text assessment, particularly those employing absolute scoring methods. Specific models demonstrated as vulnerable in the research include FlanT5-xl, Llama2-7B, Mistral-7B, and GPT-3.5. The vulnerability is likely to affect other similar models.

Mitigation Steps:

  • Prefer comparative assessment over absolute scoring: Comparative assessment methods demonstrated higher robustness to the described attacks.
  • Implement detection mechanisms: Utilize techniques like perplexity scoring to identify potentially adversarial inputs. This approach shows promise in detecting the described attacks but might be circumvented by sophisticated adversaries. Further research into more robust detection methods is needed.
  • Adversarial training: Train LLMs on adversarial examples to improve their robustness. However, this might negatively impact the overall performance of the model. Further research needed to explore tradeoff.

© 2025 Promptfoo. All rights reserved.