LMVD-ID: 6df5d2d0
Published June 1, 2025

Breaking the LLM Reviewer

Affected Models: GPT-4o, Llama 3.3 70B, Mistral Large

Research Paper

Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks


Description: Large Language Models (LLMs) deployed in automated peer review workflows are vulnerable to targeted textual adversarial attacks. Using a technique termed "Attack Focus Localization," an attacker identifies critical document segments via Longest Common Subsequence (LCS) matching between the original text and an initial LLM-generated review. Injecting semantics-preserving perturbations into these localized segments, such as character-level noise, synonym substitution (e.g., TextFooler), or stylistic transfer (e.g., StyleAdv), causes the LLM to inflate quality scores to a statistically significant degree (e.g., boosting "Soundness" or "Originality" ratings) and to suppress negative aspect tags. The perturbations evade standard AI-text detectors, allowing manipulated manuscripts to receive favorable automated assessments without any change to the paper's actual scientific contribution.
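
The localization step can be pictured with a short, dependency-free sketch: a word-level LCS between the paper and its clean review marks which paper tokens the reviewer echoes, and those indices become the perturbation targets. Here `generate_review` is a hypothetical wrapper around the target LLM reviewer, and word-level splitting is an assumption; the released pipeline may match at a different granularity.

```python
# Sketch of "Attack Focus Localization" via word-level LCS.
# Assumptions: `generate_review` wraps the target LLM reviewer (hypothetical);
# the original implementation's tokenization/granularity may differ.

def lcs_indices(paper_tokens, review_tokens):
    """Indices of paper tokens lying on a longest common subsequence with the
    review tokens, i.e. candidate segments for perturbation."""
    n, m = len(paper_tokens), len(review_tokens)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # LCS lengths over suffixes
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if paper_tokens[i] == review_tokens[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    matched, i, j = [], 0, 0
    while i < n and j < m:                       # backtrack one LCS path
        if paper_tokens[i] == review_tokens[j]:
            matched.append(i)
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched

def localize(paper_text, generate_review):
    """Compute M = LCS(x_clean, f_r(x_clean)) as modifiable word indices."""
    review = generate_review(paper_text)         # f_r(x_clean)
    return lcs_indices(paper_text.split(), review.split())
```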

Examples:

  • Repository: See https://github.com/Lin-TzuLing/Breaking-the-Reviewer.git for an implementation of the Attack Focus Localization and perturbation pipeline.
  • Attack Reproduction:
  1. Localization: Input a target paper $x_{clean}$ into the LLM to generate a preliminary review $f_r(x_{clean})$. Compute the Longest Common Subsequence to find modifiable indices $\mathcal{M} = \text{LCS}(x_{clean}, f_r(x_{clean}))$.
  2. Perturbation: Apply TextFooler or StyleAdv to the segments identified in $\mathcal{M}$.
    • Example Perturbation (StyleAdv): Transforming "The method achieves high accuracy" into "Verily, the approach attaineth great precision" (a stylistic transfer); TextFooler instead performs synonym substitution on the same localized segments.
  3. Result: The attacked input $x_{adv}$ yields a score shift $f_s(x_{adv}) - f_s(x_{clean}) \ge 1.0$ on a 10-point scale relative to the clean input (see the sketch below).
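
A companion sketch for steps 2 and 3, under stated simplifications: a character-swap edit stands in for TextFooler/StyleAdv so the example stays dependency-free, `modifiable_indices` comes from the localization sketch above, and `score_paper` is a hypothetical wrapper that extracts the numeric rating $f_s$ from the reviewer's output.

```python
import random

# Sketch of steps 2-3: perturb only the localized tokens, then measure the
# score shift. A character swap stands in for TextFooler/StyleAdv here;
# `score_paper` is a hypothetical wrapper returning the reviewer's numeric
# rating f_s for a given manuscript.

def char_noise(token, rng):
    """Swap two interior characters -- a small, meaning-preserving edit."""
    if len(token) < 4:
        return token
    i = rng.randrange(1, len(token) - 2)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

def perturb(paper_text, modifiable_indices, seed=0):
    """Apply character-level noise only at the indices found by localization."""
    rng = random.Random(seed)
    tokens = paper_text.split()
    for i in modifiable_indices:
        tokens[i] = char_noise(tokens[i], rng)
    return " ".join(tokens)                      # x_adv

def score_shift(paper_text, modifiable_indices, score_paper):
    """Return f_s(x_adv) - f_s(x_clean); the attack targets shifts >= 1.0."""
    x_adv = perturb(paper_text, modifiable_indices)
    return score_paper(x_adv) - score_paper(paper_text)
```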

Impact:

  • Integrity Compromise: Automated review systems can be manipulated into recommending acceptance of low-quality or reject-tier papers.
  • Score Inflation: Attackers can artificially boost paper scores across multiple metrics (Substance, Clarity, Impact).
  • Sentiment Manipulation: The attack successfully suppresses negative sentiment aspect tags in the generated review text, leading to overly optimistic textual feedback.

Affected Systems:

  • Automated Peer Review systems utilizing the following models (and likely others sharing similar architectures):
    • OpenAI GPT-4o
    • OpenAI GPT-4o-mini
    • Meta Llama-3.3-70B
    • Mistral Small 3.1

Mitigation Steps:

  • Human-in-the-Loop Verification: Do not rely solely on LLM-generated scores or reviews for decision-making; maintain strict human oversight for final acceptance/rejection decisions.
  • Inference-Time Monitoring: Implement monitoring to detect statistical anomalies in review generation patterns, although current zero-shot detectors (e.g., GPTZero) are ineffective against these specific adversarial perturbations; see the sketch after this list for one illustrative consistency check.
  • Adversarial Robustness Testing: Evaluate LLM reviewers against known textual attack benchmarks (TextFooler, BERT-Attack) prior to deployment in high-stakes academic review pipelines.
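
As one illustration of the monitoring idea (an assumption, not a defense evaluated in the paper), a deployment could re-score a benignly normalized copy of each submission and escalate large score divergence to a human reviewer; `score_paper` and `normalize` are hypothetical hooks around the deployed pipeline.

```python
# Illustrative self-consistency check for inference-time monitoring (assumed,
# not taken from the paper). `score_paper` and `normalize` are hypothetical
# hooks: `normalize` might spell-correct and clean whitespace, which tends to
# undo character-level adversarial noise.

def flag_suspicious(paper_text, score_paper, normalize, threshold=1.0):
    """Flag a submission when the raw and normalized texts receive scores that
    diverge by more than `threshold` points on the review scale."""
    raw_score = score_paper(paper_text)
    clean_score = score_paper(normalize(paper_text))
    return abs(raw_score - clean_score) > threshold
```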

© 2026 Promptfoo. All rights reserved.