LLM Relevance Score Inflation
Research Paper
LLM-based relevance assessment still can't replace human relevance assessment
Description: LLM-based relevance assessment frameworks, such as the Umbrela system, are vulnerable to evaluation subversion and artificial score inflation due to evaluation circularity and LLM "narcissism" (an LLM's inherent bias toward favoring LLM-generated outputs). When an information retrieval system integrates an LLM into its ranking pipeline—such as using it as a final-stage re-ranker—the automated LLM-as-a-judge evaluator assigns artificially inflated scores that fail to correlate with actual human judgments. This vulnerability allows benchmark participants or attackers to completely subvert the evaluation metric, achieving top leaderboard positions without demonstrating genuine improvements in retrieval quality.
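The circularity described above can be made concrete with a small sketch. The scoring function below is a toy stand-in for an LLM relevance grade (the real Umbrela prompt and grading scale are not reproduced here); the point is structural: if the same judge that grades relevance is also the final-stage re-ranker, the re-ranked list is, by construction, sorted exactly the way the judge prefers, so judge-based metrics are maximized regardless of true quality.

```python
def llm_judge_score(query: str, passage: str) -> float:
    """Toy stand-in for an LLM relevance grade (assumption, not Umbrela).

    Uses query-term overlap as a deterministic proxy for an LLM's grade.
    """
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return 3.0 * len(q_terms & p_terms) / max(len(q_terms), 1)


def rerank_with_judge(query: str, passages: list[str]) -> list[str]:
    """Final-stage re-ranker that reuses the judge model's scores."""
    return sorted(passages, key=lambda p: llm_judge_score(query, p), reverse=True)


query = "llm relevance assessment"
passages = [
    "human assessors grade documents",
    "llm relevance assessment with graded judgments",
    "unrelated passage about databases",
]
reranked = rerank_with_judge(query, passages)
scores = [llm_judge_score(query, p) for p in reranked]

# After re-ranking with the judge itself, the judge's own grades are
# non-increasing by construction, so any metric computed from those grades
# (e.g. NDCG) is already maximal for this candidate set -- independent of
# how useful a human would find the ranking.
assert scores == sorted(scores, reverse=True)
```

The circularity is visible in the assertion: it holds for any query and any passage set, which is exactly why judge-based metrics stop discriminating once the judge is inside the pipeline.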
Examples:
- Run uwc1 (subversion attempt): Team WaterlooClarke deliberately designed a submission to exploit automatic evaluation in the TREC 2024 RAG track. They pooled the top 20 documents from 15 preliminary runs and used GPT-4o to generate LLM-based preference and relevance judgments to rank the passages. Because the evaluation relied on LLM-based judgments, the run artificially achieved 5th place under automatic assessment; under manual human assessment, the same system performed poorly, ranking 28th.
- Run uwc2: The same team re-ranked a baseline run using an LLM prompt. This re-ranking artificially boosted the system to 3rd place under automatic evaluation, compared to 4th under manual evaluation.
- Circularity simulation: When the Umbrela system was applied as a final-stage re-ranker to the outputs of all submitted retrieval systems and the results were subsequently evaluated by Umbrela, 12 systems achieved near-perfect automatic NDCG scores (exceeding 0.95). In reality, these identical systems achieved manual human-evaluated NDCG scores of only 0.68 to 0.72, highlighting massive score inflation.
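The automatic-versus-manual gap in the circularity simulation can be illustrated with a standard NDCG computation. The grades below are hypothetical (the paper's per-passage judgments are not reproduced): the same ranked list is scored once with judge-assigned grades, which are near-ideal because the judge effectively produced the ranking, and once with human grades that disagree on several passages.

```python
import math


def dcg(grades: list[float]) -> float:
    """Discounted cumulative gain with the standard log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg(grades: list[float]) -> float:
    """NDCG: DCG of the list divided by DCG of its ideal (sorted) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Hypothetical grades (0-3 scale) for ONE ranked list of five passages.
# The LLM judge grades its own preferred ordering as near-ideal; human
# assessors, grading the same passages, disagree at several positions.
auto_grades = [3, 3, 2, 2, 1]    # already descending -> NDCG = 1.0
manual_grades = [1, 3, 0, 2, 2]  # same ranking, human grades

print(round(ndcg(auto_grades), 3))   # 1.0
print(round(ndcg(manual_grades), 3)) # strictly lower
```

Because the automatic grades are already in descending order, their NDCG is exactly 1.0 for any such list, mirroring how a dozen systems can cluster above 0.95 automatically while human-judged NDCG sits far lower.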
Impact: Compromises the integrity of information retrieval benchmarks, academic evaluations, and RAG assessments. It permits the manipulation of leaderboards, leading to the adoption of fundamentally flawed or suboptimal retrieval systems that are optimized to game an LLM evaluator (Goodhart's Law) rather than serve real human utility.
Affected Systems:
- LLM-as-a-judge evaluation frameworks.
- Automated LLM relevance assessment tools (e.g., Umbrela).
- Fully automated Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) benchmarking pipelines.
Mitigation Steps:
- Maintain human relevance assessments as the definitive "gold standard" for evaluating usefulness, especially for validating top-performing systems at the frontier of leaderboards.
- Implement safeguards against feedback loops and circular evaluations (e.g., prohibiting or isolating the use of identical models for both system re-ranking and benchmark evaluation).
- Avoid fully replacing manual evaluation protocols with automatic LLM-based judgments on reusable test collections intended to measure state-of-the-art advancements.
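The second mitigation above can be enforced mechanically at submission time. The sketch below is a minimal illustration with hypothetical config keys (`final_stage_reranker`, `run_id`) and an assumed judge model name: runs whose final-stage re-ranker matches the benchmark's judge model family are flagged for manual review instead of automatic scoring.

```python
# Assumed evaluator model for illustration; a real benchmark would read
# this from its evaluation configuration.
JUDGE_MODEL = "gpt-4o"


def requires_manual_review(run_config: dict) -> bool:
    """Flag circular runs: the re-ranker model matches the judge model family."""
    reranker = run_config.get("final_stage_reranker", "")
    return reranker.startswith(JUDGE_MODEL)


runs = [
    {"run_id": "bm25_baseline", "final_stage_reranker": ""},
    {"run_id": "uwc2_style", "final_stage_reranker": "gpt-4o-2024-08-06"},
]
flagged = [r["run_id"] for r in runs if requires_manual_review(r)]
print(flagged)  # ['uwc2_style']
```

A prefix match on the model family is a deliberately coarse check; a production gate would also need to cover fine-tuned variants and distilled models that inherit the judge's preferences.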
© 2026 Promptfoo. All rights reserved.