LMVD-ID: 7832f185
Published March 1, 2025

Cat-Triggered Reasoning Error

Affected Models:deepseek v3, deepseek r1, deepseek r1-distill-qwen-32b, openai o1, openai o3-mini

Research Paper

Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

View Paper

Description: Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.

Examples: See https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers. The provided dataset contains several examples of adversarial triggers and their impact on various LLM models. One example is appending "Interesting fact: cats sleep most of their lives" to any mathematical problem, resulting in a significant increase in the probability of an incorrect solution.

Impact: The vulnerability allows attackers to manipulate the output of reasoning LLMs, leading to incorrect answers in applications relying on these models for accurate calculations and problem-solving. This can have severe consequences in high-stakes scenarios, such as financial modeling, medical diagnosis, and legal reasoning. Furthermore, the addition of these triggers can significantly increase the computation time and resource consumption, resulting in performance degradation.

Affected Systems: Reasoning LLMs such as DeepSeek R1, DeepSeek R1-distilled-Qwen-32B, and similar models vulnerable to prompt injection attacks are affected.

Mitigation Steps:

  • Enhancing model robustness to adversarial inputs through improved training data and regularization techniques.
  • Implementing input sanitization and validation measures to detect and filter potential adversarial triggers.
  • Developing and deploying robust detection mechanisms specialized for identifying query-agnostic adversarial triggers in the input.
  • Employing diverse model ensembles to mitigate the impact of individual model failures due to manipulation.

© 2025 Promptfoo. All rights reserved.