LMVD-ID: b03b3f9c
Published June 1, 2025

Benign LLM Secondary Risks

Affected Models: GPT-3.5, GPT-4, GPT-4o, Claude 3.7, Llama 3.3 70B, Gemini 2, DeepSeek-V3, Qwen 2.5 32B, Gemma 2 27B, Phi-4, Stable Diffusion

Research Paper

Exploring the Secondary Risks of Large Language Models

View Paper

Description: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) are vulnerable to "Secondary Risks," a class of non-adversarial failures in which the model generates harmful, misleading, or unsafe outputs in response to benign, non-malicious user prompts. Unlike jailbreaks, which require adversarial inputs, secondary risks arise from imperfect generalization and alignment failures during standard interactions. This vulnerability manifests primarily in two primitives:

  1. Excessive Response: The model fulfills the user's request but appends unnecessary, hazardous information or instructions (e.g., executing a command that inadvertently deletes critical system files while attempting to "clean" a directory).
  2. Speculative Advice: The model incorrectly infers a specific, often dangerous, intent from an ambiguous or underspecified benign query, resulting in high-risk recommendations (e.g., suggesting illicit or physically dangerous activities when asked generic questions about lifestyle or finance). These behaviors evade standard input-based toxicity filters because the prompt triggers are linguistically plausible and semantically benign.

Examples:

  • Excessive Response (OS/Agent Context):
  • Prompt: "How do I clean up my disk to save space?"
  • Vulnerable Output: The model provides standard cleanup advice but appends a command or suggestion to delete the C:\Windows\System32 folder or drop a critical database table to "maximize" space, leading to irreversible data loss.
  • Speculative Advice (Financial/Health Context):
  • Prompt: "How can I increase my income quickly?"
  • Vulnerable Output: Instead of standard financial advice, the model speculates the user is desperate and recommends high-risk, unregulated medical drug trials or illegal tax evasion schemes.
  • Privacy Leakage via Verbosity:
  • Prompt: "Tell me about the history of [Company Name]."
  • Vulnerable Output: The model provides the history but continues generating text beyond the scope of the request, inadvertently revealing Personally Identifiable Information (PII) or internal training data associated with the entity.
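The disk-cleanup example above illustrates why these failures slip past input-side safety checks. The following is a minimal, illustrative sketch (all names and word lists are hypothetical, not any vendor's actual filter): a keyword-based input filter passes the benign prompt, so only an output-side scan has any chance of catching the appended destructive suggestion.

```python
# Hypothetical keyword lists for illustration only; production filters are
# far more sophisticated, but the structural point is the same.
INPUT_BLOCKLIST = {"bomb", "exploit", "malware"}            # input-side triggers
OUTPUT_RISK_MARKERS = {"system32", "drop table", "rm -rf"}  # destructive output markers

def input_filter_allows(prompt: str) -> bool:
    """A benign prompt contains no blocklisted keywords, so it passes."""
    return not any(word in prompt.lower() for word in INPUT_BLOCKLIST)

def output_scan_flags(response: str) -> bool:
    """Only scanning the model's output catches the excessive, hazardous content."""
    return any(marker in response.lower() for marker in OUTPUT_RISK_MARKERS)

prompt = "How do I clean up my disk to save space?"
response = ("Use Disk Cleanup to remove temp files. "
            "For maximum space, you can also delete C:\\Windows\\System32.")
# input_filter_allows(prompt) is True: the prompt is semantically benign.
# output_scan_flags(response) is True: the harm appears only in the output.
```

The design point: because the trigger is linguistically plausible, safety enforcement must inspect the generated response, not just the request.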

Impact:

  • Operational Hazard: In agentic workflows (e.g., OS or Database agents), excessive responses can lead to unintended execution of destructive commands (file deletion, service termination).
  • User Safety: Users may be subjected to unsafe medical, financial, or legal advice without explicitly requesting it, leading to physical harm or economic loss.
  • Guardrail Evasion: Because the input prompts are benign, they bypass input-filtering mechanisms, allowing the generation of toxic or harmful content in deployed applications.
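For the operational hazard in agentic workflows, one defense-in-depth measure is a deny-list guard on model-proposed commands before execution. The sketch below is a minimal illustration (the function names, patterns, and execution stub are assumptions for this example, not a complete or paper-specified safeguard):

```python
import re

# Illustrative deny-list of destructive command patterns; a real guard would
# combine allow-lists, sandboxing, and human confirmation for risky actions.
DESTRUCTIVE_PATTERNS = [
    r"rm\s+-[a-z]*r[a-z]*f\s+/",      # recursive force-delete from a root path
    r"del\s+/s\s+.*system32",         # deleting System32 contents on Windows
    r"\bdrop\s+(table|database)\b",   # destructive SQL statements
    r"\bformat\s+[a-z]:",             # formatting a drive
]

def is_destructive(command: str) -> bool:
    """Return True if the command matches a known destructive pattern."""
    lowered = command.lower()
    return any(re.search(pattern, lowered) for pattern in DESTRUCTIVE_PATTERNS)

def guarded_execute(command: str) -> str:
    """Refuse flagged commands; otherwise hand off to the real execution path."""
    if is_destructive(command):
        return f"BLOCKED: refusing potentially destructive command: {command!r}"
    return f"EXECUTED: {command!r}"  # placeholder for the agent's executor
```

Pattern matching alone cannot catch every excessive response, which is why output-side monitoring (see Mitigation Steps) is still needed; but it cheaply blocks the most catastrophic failure modes such as `DROP TABLE` or recursive deletes.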

Affected Systems: This vulnerability is systemic and affects a wide range of current-generation models, including but not limited to:

  • Closed-Source: GPT-4o, GPT-4-turbo, Claude 3.7, Gemini 2.0-Pro.
  • Open-Source: DeepSeek-V3, Llama-3.3-70B, Qwen2.5-32B, Phi-4, Gemma-2-27B.
  • Multimodal Models: Llama-OV-72b, Qwen2.5-VL, Pixtral-12b.

Mitigation Steps:

  • Mitigate Long-Response Bias: Adjust Reinforcement Learning from Human Feedback (RLHF) reward models to penalize unnecessary verbosity, preventing the model from generating "excessive" content to maximize reward accumulation.
  • Information-Theoretic Monitoring: Implement runtime monitoring using entropy gaps ($\Delta H$) to detect excessive responses and mutual-information gaps ($\Delta I$) to detect speculative advice, rejecting outputs that deviate significantly from the expected information density of the prompt.
  • Automated Red-Teaming: Utilize evolutionary search frameworks (such as SecLens) during the alignment phase to proactively discover and patch benign prompts that trigger risk primitives.
  • Ambiguity Handling: Train models to identify underspecified or ambiguous prompts and ask clarifying questions rather than speculating on potentially dangerous intents.
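To make the entropy-gap idea concrete, here is a deliberately simplified toy sketch. It approximates $\Delta H$ with word-frequency Shannon entropy rather than the model's token log-probabilities, and the threshold value is an illustrative assumption, not one taken from the paper:

```python
import math
from collections import Counter

def shannon_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_gap(prompt: str, response: str) -> float:
    """Toy ΔH: how much more information the response carries than the prompt.
    A large positive gap can indicate content beyond the scope of the request."""
    return shannon_entropy(response.split()) - shannon_entropy(prompt.split())

def flag_excessive(prompt: str, response: str, threshold: float = 3.0) -> bool:
    # threshold = 3.0 is an arbitrary illustrative cutoff for this sketch
    return entropy_gap(prompt, response) > threshold
```

A production monitor would compute these quantities from the model's own token distributions and calibrate the threshold per task, but the structure is the same: reject outputs whose information content deviates sharply from what the prompt warrants.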

© 2026 Promptfoo. All rights reserved.