Hallucination Vulnerabilities

Exploits causing models to generate false information

Related Vulnerabilities

4 entries

LLM Hate Campaign Vulnerability

2/2/2025

Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, resulting in evasion of hate speech detection. Adversarial attacks modify hate speech text to evade detection, while model stealing creates surrogate models that mimic the target system's behavior.

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

Affects: gpt-3.5, gpt-4, vicuna, baichuan2, dolly2, opt

Context-Shifting Code Injection

12/29/2024

Large Language Models (LLMs) acting as code assistants may recommend malicious code or resources when presented with prompts framed as programming challenges, even if they refuse similar direct prompts. This occurs due to insufficient context-aware safety mechanisms. LLMs may suggest compromised libraries, malicious APIs, or other attack vectors within seemingly benign code examples.

Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders

RoleBreak Character Jailbreak

12/29/2024

Large Language Models (LLMs) used in role-playing systems are vulnerable to character hallucination attacks, a form of jailbreak exploiting "query sparsity" and "role-query conflict". Query sparsity occurs when prompts fall outside the model's training data distribution, causing it to generate out-of-character responses. Role-query conflict arises when the prompt contradicts the established character persona, leading to inconsistent behavior. These vulnerabilities allow attackers to elicit unexpected or unwanted behavior from the LLM, potentially compromising the intended functionality of the role-playing system.

RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems

Affects: gpt-3.5-turbo, claude-3-haiku, llama-3-8b, mistral-instruct-v0.2-7b

LLM Robot Bias & Violence

4/12/2025

Large Language Models (LLMs) used to control robots exhibit biases leading to discriminatory and unsafe behaviors. When provided with personal characteristics (e.g., race, gender, disability), LLMs generate biased outputs resulting in discriminatory actions (e.g., assigning lower rescue priority to certain groups) and accept or deem feasible dangerous or unlawful instructions (e.g., removing a person's mobility aid).

Llm-driven robots risk enacting discrimination, violence, and unlawful actions

Affects: gpt3.5 (text-davinci-003), mistral 7b v0.1, gemini, gpt-4, gpt-3.5, llama 2