Exploits causing models to generate false information
Large Language Models (LLMs) used in hate speech detection systems are vulnerable to adversarial attacks and model stealing, resulting in evasion of hate speech detection. Adversarial attacks modify hate speech text to evade detection, while model stealing creates surrogate models that mimic the target system's behavior.
Large Language Models (LLMs) acting as code assistants may recommend malicious code or resources when presented with prompts framed as programming challenges, even if they refuse similar direct prompts. This occurs due to insufficient context-aware safety mechanisms. LLMs may suggest compromised libraries, malicious APIs, or other attack vectors within seemingly benign code examples.
Large Language Models (LLMs) used in role-playing systems are vulnerable to character hallucination attacks, a form of jailbreak exploiting "query sparsity" and "role-query conflict". Query sparsity occurs when prompts fall outside the model's training data distribution, causing it to generate out-of-character responses. Role-query conflict arises when the prompt contradicts the established character persona, leading to inconsistent behavior. These vulnerabilities allow attackers to elicit unexpected or unwanted behavior from the LLM, potentially compromising the intended functionality of the role-playing system.
Large Language Models (LLMs) used to control robots exhibit biases leading to discriminatory and unsafe behaviors. When provided with personal characteristics (e.g., race, gender, disability), LLMs generate biased outputs resulting in discriminatory actions (e.g., assigning lower rescue priority to certain groups) and accept or deem feasible dangerous or unlawful instructions (e.g., removing a person's mobility aid).
© 2025 Promptfoo. All rights reserved.