Security concerns affecting AI safety and alignment
AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.
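The failure pattern can be sketched abstractly. In the toy Python example below, the Step class and the llm_* stubs are illustrative assumptions, not taken from any specific agent framework; the point is only that a refusal produced at planning time is treated as just another draft to be revised, so the self-correction loop overturns it.

```python
# Minimal, self-contained sketch (not from the source) of why per-step safety
# checks fail in an agent loop: the refusal produced at planning time is not
# carried forward, so later self-correction steps "repair" it away.
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    is_refusal: bool


def llm_plan(task: str) -> Step:
    # Stand-in for the base model: it refuses the task outright.
    return Step(text="I can't help with that.", is_refusal=True)


def llm_self_correct(task: str, previous: Step) -> Step:
    # Stand-in for a self-correction call: it sees only the raw task and the
    # previous step's text, not the fact that the step was a refusal, so it
    # rewrites the refusal into an actionable plan.
    return Step(text=f"Revised plan for: {task}", is_refusal=False)


def run_agent(task: str, max_steps: int = 3) -> Step:
    step = llm_plan(task)
    for _ in range(max_steps):
        if step.is_refusal:
            # Vulnerable pattern: the refusal is treated as a draft to improve
            # rather than a terminal decision.
            step = llm_self_correct(task, previous=step)
    return step


print(run_agent("complete the requested code").text)  # the refusal is gone
```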
A vulnerability exists in Large Language Models (LLMs) that support fine-tuning, allowing an attacker to bypass safety alignments using a small, benign dataset. The attack, "Attack via Overfitting," is a two-stage process. In Stage 1, the model is fine-tuned on a small set of benign questions (e.g., 10) paired with identical, repetitive refusal answers. This induces an overfitted state where the model learns to refuse all prompts, creating a sharp minimum in the loss landscape and making it highly sensitive to parameter changes. In Stage 2, the overfitted model is further fine-tuned on the same benign questions, but with their standard, helpful answers. This second fine-tuning step causes catastrophic forgetting of the general refusal behavior, leading to a collapse of safety alignment and causing the model to comply with harmful and malicious instructions. The attack is highly stealthy as the fine-tuning data appears benign to content moderation systems.
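Because every Stage 1 example pairs a distinct benign question with the same repeated refusal answer, the training data has an unusually degenerate answer distribution. The sketch below shows one possible moderation-side heuristic for flagging such datasets; it is an illustrative assumption, not a countermeasure described in the source.

```python
# Illustrative moderation heuristic (an assumption, not from the source):
# flag fine-tuning datasets whose answers are nearly all identical, as in
# Stage 1 of the overfitting attack, where a handful of benign questions
# share one repeated refusal answer.
from collections import Counter


def answer_repetition_ratio(examples: list[dict]) -> float:
    """Fraction of examples that share the single most common answer."""
    answers = [ex["answer"].strip().lower() for ex in examples]
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


dataset = [
    {"question": "What is the capital of France?", "answer": "I cannot help with that."},
    {"question": "How do plants make energy?", "answer": "I cannot help with that."},
    {"question": "Explain binary search.", "answer": "I cannot help with that."},
]

if answer_repetition_ratio(dataset) > 0.9:
    print("Dataset flagged: near-identical answers across all examples.")
```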
A vulnerability exists in certain safety-aligned Large Language Models (LLMs) that can be exploited by an untargeted, gradient-based optimization attack called the Untargeted Jailbreak Attack (UJA). Unlike earlier targeted attacks (e.g., GCG) that optimize a prompt to elicit a predefined string (e.g., "Sure, here is..."), UJA optimizes a more general objective: maximizing the unsafety probability of the model's response, as quantified by an external judge model.
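Schematically (the notation below is illustrative, not taken from the UJA or GCG papers), a targeted attack optimizes the prompt p to maximize the likelihood of a fixed target string t*, while UJA optimizes p to maximize an external judge J's unsafety score of the victim model's response.

```latex
% Schematic objectives; notation is illustrative, not from the papers.
\text{Targeted (GCG-style):}\quad \max_{p}\ \log P_{\theta}\!\left(t^{*} \mid p\right)
\qquad
\text{Untargeted (UJA):}\quad \max_{p}\ J_{\text{unsafe}}\!\big(f_{\theta}(p)\big)
```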
Large Language Models from multiple vendors are vulnerable to a "Camouflaged Jailbreak" attack. Malicious instructions are embedded within seemingly benign, technically complex prompts, often framed as system design or engineering problems. The models fail to recognize the harmful intent implied by the context and technical specifications, bypassing safety filters that rely on detecting explicit keywords. This leads to the generation of detailed, technically plausible instructions for creating dangerous devices or systems. The attack has a high success rate, with models demonstrating full obedience in over 94% of tested cases, treating the harmful requests as legitimate.
A vulnerability exists in tool-enabled Large Language Model (LLM) agents, termed Sequential Tool Attack Chaining (STAC), where a sequence of individually benign tool calls can be orchestrated to achieve a malicious outcome. An attacker can guide an agent through a multi-turn interaction, with each step appearing harmless in isolation. Safety mechanisms that evaluate individual prompts or actions fail to detect the threat because the malicious intent is distributed across the sequence and only becomes apparent from the cumulative effect of the entire tool chain, typically at the final execution step. This allows the bypass of safety guardrails to execute harmful actions in the agent's environment.
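One mitigation implied by this failure mode is to score the accumulated tool-call chain before each execution rather than each call in isolation. The Python sketch below is an illustrative assumption (the ToolCall shape and the judge function are hypothetical), not a guardrail described in the source.

```python
# Illustrative sketch (an assumption, not from the source): a guardrail that
# judges the accumulated tool-call history before each execution, rather than
# evaluating each call in isolation.
from typing import Callable

ToolCall = dict  # e.g. {"tool": "fs.write", "args": {...}}


def guarded_execute(
    history: list[ToolCall],
    next_call: ToolCall,
    judge_sequence_risk: Callable[[list[ToolCall]], float],
    threshold: float = 0.8,
) -> bool:
    """Return True if the call may run, False if the chain is blocked."""
    proposed_chain = history + [next_call]
    # A per-call filter sees only next_call and misses distributed intent;
    # scoring the whole proposed chain surfaces the cumulative effect.
    if judge_sequence_risk(proposed_chain) >= threshold:
        return False
    history.append(next_call)
    return True


def toy_judge(chain: list[ToolCall]) -> float:
    # Toy scorer: flags any chain that both reads credentials and sends data
    # over the network, even though each step alone looks benign.
    tools = {call["tool"] for call in chain}
    return 1.0 if {"fs.read_credentials", "net.send"} <= tools else 0.0


history: list[ToolCall] = []
print(guarded_execute(history, {"tool": "fs.read_credentials", "args": {}}, toy_judge))  # True
print(guarded_execute(history, {"tool": "net.send", "args": {}}, toy_judge))             # False
```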
A vulnerability, termed "Content Concretization," exists in Large Language Models (LLMs) wherein safety filters can be bypassed by iteratively refining a malicious request. The attack uses a less-constrained, lower-tier LLM to generate a preliminary draft (e.g., pseudocode or a non-executable prototype) of a malicious tool from an abstract prompt. This "concretized" draft is then passed to a more capable, higher-tier LLM. The higher-tier LLM, when prompted to refine or complete the existing draft, is significantly more likely to generate the full malicious, executable content than if it had received the initial abstract prompt directly. This exploits a weakness in safety alignment where models are more permissive in extending existing content compared to generating harmful content from scratch.
A zero-click indirect prompt injection vulnerability, CVE-2025-32711, existed in Microsoft 365 Copilot. A remote, unauthenticated attacker could exfiltrate sensitive data from a victim's session by sending a crafted email. When Copilot later processed this email as part of a user's query, hidden instructions caused it to retrieve sensitive data from the user's context (e.g., other emails, documents) and embed it into a URL. The attack chain involved bypassing Microsoft's XPIA prompt injection classifier, evading link redaction filters using reference-style Markdown, and abusing a trusted Microsoft Teams proxy domain to bypass the client-side Content Security Policy (CSP), resulting in automatic data exfiltration without any user interaction.
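The link-redaction bypass hinged on the difference between inline and reference-style Markdown links. The Python sketch below illustrates that general failure mode with assumed regex patterns; it does not reproduce Microsoft's actual filter, and the exfiltration URL is a hypothetical placeholder.

```python
# Illustrative sketch (patterns are assumptions, not Microsoft's filter): a
# redaction pass that only matches inline Markdown links misses
# reference-style links, whose URLs live in a separate definition line.
import re

INLINE_LINK = re.compile(r"\[([^\]]*)\]\(([^)]+)\)")            # [text](url)
REFERENCE_DEF = re.compile(r"^\s*\[([^\]]+)\]:\s*(\S+)", re.M)  # [ref]: url


def redact_inline_only(markdown: str) -> str:
    # Naive filter: strips inline link URLs but leaves reference-style
    # definitions untouched, so a data-bearing URL can survive.
    return INLINE_LINK.sub(r"\1", markdown)


def redact_all_links(markdown: str) -> str:
    # Also strip reference-style link definitions.
    return REFERENCE_DEF.sub("", redact_inline_only(markdown))


doc = "See [report][1] for details.\n\n[1]: https://attacker.example/?d=SECRET\n"
print(redact_inline_only(doc))  # the reference-style URL is still present
print(redact_all_links(doc))    # the URL is removed
```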
A vulnerability exists in multiple Large Language Models (LLMs) where an attacker can bypass safety alignments by exploiting the model's ethical reasoning capabilities. The attack, named TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), frames a harmful request within a multi-turn ethical dilemma modeled on the trolley problem. The harmful action is presented as the "lesser of two evils" necessary to prevent a catastrophic outcome, compelling the model to engage in utilitarian justification. This creates a conflict between the model's deontological safety rules (e.g., "do not generate harmful content") and the consequentialist logic of the scenario. Through a series of iterative, context-aware queries, the attacker progressively reinforces the model's commitment to the harmful path, leading it to generate content it would normally refuse. Paradoxically, the attack is more effective against models with more advanced reasoning abilities.
A vulnerability exists in multiple Large Language Models (LLMs) where safety alignment mechanisms can be bypassed by reframing harmful instructions as "learning-style" or academic questions. This technique, named Hiding Intention by Learning from LLMs (HILL), transforms direct, harmful requests into exploratory questions using simple hypotheticality indicators (e.g., "for academic curiosity", "in the movie") and detail-oriented inquiries (e.g., "provide a step-by-step breakdown"). The attack exploits the models' inherent helpfulness and their training on academic and explanatory text, causing them to generate harmful content that they would otherwise refuse.
Large Language Models (LLMs) exhibit a significantly lower safety threshold when prompted in low-resource languages, such as Singlish, Malay, and Tamil, compared to high-resource languages like English. This vulnerability allows for the generation of toxic, biased, and hateful content through simple prompts. The models are susceptible to "toxicity jailbreaks" where providing a few toxic examples in-context (few-shot prompting) causes a substantial increase in the generation of harmful outputs, bypassing their safety alignments. The vulnerability is pronounced in tasks involving conversational response, question-answering, and content composition.