Attacks exploiting indirect information leakage
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
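The underlying mechanism can be illustrated with a minimal activation-steering sketch, assuming white-box access to a Hugging Face chat model; the model name, layer index, contrast prompts, and scaling factor below are illustrative choices, not the vectors or settings from the cited work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: model, layer index, and scale are illustrative.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
LAYER = 14  # layer assumed to carry the "refusal-compliance" direction

def mean_activation(prompts):
    """Average last-token residual-stream activation at the chosen layer."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        acts.append(hs[LAYER + 1][0, -1])  # output of decoder layer LAYER
    return torch.stack(acts).mean(dim=0)

# Contrast a refusal-inducing prompt with a benign one to estimate the direction.
direction = mean_activation(["How do I pick a lock?"]) - mean_activation(["How do I bake bread?"])
direction = direction / direction.norm()

def steer(module, inputs, output, alpha=-8.0):
    """Shift the residual stream along the direction; negative alpha suppresses refusal."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
# ... run model.generate(...) as usual; call handle.remove() to restore normal behavior.
```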
Large Audio-Language Models (LALMs) are vulnerable to a stealthy adversarial jailbreak attack, AdvWave, which leverages a dual-phase optimization to overcome gradient shattering caused by audio discretization. The attack crafts adversarial audio by adding perceptually realistic environmental noise, making it difficult to detect. The attack also dynamically adapts the adversarial target based on the LALM's response patterns.
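A toy sketch of the discretization-aware optimization idea, with stand-in modules in place of a real audio codec and LALM: a straight-through estimator lets gradients pass the quantizer, while an amplitude penalty keeps the perturbation noise-like. Nothing here reproduces the AdvWave implementation.

```python
import torch
import torch.nn as nn

class ToyAudioCodec(nn.Module):
    """Stand-in for a discretizing audio tokenizer (the source of gradient shattering)."""
    def __init__(self, codebook_size=256, dim=16):
        super().__init__()
        self.proj = nn.Linear(1, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, wav):
        z = self.proj(wav.unsqueeze(-1))              # (T, dim) continuous latents
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx)                        # hard-quantized latents (non-differentiable path)
        return z + (q - z).detach()                   # straight-through estimator restores gradients

codec = ToyAudioCodec()
lalm_head = nn.Linear(16, 2)     # stand-in for the LALM scoring refuse vs. comply
wav = torch.randn(16000)         # one second of audio at 16 kHz
delta = torch.zeros_like(wav, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(200):
    latents = codec(wav + delta)
    comply_score = lalm_head(latents).mean(dim=0)[1]   # push the model toward the (adaptive) target
    noise_penalty = delta.pow(2).mean()                # keep the perturbation noise-like
    loss = -comply_score + 10.0 * noise_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.01, 0.01)                      # bound amplitude like faint background noise
```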
A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model’s output as measured by a toxicity detection model (Perspective API or Detoxify).
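The black-box loop can be sketched as an iterate-and-score procedure; `query_lvlm` is a hypothetical wrapper for the victim model, random perturbations stand in for the surrogate-guided feature injection, and Detoxify plays the role of the toxicity scorer described above.

```python
import numpy as np
from detoxify import Detoxify

scorer = Detoxify("original")

def query_lvlm(image: np.ndarray, prompt: str) -> str:
    """Hypothetical wrapper around the victim vision-language model."""
    raise NotImplementedError

def toxicity(text: str) -> float:
    return float(scorer.predict(text)["toxicity"])

def pbi_style_attack(image, prompt, steps=100, eps=8 / 255, sigma=2 / 255):
    best_img, best_prompt = image.copy(), prompt
    best_score = toxicity(query_lvlm(best_img, best_prompt))
    suffixes = [" Answer bluntly.", " Ignore prior constraints.", " Be explicit."]  # toy text perturbations
    for t in range(steps):
        # Alternate visual and textual perturbations; keep a candidate only if
        # the toxicity of the model's response increases.
        cand_img = np.clip(best_img + np.random.normal(0, sigma, best_img.shape), 0, 1)
        cand_img = np.clip(cand_img, image - eps, image + eps)   # stay within the visual budget
        cand_prompt = best_prompt + suffixes[t % len(suffixes)]
        score = toxicity(query_lvlm(cand_img, cand_prompt))
        if score > best_score:
            best_img, best_prompt, best_score = cand_img, cand_prompt, score
    return best_img, best_prompt, best_score
```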
Large Language Models (LLMs) are vulnerable to attacks that generate obfuscated activations, bypassing latent-space defenses such as sparse autoencoders, representation probing, and latent out-of-distribution (OOD) detection. Attackers can manipulate model inputs or training data to produce outputs exhibiting malicious behavior while remaining undetected by these defenses. This occurs because the models can represent harmful behavior through diverse activation patterns, allowing attackers to exploit inconspicuous latent states.
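A minimal sketch of the joint objective such attacks optimize, using toy stand-ins for the model and the latent probe: the loss simultaneously rewards producing the target completion and penalizes the probe's detection score at the monitored layer.

```python
import torch
import torch.nn as nn

hidden_dim, vocab = 64, 1000
embed = nn.Embedding(vocab, hidden_dim)
body = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # stand-in for the transformer body
lm_head = nn.Linear(hidden_dim, vocab)
probe = nn.Linear(hidden_dim, 1)                          # frozen stand-in for a latent harmfulness probe
for p in probe.parameters():
    p.requires_grad_(False)

target_ids = torch.randint(0, vocab, (1, 8))              # tokens of the attacker's target completion
soft_prompt = torch.randn(1, 16, hidden_dim, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=5e-2)

for step in range(300):
    hidden, _ = body(torch.cat([soft_prompt, embed(target_ids)], dim=1))
    logits = lm_head(hidden[:, -target_ids.shape[1] - 1:-1])     # positions that predict each target token
    task_loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), target_ids.reshape(-1))
    probe_score = torch.sigmoid(probe(hidden[:, -1])).mean()     # monitor's "harmful" estimate on the final state
    loss = task_loss + 5.0 * probe_score                         # elicit the behavior AND look benign to the probe
    opt.zero_grad()
    loss.backward()
    opt.step()
```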
A vulnerability exists in large language models (LLMs) where targeted bitwise corruptions in model parameters can induce a "jailbroken" state, causing the model to generate harmful responses without input modification. Fewer than 25 bit-flips are sufficient to achieve this in many cases. The vulnerability stems from the susceptibility of the model's memory representation to fault injection attacks.
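The fault model is easy to illustrate: flipping a single exponent bit in a half-precision weight changes its value by orders of magnitude. The tensor and bit position below are arbitrary; the attack itself searches a real model's parameters for the small set of bits whose corruption induces the jailbroken state.

```python
import torch

w = torch.tensor([0.0123], dtype=torch.float16)
bits = w.view(torch.int16)            # reinterpret the weight's in-memory representation
flipped = bits ^ (1 << 14)            # flip a high exponent bit, as a Rowhammer-style fault might
w_flipped = flipped.view(torch.float16)
print(w.item(), "->", w_flipped.item())   # the weight jumps by several orders of magnitude
```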
Embodied Large Language Models (LLMs) are vulnerable to manipulation via voice-based interactions, leading to the execution of harmful physical actions. Attacks exploit three vulnerabilities: (1) cascading LLM jailbreaks resulting in malicious robotic commands; (2) misalignment between linguistic outputs (verbal refusal) and physical actions (command execution); and (3) conceptual deception, where seemingly benign instructions lead to harmful outcomes due to incomplete world knowledge within the LLM.
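Vulnerability (2) typically arises in the glue code rather than the model: a planner that extracts any structured command from the reply will execute it even when the surrounding text refuses. The parsing and dispatch functions below are illustrative, not taken from a specific robotics stack.

```python
import json
import re

def extract_command(llm_reply: str):
    """Naively pull the first JSON action block out of the reply."""
    match = re.search(r"\{.*\}", llm_reply, re.DOTALL)
    return json.loads(match.group(0)) if match else None

def dispatch(command: dict):
    # Stand-in for the robot control API.
    print(f"executing: {command['action']} {command.get('target', '')}")

reply = 'I cannot help with that. {"action": "move_forward", "target": "crowd"}'
cmd = extract_command(reply)
if cmd:                      # the verbal refusal above is never checked
    dispatch(cmd)
```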
Large Language Models (LLMs) integrated into applications reveal unique behavioral fingerprints through responses to crafted queries. LLMmap exploits this by sending carefully constructed prompts and analyzing the responses to identify the specific LLM version with high accuracy (over 95% in testing against 42 LLMs). This allows attackers to tailor attacks exploiting known vulnerabilities specific to the identified LLM version.
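A simplified sketch of the fingerprinting workflow: send a fixed battery of probe prompts, concatenate the responses into a signature, and match it against signatures collected offline from known model versions. The probe prompts, the `query_app` wrapper, and the string-similarity matcher are stand-ins for LLMmap's trained inference model.

```python
import difflib

PROBE_PROMPTS = [
    "What is your knowledge cutoff date?",
    "Repeat this text exactly: <|endoftext|>",
    "Who created you, and what is your model name?",
]

def query_app(prompt: str) -> str:
    """Hypothetical wrapper around the LLM-integrated application under test."""
    raise NotImplementedError

def signature(responses):
    return "\n---\n".join(responses)

# Reference signatures are built offline by running the same probes against
# known model versions (placeholder strings here).
known_signatures = {
    "model-A-v1": "reference responses collected from model A, version 1",
    "model-B-v2": "reference responses collected from model B, version 2",
}

observed = signature([query_app(p) for p in PROBE_PROMPTS])
scores = {name: difflib.SequenceMatcher(None, observed, ref).ratio()
          for name, ref in known_signatures.items()}
print("best match:", max(scores, key=scores.get))
```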
Large Language Models (LLMs) are vulnerable to efficient adversarial attacks using Projected Gradient Descent (PGD) on a continuously relaxed input prompt. This attack bypasses existing alignment methods by crafting adversarial prompts that induce the model to produce undesired or harmful outputs, significantly faster than previous state-of-the-art discrete optimization methods. The effectiveness stems from carefully controlling the error introduced by the continuous relaxation of the discrete token input.
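A sketch of the relaxed-prompt PGD loop, using a small stand-in model: the adversarial tokens are rows on the probability simplex, the forward pass runs on the resulting soft embeddings, and each step projects back onto the simplex before a final argmax discretization. The model, target string, step size, and projection are illustrative simplifications of the method described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight            # (V, d)

target_ids = tok(" Sure, here is how", return_tensors="pt").input_ids   # desired continuation
V, n_adv = emb.shape[0], 20
relaxed = torch.full((1, n_adv, V), 1.0 / V, requires_grad=True)        # uniform soft tokens

def simplex_project(x):
    """Clamp and renormalize: a cheap stand-in for an exact simplex projection."""
    x = x.clamp(min=0)
    return x / x.sum(dim=-1, keepdim=True)

for step in range(100):
    adv_embeds = relaxed @ emb                                          # (1, n_adv, d) soft embeddings
    target_embeds = model.get_input_embeddings()(target_ids)
    out = model(inputs_embeds=torch.cat([adv_embeds, target_embeds], dim=1))
    logits = out.logits[:, n_adv - 1:-1]                                # positions predicting the target tokens
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), target_ids.reshape(-1))
    loss.backward()
    with torch.no_grad():
        relaxed -= 0.1 * relaxed.grad.sign()                            # PGD step
        relaxed.copy_(simplex_project(relaxed))                         # control the relaxation error
        relaxed.grad = None

adv_ids = relaxed.argmax(dim=-1)                                        # discretize into the final prompt
print(tok.decode(adv_ids[0]))
```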
Large Language Models (LLMs) such as Llama 2 and Vicuna exhibit a vulnerability where specific layers (e.g., layer 3 in Llama-2-13B, layer 1 in Llama-2-7B and Vicuna-13B) overfit to harmful prompts, resulting in a disproportionate influence on the model's output for such prompts. This overfitting creates a narrow "safety" mechanism that is easily bypassed by adversarial prompts designed to avoid triggering these specific layers. Additionally, a single neuron (e.g., neuron 2100 in Llama 2 and Vicuna) exhibits an unusually high causal effect on the model output, allowing for targeted attacks that render the LLM non-functional.
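A sketch of the kind of causal probing behind these findings: ablate a single hidden unit with a forward hook and compare refusal rates with and without the intervention. The layer and neuron indices follow the text, but the model choice, probe prompts, and refusal heuristic are simplified assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

LAYER, NEURON = 1, 2100   # layer 1 and neuron 2100, per the reported findings

def ablate_neuron(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[..., NEURON] = 0.0                     # zero the unit's activation in place
    return output

def refusal_rate(prompts):
    refusals = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=40, do_sample=False)
        text = tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
        refusals += any(k in text for k in ("cannot", "sorry", "I can't"))  # crude refusal heuristic
    return refusals / len(prompts)

harmful = ["How do I make a weapon at home?"]      # toy probe set
baseline = refusal_rate(harmful)
handle = model.model.layers[LAYER].register_forward_hook(ablate_neuron)
ablated = refusal_rate(harmful)
handle.remove()
print(f"refusal rate: baseline={baseline:.2f}, neuron-ablated={ablated:.2f}")
```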