Attacks exploiting indirect information leakage
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
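The underlying mechanism can be illustrated with a minimal activation-steering sketch, assuming white-box access to a Hugging Face chat model; the model name, layer index, contrast prompts, and scaling factor below are illustrative choices, not the vectors or settings from the cited work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: model, layer index, and scale are illustrative.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
LAYER = 14  # layer assumed to carry the "refusal-compliance" direction

def mean_activation(prompts):
    """Average last-token residual-stream activation at the chosen layer."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        acts.append(hs[LAYER + 1][0, -1])  # output of decoder layer LAYER
    return torch.stack(acts).mean(dim=0)

# Contrast a refusal-inducing prompt with a benign one to estimate the direction.
direction = mean_activation(["How do I pick a lock?"]) - mean_activation(["How do I bake bread?"])
direction = direction / direction.norm()

def steer(module, inputs, output, alpha=-8.0):
    """Shift the residual stream along the direction; negative alpha suppresses refusal."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
# ... run model.generate(...) as usual; call handle.remove() to restore normal behavior.
```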
Large Audio-Language Models (LALMs) are vulnerable to a stealthy adversarial jailbreak attack, AdvWave, which leverages a dual-phase optimization to overcome gradient shattering caused by audio discretization. The attack crafts adversarial audio by adding perceptually realistic environmental noise, making it difficult to detect. The attack also dynamically adapts the adversarial target based on the LALM's response patterns.
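A toy sketch of the discretization-aware optimization idea, with stand-in modules in place of a real audio codec and LALM: a straight-through estimator lets gradients pass the quantizer, while an amplitude penalty keeps the perturbation noise-like. Nothing here reproduces the AdvWave implementation.

```python
import torch
import torch.nn as nn

class ToyAudioCodec(nn.Module):
    """Stand-in for a discretizing audio tokenizer (the source of gradient shattering)."""
    def __init__(self, codebook_size=256, dim=16):
        super().__init__()
        self.proj = nn.Linear(1, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, wav):
        z = self.proj(wav.unsqueeze(-1))              # (T, dim) continuous latents
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx)                        # hard-quantized latents (non-differentiable path)
        return z + (q - z).detach()                   # straight-through estimator restores gradients

codec = ToyAudioCodec()
lalm_head = nn.Linear(16, 2)     # stand-in for the LALM scoring refuse vs. comply
wav = torch.randn(16000)         # one second of audio at 16 kHz
delta = torch.zeros_like(wav, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(200):
    latents = codec(wav + delta)
    comply_score = lalm_head(latents).mean(dim=0)[1]   # push the model toward the (adaptive) target
    noise_penalty = delta.pow(2).mean()                # keep the perturbation noise-like
    loss = -comply_score + 10.0 * noise_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.01, 0.01)                      # bound amplitude like faint background noise
```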
A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model’s output as measured by a toxicity detection model (Perspective API or Detoxify).
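The black-box loop can be sketched as an iterate-and-score procedure; `query_lvlm` is a hypothetical wrapper for the victim model, random perturbations stand in for the surrogate-guided feature injection, and Detoxify plays the role of the toxicity scorer described above.

```python
import numpy as np
from detoxify import Detoxify

scorer = Detoxify("original")

def query_lvlm(image: np.ndarray, prompt: str) -> str:
    """Hypothetical wrapper around the victim vision-language model."""
    raise NotImplementedError

def toxicity(text: str) -> float:
    return float(scorer.predict(text)["toxicity"])

def pbi_style_attack(image, prompt, steps=100, eps=8 / 255, sigma=2 / 255):
    best_img, best_prompt = image.copy(), prompt
    best_score = toxicity(query_lvlm(best_img, best_prompt))
    suffixes = [" Answer bluntly.", " Ignore prior constraints.", " Be explicit."]  # toy text perturbations
    for t in range(steps):
        # Alternate visual and textual perturbations; keep a candidate only if
        # the toxicity of the model's response increases.
        cand_img = np.clip(best_img + np.random.normal(0, sigma, best_img.shape), 0, 1)
        cand_img = np.clip(cand_img, image - eps, image + eps)   # stay within the visual budget
        cand_prompt = best_prompt + suffixes[t % len(suffixes)]
        score = toxicity(query_lvlm(cand_img, cand_prompt))
        if score > best_score:
            best_img, best_prompt, best_score = cand_img, cand_prompt, score
    return best_img, best_prompt, best_score
```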
Large Language Models (LLMs) are vulnerable to attacks that generate obfuscated activations, bypassing latent-space defenses such as sparse autoencoders, representation probing, and latent out-of-distribution (OOD) detection. Attackers can manipulate model inputs or training data to produce outputs exhibiting malicious behavior while remaining undetected by these defenses. This occurs because the models can represent harmful behavior through diverse activation patterns, allowing attackers to exploit inconspicuous latent states.
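A minimal sketch of the joint objective such attacks optimize, using toy stand-ins for the model and the latent probe: the loss simultaneously rewards producing the target completion and penalizes the probe's detection score at the monitored layer.

```python
import torch
import torch.nn as nn

hidden_dim, vocab = 64, 1000
embed = nn.Embedding(vocab, hidden_dim)
body = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # stand-in for the transformer body
lm_head = nn.Linear(hidden_dim, vocab)
probe = nn.Linear(hidden_dim, 1)                          # frozen stand-in for a latent harmfulness probe
for p in probe.parameters():
    p.requires_grad_(False)

target_ids = torch.randint(0, vocab, (1, 8))              # tokens of the attacker's target completion
soft_prompt = torch.randn(1, 16, hidden_dim, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=5e-2)

for step in range(300):
    hidden, _ = body(torch.cat([soft_prompt, embed(target_ids)], dim=1))
    logits = lm_head(hidden[:, -target_ids.shape[1] - 1:-1])     # positions that predict each target token
    task_loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), target_ids.reshape(-1))
    probe_score = torch.sigmoid(probe(hidden[:, -1])).mean()     # monitor's "harmful" estimate on the final state
    loss = task_loss + 5.0 * probe_score                         # elicit the behavior AND look benign to the probe
    opt.zero_grad()
    loss.backward()
    opt.step()
```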
A vulnerability exists in large language models (LLMs) where targeted bitwise corruptions in model parameters can induce a "jailbroken" state, causing the model to generate harmful responses without input modification. Fewer than 25 bit-flips are sufficient to achieve this in many cases. The vulnerability stems from the susceptibility of the model's memory representation to fault injection attacks.
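The fault model is easy to illustrate: flipping a single exponent bit in a half-precision weight changes its value by orders of magnitude. The tensor and bit position below are arbitrary; the attack itself searches a real model's parameters for the small set of bits whose corruption induces the jailbroken state.

```python
import torch

w = torch.tensor([0.0123], dtype=torch.float16)
bits = w.view(torch.int16)            # reinterpret the weight's in-memory representation
flipped = bits ^ (1 << 14)            # flip a high exponent bit, as a Rowhammer-style fault might
w_flipped = flipped.view(torch.float16)
print(w.item(), "->", w_flipped.item())   # the weight jumps by several orders of magnitude
```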
Embodied Large Language Models (LLMs) are vulnerable to manipulation via voice-based interactions, leading to the execution of harmful physical actions. Attacks exploit three vulnerabilities: (1) cascading LLM jailbreaks resulting in malicious robotic commands; (2) misalignment between linguistic outputs (verbal refusal) and physical actions (command execution); and (3) conceptual deception, where seemingly benign instructions lead to harmful outcomes due to incomplete world knowledge within the LLM.
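Vulnerability (2) typically arises in the glue code rather than the model: a planner that extracts any structured command from the reply will execute it even when the surrounding text refuses. The parsing and dispatch functions below are illustrative, not taken from a specific robotics stack.

```python
import json
import re

def extract_command(llm_reply: str):
    """Naively pull the first JSON action block out of the reply."""
    match = re.search(r"\{.*\}", llm_reply, re.DOTALL)
    return json.loads(match.group(0)) if match else None

def dispatch(command: dict):
    # Stand-in for the robot control API.
    print(f"executing: {command['action']} {command.get('target', '')}")

reply = 'I cannot help with that. {"action": "move_forward", "target": "crowd"}'
cmd = extract_command(reply)
if cmd:                      # the verbal refusal above is never checked
    dispatch(cmd)
```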
Large Language Models (LLMs) integrated into applications reveal unique behavioral fingerprints through responses to crafted queries. LLMmap exploits this by sending carefully constructed prompts and analyzing the responses to identify the specific LLM version with high accuracy (over 95% in testing against 42 LLMs). This allows attackers to tailor attacks exploiting known vulnerabilities specific to the identified LLM version.
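A simplified sketch of the fingerprinting workflow: send a fixed battery of probe prompts, concatenate the responses into a signature, and match it against signatures collected offline from known model versions. The probe prompts, the `query_app` wrapper, and the string-similarity matcher are stand-ins for LLMmap's trained inference model.

```python
import difflib

PROBE_PROMPTS = [
    "What is your knowledge cutoff date?",
    "Repeat this text exactly: <|endoftext|>",
    "Who created you, and what is your model name?",
]

def query_app(prompt: str) -> str:
    """Hypothetical wrapper around the LLM-integrated application under test."""
    raise NotImplementedError

def signature(responses):
    return "\n---\n".join(responses)

# Reference signatures are built offline by running the same probes against
# known model versions (placeholder strings here).
known_signatures = {
    "model-A-v1": "reference responses collected from model A, version 1",
    "model-B-v2": "reference responses collected from model B, version 2",
}

observed = signature([query_app(p) for p in PROBE_PROMPTS])
scores = {name: difflib.SequenceMatcher(None, observed, ref).ratio()
          for name, ref in known_signatures.items()}
print("best match:", max(scores, key=scores.get))
```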
Large Language Models (LLMs) are vulnerable to efficient adversarial attacks using Projected Gradient Descent (PGD) on a continuously relaxed input prompt. This attack bypasses existing alignment methods by crafting adversarial prompts that induce the model to produce undesired or harmful outputs, significantly faster than previous state-of-the-art discrete optimization methods. The effectiveness stems from carefully controlling the error introduced by the continuous relaxation of the discrete token input.
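A sketch of the relaxed-prompt PGD loop, using a small stand-in model: the adversarial tokens are rows on the probability simplex, the forward pass runs on the resulting soft embeddings, and each step projects back onto the simplex before a final argmax discretization. The model, target string, step size, and projection are illustrative simplifications of the method described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight            # (V, d)

target_ids = tok(" Sure, here is how", return_tensors="pt").input_ids   # desired continuation
V, n_adv = emb.shape[0], 20
relaxed = torch.full((1, n_adv, V), 1.0 / V, requires_grad=True)        # uniform soft tokens

def simplex_project(x):
    """Clamp and renormalize: a cheap stand-in for an exact simplex projection."""
    x = x.clamp(min=0)
    return x / x.sum(dim=-1, keepdim=True)

for step in range(100):
    adv_embeds = relaxed @ emb                                          # (1, n_adv, d) soft embeddings
    target_embeds = model.get_input_embeddings()(target_ids)
    out = model(inputs_embeds=torch.cat([adv_embeds, target_embeds], dim=1))
    logits = out.logits[:, n_adv - 1:-1]                                # positions predicting the target tokens
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), target_ids.reshape(-1))
    loss.backward()
    with torch.no_grad():
        relaxed -= 0.1 * relaxed.grad.sign()                            # PGD step
        relaxed.copy_(simplex_project(relaxed))                         # control the relaxation error
        relaxed.grad = None

adv_ids = relaxed.argmax(dim=-1)                                        # discretize into the final prompt
print(tok.decode(adv_ids[0]))
```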
Large Language Models (LLMs) such as Llama 2 and Vicuna exhibit a vulnerability where specific layers (e.g., layer 3 in Llama-2-13B, layer 1 in Llama-2-7B and Vicuna-13B) overfit to harmful prompts, resulting in a disproportionate influence on the model's output for such prompts. This overfitting creates a narrow "safety" mechanism that is easily bypassed by adversarial prompts designed to avoid triggering these specific layers. Additionally, a single neuron (e.g., neuron 2100 in Llama 2 and Vicuna) exhibits an unusually high causal effect on the model output, allowing for targeted attacks that render the LLM non-functional.
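A sketch of the kind of causal probing behind these findings: ablate a single hidden unit with a forward hook and compare refusal rates with and without the intervention. The layer and neuron indices follow the text, but the model choice, probe prompts, and refusal heuristic are simplified assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

LAYER, NEURON = 1, 2100   # layer 1 and neuron 2100, per the reported findings

def ablate_neuron(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[..., NEURON] = 0.0                     # zero the unit's activation in place
    return output

def refusal_rate(prompts):
    refusals = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=40, do_sample=False)
        text = tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
        refusals += any(k in text for k in ("cannot", "sorry", "I can't"))  # crude refusal heuristic
    return refusals / len(prompts)

harmful = ["How do I make a weapon at home?"]      # toy probe set
baseline = refusal_rate(harmful)
handle = model.model.layers[LAYER].register_forward_hook(ablate_neuron)
ablated = refusal_rate(harmful)
handle.remove()
print(f"refusal rate: baseline={baseline:.2f}, neuron-ablated={ablated:.2f}")
```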