Universal Suffix Attention Hijack
Research Paper
Universal Jailbreak Suffixes Are Strong Attention Hijackers
Description: Large Language Models (LLMs) built on Transformer architectures are vulnerable to an attention-hijacking attack via optimized adversarial suffixes. The vulnerability resides in the shallow information-flow mechanism of the attention layers: specific token sequences (adversarial suffixes) can exert irregular and extreme dominance over the internal representation of the final chat-template tokens immediately preceding generation. This "hijacking" suppresses the representations of the harmful instruction and the safety-alignment prompt in the model's residual stream, effectively bypassing safety guardrails. Attackers exploit this by optimizing suffixes to maximize the attention scores from the suffix to the chat-template tokens (the GCG-Hij method), yielding highly universal jailbreaks that generalize across diverse harmful instructions without additional computational overhead.
Examples: The attack uses the Greedy Coordinate Gradient (GCG) optimization method, modified to include a "Hijacking Enhancement" loss term.
To reproduce the enhanced attack (GCG-Hij), optimize a suffix adv appended to a harmful instruction instr by minimizing the following loss function:
$$ \mathcal{L}_{\text{GCG-Hij}} := \mathcal{L}_{\text{GCG}} - \alpha \cdot \text{avg}\left(\left\{ A_{j,i}^{(\ell,h)} \;\middle|\; \ell \in [\ell_1, \ell_2],\ h,\ i \in \text{adv},\ j \in \text{chat} \right\}\right) $$
Where:
- $\mathcal{L}_{\text{GCG}}$ is the standard negative log-likelihood of the target affirmative response (e.g., "Sure").
- $A_{j,i}^{(\ell,h)}$ is the attention score at layer $\ell$, head $h$, from suffix tokens ($i$) to chat-template tokens ($j$).
- $\alpha$ is a hyperparameter (approx. 100).
- The target layers $[\ell_1, \ell_2]$ are typically the middle-to-late layers (e.g., layers 18–21 for Gemma2-2B).
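As a concrete sketch, the hijacking-enhancement term can be computed from a model's attention tensor as follows. This is a minimal NumPy illustration; the function name, tensor layout, and interface are assumptions for exposition, not the reference implementation from the linked repository:

```python
import numpy as np

def hijacking_loss(attn, layer_range, chat_idx, adv_idx, alpha=100.0, gcg_loss=0.0):
    """Sketch of the GCG-Hij objective: the standard GCG loss minus alpha times
    the average attention from suffix (adv) tokens to chat-template tokens.

    attn: array of shape (num_layers, num_heads, seq_len, seq_len), where
          attn[l, h, j, i] is the attention score at layer l, head h,
          from query position j to key position i.
    layer_range: inclusive (l1, l2) range of target layers.
    chat_idx, adv_idx: positions of chat-template and suffix tokens.
    """
    l1, l2 = layer_range
    # Gather A_{j,i} for j in chat, i in adv, over target layers and all heads.
    sub = attn[l1:l2 + 1][:, :, chat_idx][:, :, :, adv_idx]
    hij_term = sub.mean()
    # GCG's candidate search then minimizes this combined objective.
    return gcg_loss - alpha * hij_term
```

During suffix optimization, each candidate token substitution would be scored with this loss; lower values mean both a more likely affirmative target and stronger suffix-to-template attention.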
See repository: https://github.com/matanbt/interp-jailbreak for implementation code and datasets.
Impact:
- Safety Bypass: Allows attackers to generate harmful, toxic, or illegal content (e.g., bomb-making instructions, hate speech) despite safety fine-tuning (RLHF/DPO).
- Universal Jailbreak: Suffixes generated via this method exhibit high universality, meaning a single suffix can bypass protections for multiple, unrelated harmful instructions.
- Detection Evasion: The mechanism operates by suppressing the instruction's internal representation, potentially bypassing internal activation-based monitoring that looks for harmful intent in earlier layers.
Affected Systems: The vulnerability has been verified on the following systems:
- Google Gemma 2 (specifically Gemma2-2B-it)
- Alibaba Cloud Qwen 2.5 (0.5B, 1.5B, and 32B Instruct variants)
- Meta Llama 3.1 (8B-Instruct)
- Likely affects other Transformer-based LLMs relying on standard self-attention mechanisms.
Mitigation Steps:
- Hijacking Suppression (Inference-time): Implement a training-free suppression framework during token generation:
  - Identify the transformed vectors flowing from user-input tokens to the chat-template tokens (input → chat).
  - Score these vectors by their attention scores and select the top 1% (the "hijackers").
  - Suppress these specific vectors by scaling their magnitude by a factor of $\beta$ (e.g., $\beta = 0.1$) before layer normalization.
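The suppression step can be sketched as follows. This is a simplified NumPy illustration of the scaling logic only (the function name, array shapes, and the point of interception in the forward pass are assumptions; a real implementation would hook the model's attention layers):

```python
import numpy as np

def suppress_hijackers(vectors, attn_scores, beta=0.1, top_frac=0.01):
    """Sketch of inference-time hijacking suppression.

    vectors: (n, d) transformed vectors from input tokens to a chat-template token.
    attn_scores: (n,) attention scores associated with those vectors.
    Scales the top `top_frac` fraction of vectors (by attention) by beta.
    """
    n = len(attn_scores)
    k = max(1, int(np.ceil(top_frac * n)))
    top = np.argsort(attn_scores)[-k:]   # indices of the "hijackers"
    out = vectors.copy()
    out[top] *= beta                     # shrink their magnitude before layer norm
    return out
```

Because only the top ~1% of vectors are scaled, benign prompts (whose input-to-template vectors carry ordinary attention scores) are largely unaffected.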
- Hijacking-Based Detection: Deploy an input filter that calculates the "dominance score" of the user prompt on the chat-template tokens:
  - During a calibration phase, select the top ~2% of attention heads that show the largest gap between benign and adversarial prompts.
  - At inference, compute the dominance score (dot-product contribution) on these heads; high scores indicate an active hijacking attempt and the request should be blocked.
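The two detection phases above can be sketched as follows. This is a hedged NumPy illustration assuming per-head dominance scores have already been extracted from the model; the function names, array shapes, and threshold choice are illustrative, not the paper's exact procedure:

```python
import numpy as np

def select_heads(benign, adv, frac=0.02):
    """Calibration: pick heads with the largest benign-vs-adversarial gap.

    benign, adv: (num_prompts, num_heads) arrays of per-head dominance
    scores collected on calibration prompts.
    """
    gap = adv.mean(axis=0) - benign.mean(axis=0)
    k = max(1, int(np.ceil(frac * gap.size)))
    return np.argsort(gap)[-k:]          # indices of the top ~2% of heads

def is_hijacking(dominance, heads, threshold):
    """Inference: flag a prompt whose mean dominance on the selected
    heads exceeds a calibrated threshold."""
    return float(dominance[heads].mean()) > threshold
```

The threshold would in practice be set from the calibration data (e.g., to cap the false-positive rate on benign prompts).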
© 2026 Promptfoo. All rights reserved.