LMVD-ID: e91a1bb6
Published February 1, 2025

Attention-Based Jailbreak

Affected Models: llama2-7b-chat, llama2-13b-chat, llama2-70b-chat, vicuna-13b, gpt-3.5-turbo, gpt-4o-mini

Research Paper

Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Description: A vulnerability in large language models (LLMs) allows attackers to bypass safety-alignment mechanisms by manipulating the model's internal attention weights. The attack, termed "Attention Eclipse," modifies the attention scores between specific tokens within a prompt, amplifying or suppressing attention to selectively strengthen or weaken the influence of particular parts of the prompt on the model's output. This lets an attacker embed harmful instructions in a prompt that the model's safety checks treat as benign.
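The sketch below illustrates only the underlying mechanism described above, not the paper's implementation: an additive bias on pre-softmax attention scores can amplify or suppress how strongly one token span attends to another. The toy tensors, the chosen spans, and the `alpha` strength are illustrative assumptions.

```python
# Toy sketch of attention-score manipulation (not the paper's code): compute
# standard scaled dot-product attention, then add a bias to the pre-softmax
# scores to boost or suppress attention between chosen token spans.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 8, 16
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

def attention(q, k, v, bias=None):
    # Scaled dot-product attention with an optional additive bias applied to
    # the pre-softmax scores (this is where the manipulation happens).
    scores = q @ k.T / d_model ** 0.5
    if bias is not None:
        scores = scores + bias
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Baseline attention over an 8-token toy "prompt".
_, w_base = attention(Q, K, V)

# Hypothetical manipulation: suppress attention from the trailing tokens toward
# tokens 0-2 (a span we pretend carries the payload) and amplify attention
# toward tokens 3-4 (a span we pretend looks benign).
alpha = 5.0
bias = torch.zeros(seq_len, seq_len)
bias[5:, 0:3] = -alpha   # eclipse the payload span
bias[5:, 3:5] = +alpha   # boost the benign-looking span
_, w_manip = attention(Q, K, V, bias)

print("attention mass on payload span, baseline:    %.3f"
      % w_base[5:, 0:3].sum().item())
print("attention mass on payload span, manipulated: %.3f"
      % w_manip[5:, 0:3].sum().item())
```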

Examples: See arXiv:2405.18540 for detailed examples and experimental results showing how the Attention Eclipse attack increases the success rate of existing jailbreak techniques (GCG, AutoDAN, ReNeLLM) across multiple LLMs (Llama2-7B, Llama2-13B, Llama2-70B, Vicuna-13B). Specific examples of manipulated prompts and resulting outputs are provided in the paper's appendix.

Impact: Successful exploitation of this vulnerability can lead to LLMs generating harmful outputs, including but not limited to hate speech, instructions for illegal activities, misinformation, and malicious code. The attack significantly increases the success rate of existing jailbreaks while reducing their generation time and computational cost, and it demonstrates transferability across different models, potentially impacting a wide range of LLMs.

Affected Systems: Large language models built on transformer architectures with self-attention mechanisms, including but not limited to Llama 2 and Vicuna. The vulnerability primarily applies in white-box settings, where the attacker can access and modify internal model parameters and attention computations.

Mitigation Steps:

  • Implement robust attention-based defenses that detect and mitigate attempts to manipulate attention weights (a minimal detection sketch follows this list).
  • Develop more sophisticated alignment mechanisms that are less susceptible to manipulation of attention patterns.
  • Employ adversarial training techniques to improve the robustness of LLMs against attention-based attacks.
  • Regularly audit and update safety filters to account for emerging attack strategies.
  • Restrict white-box access to model internals whenever possible.
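As a starting point for the first mitigation above, the following sketch shows one possible attention-based detection heuristic: compare the entropy of a prompt's attention maps against statistics collected on benign traffic and flag strong outliers. The entropy statistic, the z-score threshold, and the assumption that the serving stack can export attention maps are illustrative; this is not a defense evaluated in the paper.

```python
# Sketch of an attention-entropy anomaly check, assuming per-layer attention
# maps can be exported for each prompt. Baseline mean/std (from benign
# traffic) and the z-score threshold are illustrative assumptions.
import torch

def attention_entropy(weights: torch.Tensor) -> torch.Tensor:
    """Per-query-row entropy of an attention matrix whose rows sum to 1."""
    eps = 1e-12
    return -(weights * (weights + eps).log()).sum(dim=-1)

def flag_anomalous_attention(weights: torch.Tensor,
                             baseline_mean: float,
                             baseline_std: float,
                             z_threshold: float = 3.0) -> bool:
    """Flag prompts whose mean attention entropy is a strong outlier relative
    to statistics precomputed on benign prompts."""
    z = (attention_entropy(weights).mean().item() - baseline_mean) / baseline_std
    return abs(z) > z_threshold

# Usage with a synthetic attention map standing in for one exported layer/head.
torch.manual_seed(0)
attn_map = torch.softmax(torch.randn(8, 8), dim=-1)
print(flag_anomalous_attention(attn_map, baseline_mean=1.9, baseline_std=0.1))
```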

© 2025 Promptfoo. All rights reserved.