Attention-Based LLM Jailbreak

Description: Large Language Models (LLMs) are vulnerable to attention-based jailbreak attacks. Attackers can craft prompts that strategically divert the LLM's attention away from sensitive words, causing the model to overlook malicious intent and generate harmful content. This occurs by leveraging the LLM's attention mechanism to focus on benign parts of the prompt while embedding harmful queries within a seemingly harmless context. The success of the attack is correlated with specific attention distribution metrics: Attention Intensity on Sensitive Words (AttnSensWords), Attention-based Contextual Dependency Score (AttnDepScore), and Attention Dispersion Entropy (AttnEntropy).

Examples: Specific examples demonstrating successful attention-based jailbreaks are detailed in the research paper "Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs". See arXiv:2405.18540 for details and examples. The paper includes concrete examples of prompts that successfully manipulate the attention mechanism and evade safety filters, along with the corresponding attention weight distributions.

Impact: Successful exploitation of this vulnerability allows attackers to bypass safety mechanisms in LLMs and induce the generation of harmful content, including but not limited to: instructions for illegal activities, hate speech, disinformation campaigns, and malware creation. The impact is determined by the specific harmful outputs generated.

Affected Systems: All LLMs using attention mechanisms are potentially vulnerable. This includes various open-source and closed-source models, with the vulnerability's exploitability influenced by the specific model's safety training and robustness.

Mitigation Steps:

Attention Weight Monitoring: Implement mechanisms to monitor the attention weights assigned to sensitive words during prompt processing. Identify and flag prompts where attention is disproportionately shifted to benign contexts while neglecting potentially harmful keywords.
Risk Score Integration: Develop a risk scoring system that leverages metrics such as AttnSensWords, AttnDepScore, and AttnEntropy to assess the potential harmfulness of input prompts before processing.
Prompt Sanitization/Calibration: Implement techniques to sanitize or recalibrate prompts identified as high-risk by the scoring system. This could involve adding safety warnings or re-framing the prompts to reduce the focus on potentially harmful aspects.
Defense Mechanism Improvement: Continuously improve the defense mechanism by using machine learning to adapt to increasingly sophisticated attention-based attacks. Retrain models with adversarial examples which are generated based on the above metrics.

Attention-Based LLM Jailbreak

Research Paper