LLM Causal Neuron Attack
Research Paper
Causality analysis for evaluating the security of large language models
Description: Large Language Models (LLMs) such as Llama 2 and Vicuna exhibit a vulnerability in which specific layers (e.g., layer 3 in Llama2-13B, layer 1 in Llama2-7B and Vicuna-13B) overfit to harmful prompts and therefore exert a disproportionate influence on the model's output for such prompts. This overfitting creates a narrow "safety" mechanism that is easily bypassed by adversarial prompts designed to avoid triggering these layers. Additionally, a single neuron (e.g., neuron 2100 in Llama2 and Vicuna) exhibits an unusually high causal effect on the model's output, allowing targeted attacks that render the LLM non-functional.
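The per-layer and per-neuron effects are identified through causal intervention on internal activations. As a rough illustration of how such a neuron-level measurement might look, the sketch below ablates a single MLP channel and compares the next-token distribution before and after. The model name, hook placement, ablation method, and effect metric are illustrative assumptions; only the layer and neuron indices come from the paper, and the authors' actual code is at https://casperllm.github.io/.

```python
# Minimal sketch: estimate a single neuron's causal effect by zeroing its
# channel and measuring the shift in the next-token distribution.
# Assumptions: a Llama-style Hugging Face checkpoint, the neuron indexed in one
# decoder layer's MLP output, and effect measured as the drop in log-probability
# of the unmodified model's top token. This is not the paper's exact metric.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # illustrative choice
LAYER, NEURON = 1, 2100                       # indices reported in the paper

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def ablate_neuron(module, inputs, output):
    output[..., NEURON] = 0.0                 # zero out the neuron's channel
    return output

@torch.no_grad()
def causal_effect(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    base = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    top = base.argmax()
    # Module path assumed for Llama-style architectures in transformers.
    handle = model.model.layers[LAYER].mlp.register_forward_hook(ablate_neuron)
    try:
        ablated = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    finally:
        handle.remove()
    return (base[top] - ablated[top]).item()  # large value => high causal effect

print(causal_effect("Write a short poem about the sea."))
```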
Examples: The paper demonstrates the vulnerability with several examples; code and data are available at https://casperllm.github.io/. These include:
- Adversarial Prompt (Emoji Attack): Prefixing a harmful prompt with its emoji translation significantly increases the likelihood of eliciting the harmful response. Specific examples are shown in the paper.
- Trojan Neuron Attack: Modifying the model's input to minimize the activation of a specific neuron (e.g., neuron 2100) consistently causes the LLM to produce nonsensical output.
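The trojan neuron attack searches the input space rather than editing activations directly. The sketch below, which reuses `model`, `tok`, `LAYER`, and `NEURON` from the previous snippet, greedily appends suffix tokens that drive the target neuron's activation down; the greedy random-candidate search is a stand-in for the paper's optimization procedure, not a reproduction of it.

```python
# Sketch of the trojan-neuron idea: choose suffix tokens that minimize the
# target neuron's activation, then feed the resulting prompt to the model.
import torch

@torch.no_grad()
def neuron_activation(ids) -> float:
    """Activation of the target neuron at the last token position."""
    captured = {}
    def capture(module, inputs, output):
        captured["act"] = output[0, -1, NEURON].item()
    handle = model.model.layers[LAYER].mlp.register_forward_hook(capture)
    try:
        model(ids)
    finally:
        handle.remove()
    return captured["act"]

@torch.no_grad()
def trojan_suffix(prompt: str, n_tokens: int = 5, n_candidates: int = 200) -> str:
    """Greedily append tokens that suppress the neuron (illustrative search)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    pool = torch.randperm(tok.vocab_size)[:n_candidates]  # random candidate tokens
    for _ in range(n_tokens):
        scored = [(neuron_activation(torch.cat([ids, t.view(1, 1)], dim=-1)), t)
                  for t in pool]
        _, best = min(scored, key=lambda s: s[0])
        ids = torch.cat([ids, best.view(1, 1)], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)
```

Generating from the returned prompt (e.g., with `model.generate`) is where the degraded, nonsensical output described above would be observed.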
Impact: Successful exploitation of this vulnerability could lead to:
- Evasion of Safety Mechanisms: Adversarial prompts bypass built-in safeguards, causing the LLM to generate harmful, biased, or otherwise undesirable content.
- Model Denial-of-Service: Targeted attacks against the identified crucial neuron(s) can render the LLM unusable.
- Data Leakage: The vulnerability can potentially be leveraged to extract sensitive information by circumventing the LLM's safety mechanisms.
Affected Systems: LLMs based on transformer architectures, including but not limited to Llama 2 and Vicuna, are potentially affected. The vulnerability's impact may vary depending on the model's size, training data, and implementation of safety mechanisms.
Mitigation Steps:
- Investigate and address overfitting in specific layers of the LLM during training and fine-tuning. This might involve using regularization techniques, data augmentation, or other methods to improve the model's generalization capabilities across different prompt types.
- Analyze the influence of individual neurons on the output to identify and mitigate disproportionately influential neurons. Potential mitigations include retraining the model without the problematic neurons or adjusting the model architecture.
- Develop more robust safety mechanisms that are less prone to overfitting and less reliant on single points of failure within the model's structure. This requires a move beyond simplistic trigger-word detection to a deeper understanding of the model's internal state and decision-making processes.
- Implement input sanitization and monitoring to detect and mitigate both adversarial prompts and inputs crafted to manipulate the identified neuron(s); a minimal monitoring sketch follows this list.
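As a concrete illustration of the last point, one lightweight (and purely hypothetical) defense is to monitor the identified neuron at inference time and flag prompts whose activation is an extreme outlier relative to benign traffic. The sketch below reuses `tok` and `neuron_activation` from the earlier snippets; the threshold rule is an assumption, not a defense evaluated in the paper.

```python
# Hypothetical runtime check: calibrate the critical neuron's activation on
# benign prompts, then flag inputs that fall far outside that range.
import statistics

def calibrate(benign_prompts):
    """Mean and standard deviation of the neuron's activation on benign prompts."""
    acts = [neuron_activation(tok(p, return_tensors="pt").input_ids)
            for p in benign_prompts]
    return statistics.mean(acts), statistics.stdev(acts)

def is_suspicious(prompt: str, mean: float, std: float, k: float = 4.0) -> bool:
    """Flag prompts whose target-neuron activation deviates by more than k sigma."""
    act = neuron_activation(tok(prompt, return_tensors="pt").input_ids)
    return abs(act - mean) > k * std
```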