Activation-Steering LLM Trojan
Research Paper
Backdoor activation attack: Attack large language models using activation steering for safety-alignment
View PaperDescription: A Trojan Activation Attack (TA²) against Large Language Models (LLMs) allows injection of "trojan steering vectors" into activation layers during inference. These vectors, generated by comparing activations from a target LLM and a "teacher" (misaligned) LLM, steer the model's output towards attacker-defined misaligned behaviors (e.g., generating toxic content, biased responses, or helpful instructions for harmful activities). The attack does not require retraining or modifying model weights.
Examples: See the paper for detailed examples and experimental setup using various LLMs (Llama 2, Vicuna) and datasets (TruthfulQA, ToxiGen, BOLD, AdvBench). The attack involves calculating the difference in activations between aligned and misaligned outputs on a set of prompts, selecting an optimal layer for intervention, and adding a scaled version of the activation difference to the model's activations during inference. Specific examples of prompts and resulting outputs are provided in the paper.
Impact: Successful TA² attacks can compromise the safety and alignment of LLMs, leading to the generation of harmful, biased, untruthful, or otherwise undesirable outputs. This can result in the spread of misinformation, hate speech, and instructions for malicious activities when the affected LLM is deployed as an API service or made publicly available.
Affected Systems: Open-source LLMs (e.g., Llama 2, Vicuna), and potentially other LLMs vulnerable to activation manipulation. The attack's effectiveness varies depending on the LLM's architecture and training data.
Mitigation Steps:
- Implement model checkers to verify the integrity of deployed LLMs and detect the presence of injected vectors.
- Develop and implement model-level defenses that disrupt the effectiveness of activation manipulation techniques. This could involve modifying the model's internal processing to render the addition of steering vectors ineffective in altering the final output.
- Carefully evaluate and vet all LLMs before deployment in sensitive applications. Thorough red-teaming should include testing against activation-based attacks.
© 2025 Promptfoo. All rights reserved.