One-Shot LLM Steering Attack

Description: Large Language Models (LLMs) are vulnerable to one-shot steering vector optimization attacks. By applying gradient descent to a single training example, an attacker can generate steering vectors that induce or suppress specific behaviors across multiple inputs, even those unseen during the optimization process. This allows malicious actors to manipulate the model's output in a generalized way, bypassing safety mechanisms designed to prevent harmful responses.

Examples: See https://github.com/jacobdunefsky/one-shot-steering-repro. Specific examples include generating steering vectors that:

Achieve a 96.9% success rate in bypassing safety restrictions on harmful prompts in the Harmbench dataset.
Induce harmful behavior in an alignment-faking model on benign prompts.
Suppress harmful behavior in an alignment-faking model on malicious prompts.

Impact: Successful exploitation can lead to:

Bypass of safety filters and generation of harmful, unethical, or illegal content.
Evasion of fact-checking mechanisms and dissemination of misinformation.
Manipulation of the model to perform actions or provide information that benefits the attacker.

Affected Systems: LLMs utilizing internal activation vectors susceptible to gradient-based optimization. This includes, but is not limited to, models based on the LLaMA architecture, and instruction-tuned models such as Gemma-2-2B-it.

Mitigation Steps:

Develop and implement robust defenses against gradient-based attacks.
Explore alternative training methodologies that reduce susceptibility to single-example manipulation.
Regularly audit model behavior for unexpected and undesirable responses.
Employ techniques to improve model robustness to adversarial examples.

One-Shot LLM Steering Attack

Research Paper