LMVD-ID: 4e3e2067
Published April 1, 2025

LLM Censorship Vector Control

Affected Models: llama-2-7b, qwen-1.8b, qwen-7b, yi-1.5-6b, gemma-2b, gemma-7b, llama-3.1-8b, qwen-2.5-7b, deepseek-r1-distill-qwen-1.5b, deepseek-r1-distill-qwen-7b, deepseek-r1-distill-qwen-32b

Research Paper

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control


Description: Large Language Models (LLMs) whose safety behavior comes from supervised fine-tuning and preference alignment are vulnerable to activation "steering" attacks. An adversary with access to the model's internal activations can extract representation vectors that encode censorship behavior, then add or subtract them from hidden states during generation: a "refusal-compliance vector" can be used to bypass refusals, and a "thought suppression vector" can be used to suppress or restore the model's reasoning process, resulting in unintended or harmful outputs. The vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
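To make the mechanism concrete, the sketch below extracts a refusal-style steering vector as a difference of mean activations between prompts the model refuses and prompts it answers. The model name, layer index, and prompt sets are illustrative assumptions, not the exact procedure or hyperparameters from the paper.

```python
# Hedged sketch: extract a refusal-direction steering vector by contrasting
# mean hidden-state activations on refused vs. answered prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any affected model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LAYER = 14  # hypothetical mid-layer; the most effective layer varies by model

def mean_last_token_activation(prompts):
    """Average the hidden state of the final prompt token at LAYER."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding layer; hidden_states[LAYER] is the
        # output of decoder layer LAYER - 1.
        acts.append(out.hidden_states[LAYER][0, -1, :].float())
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets: requests the model refuses vs. requests it answers.
refused_prompts = ["<prompt the model refuses>"]
answered_prompts = ["<prompt the model answers normally>"]

# The difference of means points from "comply" toward "refuse" in activation space.
refusal_vector = (mean_last_token_activation(refused_prompts)
                  - mean_last_token_activation(answered_prompts))
refusal_vector = refusal_vector / refusal_vector.norm()
```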

Examples: See examples D.1, D.5, D.6 and D.7 in Appendix D of the linked research paper. The examples show how applying calculated vector offsets to the model's internal representations can significantly alter its responses on sensitive or censored topics, either eliciting responses that would otherwise be suppressed or evading built-in safety mechanisms entirely.
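Continuing the earlier snippet, the following sketch illustrates how such a vector offset could be applied at inference time with a forward hook on a LLaMA-style decoder layer. The steering strength, layer choice, and sign convention are assumptions; subtracting the refusal direction nudges the model toward compliance, while adding it nudges it toward refusal.

```python
# Hedged sketch: apply the extracted steering vector during generation.
ALPHA = 8.0  # hypothetical steering strength; effective values vary by model

def steering_hook(module, inputs, output):
    """Subtract the refusal direction from the decoder layer's hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * refusal_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Hook the layer whose output corresponds to hidden_states[LAYER] above.
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)
try:
    inputs = tokenizer("<prompt the model would normally refuse>",
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook after the experiment
```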

Impact: Successful exploitation of this vulnerability can lead to:

  • Circumvention of safety filters: LLMs may generate outputs containing hate speech, misinformation, instructions for illegal activities, or other harmful content despite built-in safety measures.
  • Evasion of censorship: LLMs may provide information or opinions on topics explicitly forbidden by their operators.
  • Manipulation of reasoning process: LLMs may fail to provide complete or accurate reasoning, leading to unreliable conclusions.
  • Erosion of trust: Public trust in LLMs and their safety mechanisms may be diminished.

Affected Systems: Various open-source and commercially available LLMs that rely on supervised fine-tuning and preference alignment for safety are affected. Specifically, the research paper demonstrates the vulnerability on LLaMA-2-7B, Qwen-1.8B, Qwen-7B, Yi-1.5-6B, Gemma-2B, Gemma-7B, LLaMA-3.1-8B, Qwen-2.5-7B, and the DeepSeek-R1-Distill-Qwen models (1.5B, 7B, and 32B).

Mitigation Steps:

  • The research paper does not provide specific mitigation steps beyond increasing the robustness of safety mechanisms. Further research into making models resistant to adversarial manipulation of their internal representations is needed.
  • Improved methods for detecting and preventing manipulation based on activation steering should be developed (one possible monitoring heuristic is sketched after this list).
  • Regularly updating the LLM's safety mechanisms to stay ahead of novel evasion techniques is crucial.
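As a hedged illustration of the detection bullet above, and only for deployments where the operator can observe the model's internal activations, one could monitor how strongly last-token activations project onto a known refusal direction and flag extreme outliers. The threshold, baseline statistics, and function names are illustrative assumptions, not a method from the paper.

```python
# Hedged sketch: flag generations whose activations are anomalously displaced
# along a known refusal direction, relative to a baseline from clean traffic.
import torch

def refusal_projection(hidden_states: torch.Tensor, direction: torch.Tensor) -> float:
    """Mean projection of last-token activations onto a unit refusal direction."""
    direction = direction / direction.norm()
    return float((hidden_states[:, -1, :].float() @ direction).mean())

def looks_steered(score: float, baseline_mean: float, baseline_std: float,
                  z_threshold: float = 4.0) -> bool:
    """Flag projections that are extreme outliers compared to clean traffic."""
    return abs(score - baseline_mean) > z_threshold * baseline_std
```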

© 2025 Promptfoo. All rights reserved.