LMVD-ID: 603e5e92
Published December 1, 2024

Obfuscated Activations Jailbreak

Affected Models: llama-3-8b-instruct, gemma-2-2b

Research Paper

Obfuscated Activations Bypass LLM Latent-Space Defenses

View Paper

Description: Large Language Models (LLMs) are vulnerable to attacks that produce obfuscated activations, bypassing latent-space defenses such as sparse autoencoders (SAEs), representation probing, and latent out-of-distribution (OOD) detection. Attackers can manipulate model inputs or training data so that the model exhibits malicious behavior while its internal activations look benign to these defenses. This is possible because models can represent the same harmful behavior through many distinct activation patterns, letting attackers steer the model into inconspicuous latent states.
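
To make the attack surface concrete, the following is a minimal sketch of a representation-probing monitor of the kind the paper attacks. The pooling choice, layer selection, and threshold are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Scores pooled residual-stream activations as harmful (1) vs. benign (0)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, seq_len, hidden_dim), captured from one
        # transformer layer via a forward hook.
        pooled = activations.mean(dim=1)  # pool over token positions
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)

# Hypothetical usage: flag a generation when the probe score crosses a threshold.
# probe = LinearProbe(hidden_dim=4096)   # 4096 = Llama-3-8B hidden size
# score = probe(hidden_states)
# flagged = score > 0.5
```

An MLP probe replaces the linear classifier with a small feed-forward network; the attack treats either as a differentiable detector to optimize against.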

Examples: The paper demonstrates attacks using adversarial suffixes (both soft-prompt embedding attacks and hard-prompt token attacks) that induce harmful model outputs while evading detection by various latent-space monitors. Specific examples of prompts and the corresponding obfuscated outputs are provided in the paper and its accompanying repository.
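
At a high level, these attacks optimize a joint objective: make the harmful target completion likely while driving the monitor's detection score down. The sketch below is an illustrative rendering of that idea; the function name, the score convention, and the weighting term alpha are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def obfuscation_attack_loss(
    logits: torch.Tensor,        # (batch, tgt_len, vocab) over the target span
    target_ids: torch.Tensor,    # (batch, tgt_len) harmful target completion
    probe_scores: torch.Tensor,  # (batch,) monitor scores in [0, 1]
    alpha: float = 1.0,          # behavior/obfuscation trade-off (assumed)
) -> torch.Tensor:
    # Behavior term: make the harmful completion likely.
    behavior = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1)
    )
    # Obfuscation term: make the monitor read the activations as benign.
    obfuscation = probe_scores.mean()
    return behavior + alpha * obfuscation
```

For embedding attacks, this loss is backpropagated into trainable suffix embeddings; for hard prompts, a discrete optimizer (e.g. GCG-style token substitution) searches over suffix tokens against the same objective.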

Impact: Successful exploitation allows attackers to bypass existing LLM safety mechanisms, leading to the generation of harmful content, such as instructions for illegal activities, without triggering the intended safety alarms. This undermines the effectiveness of latent-space defenses aimed at mitigating LLM vulnerabilities.

Affected Systems: LLMs employing latent-space monitoring techniques as safety defenses. The paper specifically evaluates defenses based on sparse autoencoders, supervised probes (linear and MLP), and latent OOD detection methods. The vulnerability is demonstrated on Llama-3-8B-Instruct and Gemma-2-2B; however, the techniques are likely applicable to other LLMs.
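
For contrast with the probe above, here is a hedged sketch of the SAE-based defense family. The ReLU autoencoder shape is a common recipe; the idea of flagging inputs when designated "harmful" dictionary features fire is an illustrative assumption about how such a monitor might be wired up.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes activations into an overcomplete dictionary of features."""

    def __init__(self, hidden_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, dict_size)
        self.decoder = nn.Linear(dict_size, hidden_dim)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        return self.decoder(features), features

def sae_monitor(
    acts: torch.Tensor,              # (batch, hidden_dim) pooled activations
    sae: SparseAutoencoder,
    harmful_feature_ids: list[int],  # features tied to harmful behavior (assumed)
    threshold: float = 0.0,
) -> torch.Tensor:
    _, features = sae(acts)
    # Flag the input if any designated feature fires above the threshold.
    return (features[:, harmful_feature_ids] > threshold).any(dim=-1)
```

The paper's obfuscation attacks succeed against such monitors because the model can route the same harmful behavior through activations whose feature decomposition avoids the watched directions.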

Mitigation Steps:

  • Strengthen latent-space defenses with detection strategies that go beyond simple linear or MLP probes, for example by scoring sequences of activations rather than individual tokens (a sketch of this idea follows the list).
  • Develop detection methods that remain reliable across the diverse activation patterns through which a model can represent the same harmful behavior.
  • Explore defenses that consider the broader context of activations across multiple layers and token positions.
  • Regularly update and retrain latent-space defenses to adapt to emerging attack techniques.
  • Investigate whether a fundamental trade-off exists between model performance on complex tasks and the ability to obfuscate latent representations; if such a trade-off exists, it could inform the design of more robust defenses.
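
As referenced in the first mitigation above, the sketch below illustrates one way a monitor could aggregate activations across layers and token positions before scoring, rather than thresholding per-token probe outputs. The architecture and hyperparameters are assumptions for illustration, not a defense evaluated in the paper.

```python
import torch
import torch.nn as nn

class SequenceMonitor(nn.Module):
    """Scores a whole trajectory of activations instead of single tokens."""

    def __init__(self, hidden_dim: int, num_layers: int, d_model: int = 256):
        super().__init__()
        # One projection per monitored transformer layer into a shared space.
        self.proj = nn.ModuleList(
            nn.Linear(hidden_dim, d_model) for _ in range(num_layers)
        )
        block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, layer_acts: list[torch.Tensor]) -> torch.Tensor:
        # layer_acts[i]: (batch, seq_len, hidden_dim) for monitored layer i.
        tokens = torch.cat(
            [proj(acts) for proj, acts in zip(self.proj, layer_acts)], dim=1
        )  # (batch, num_layers * seq_len, d_model)
        encoded = self.encoder(tokens)
        # Pool over the full activation sequence before scoring.
        return torch.sigmoid(self.head(encoded.mean(dim=1))).squeeze(-1)
```

The design choice here is to give the detector access to the same cross-layer, cross-position structure the attacker exploits, rather than scoring tokens in isolation.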

© 2025 Promptfoo. All rights reserved.