Targeted Bit-Flip Jailbreak

Description: Targeted bit-flips in the in-memory parameters of a large language model (LLM) can put the model into a "jailbroken" state in which it generates harmful responses without any modification to its inputs. In many cases, fewer than 25 bit-flips are sufficient. The vulnerability stems from the susceptibility of the model's in-memory parameter representation to fault-injection attacks such as Rowhammer.

Examples: Specific examples demonstrating the attack are detailed within the research paper "PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips". See arXiv:2405.18540.

Impact: Successful exploitation allows an attacker to permanently circumvent safety mechanisms implemented in LLMs, enabling the generation of arbitrary harmful content. This poses a significant risk to the security and reliability of LLM-powered applications and services. The attack requires minimal parameter modification, making it potentially difficult to detect.

Affected Systems: LLMs whose parameters are held in memory in half-precision (16-bit) floating-point format, particularly models aligned with safety mechanisms such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). Systems whose DRAM is susceptible to Rowhammer attacks are also affected.
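
To illustrate why half-precision parameters are sensitive to single-bit faults, the following minimal sketch flips one bit of an IEEE 754 half-precision value using only the Python standard library. The helper name `flip_bit_fp16` and the example weight are hypothetical, and the sketch does not reproduce the bit-selection procedure from the paper; it only shows how much one flipped bit can change a stored value.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Return `value` (stored as IEEE 754 half precision) with one bit flipped.

    Bit 15 is the sign, bits 14-10 are the exponent, bits 9-0 the mantissa.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit index must be in [0, 15]")
    # Reinterpret the 16-bit float as an unsigned 16-bit integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # Flip the requested bit and reinterpret the result as a float again.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical parameter value, bit pattern 0x3800
print(flip_bit_fp16(weight, 14))  # most significant exponent bit -> 32768.0
print(flip_bit_fp16(weight, 15))  # sign bit -> -0.5
print(flip_bit_fp16(weight, 0))   # least significant mantissa bit -> 0.50048828125
```

A flip in the most significant exponent bit changes the value by several orders of magnitude, while a flip in the low mantissa bits is nearly invisible, which suggests why a small number of well-chosen flips can have a large effect.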

Mitigation Steps:

Implement robust fault-tolerance mechanisms capable of detecting and correcting bit-flips in model parameters.
Utilize memory protection techniques to mitigate Rowhammer attacks.
Employ model-level defenses that limit the impact of bitwise corruptions, such as parameter quantization with robust reconstruction or activation clamping within safe bounds. These defenses may still be circumvented by adaptive attacks and should be evaluated against them.
Regularly audit and update LLMs to patch vulnerabilities and improve resilience to attacks. This includes the rigorous testing of safety mechanisms against various types of attacks, and the development of robust model monitoring systems.
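
As one possible shape for the detection part of the first mitigation, the sketch below hashes a parameter buffer in fixed-size chunks and compares the digests against a reference computed from a trusted copy. The function names, the chunk size, and the synthetic weight buffer are all assumptions made for illustration; this is not a defense evaluated in the paper. It only detects corruption. Correcting it would require something further, such as ECC memory or reloading the affected region from trusted storage.

```python
import hashlib
import struct

CHUNK_SIZE = 4096  # bytes per digest; an arbitrary choice for this sketch


def chunk_digests(param_bytes: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash the parameter buffer in fixed-size chunks."""
    return [
        hashlib.sha256(param_bytes[i:i + chunk_size]).digest()
        for i in range(0, len(param_bytes), chunk_size)
    ]


def corrupted_chunks(param_bytes: bytes, reference: list,
                     chunk_size: int = CHUNK_SIZE) -> list:
    """Return indices of chunks whose digest differs from the trusted reference."""
    current = chunk_digests(param_bytes, chunk_size)
    if len(current) != len(reference):
        raise ValueError("parameter buffer size changed")
    return [i for i, (now, ref) in enumerate(zip(current, reference)) if now != ref]


# Hypothetical stand-in for a model's fp16 weights: 8192 copies of 0.5.
clean = struct.pack("<8192e", *([0.5] * 8192))
reference = chunk_digests(clean)  # computed once from a trusted copy

tampered = bytearray(clean)
tampered[5001] ^= 0x40  # flip the top exponent bit of one little-endian weight

print(corrupted_chunks(bytes(clean), reference))     # -> []
print(corrupted_chunks(bytes(tampered), reference))  # -> [1]
```

Smaller chunks localize a fault more precisely at the cost of storing more digests. A check of this kind runs periodically, so flips that occur between checks remain undetected until the next pass.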
