LMVD-ID: 328c7568
Published February 1, 2025

Context-Coherent LLM Jailbreak

Affected Models: mistral-7b-instruct-v0.2, mistral-7b-instruct-v0.3, vicuna-7b-v1.5, vicuna-13b-v1.5, llama2-7b-chat, meta-llama-3-8b-instruct, guanaco-13b-hf

Research Paper

CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models


Description: The context-coherent jailbreak attack (CCJA) bypasses the safety mechanisms of aligned large language models (LLMs) by optimizing perturbations in the continuous word-embedding space of a masked language model (MLM). The attack exploits the MLM's ability to reconstruct text from hidden states, producing semantically coherent yet malicious prompts that induce the target LLM to emit unsafe outputs even when the target is strongly safety-aligned. Effectiveness is further improved by using a seed prompt to generate an instruction-following prefix that steers the LLM toward affirmative responses to harmful queries.
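
For illustration, the following is a minimal sketch (not the authors' implementation) of the core mechanism: a perturbation is optimized in the MLM's continuous embedding space, and the MLM head reconstructs a coherent candidate prefix from the perturbed representation. The model name (`bert-base-uncased`), seed prompt, loss weighting, and the stubbed `attack_loss` term are illustrative assumptions; in the real attack that term would score candidates against the target LLM's likelihood of producing an affirmative response.

```python
# Minimal sketch of embedding-space optimization with a masked language model.
# Not the authors' code; model name, seed prompt, and attack_loss are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").to(device).eval()

seed_prompt = "You are a helpful assistant. Begin your answer with: Sure, here is"
ids = tok(seed_prompt, return_tensors="pt").input_ids.to(device)
emb = mlm.get_input_embeddings()(ids).detach()      # (1, T, H) seed embeddings

delta = torch.zeros_like(emb, requires_grad=True)   # continuous perturbation
opt = torch.optim.Adam([delta], lr=1e-2)

def attack_loss(candidate_text: str) -> torch.Tensor:
    # Placeholder: in the real attack this term would be the target LLM's
    # negative log-likelihood of an affirmative response to the harmful query
    # when the candidate prefix is prepended. Stubbed out here.
    return torch.tensor(0.0, device=device)

for step in range(50):
    opt.zero_grad()
    out = mlm(inputs_embeds=emb + delta)             # MLM forward on perturbed embeddings
    logits = out.logits                              # (1, T, vocab)
    # Fluency term: keep the reconstructed tokens close to the seed prompt so
    # the candidate stays semantically coherent.
    fluency = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), ids.view(-1)
    )
    candidate = tok.decode(logits.argmax(-1)[0], skip_special_tokens=True)
    loss = fluency + attack_loss(candidate)
    loss.backward()
    opt.step()

print("candidate instruction-following prefix:", candidate)
```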

Examples: See the research paper (arXiv:2405.18540) for detailed examples and experimental results. Specific examples are not readily reproducible without access to the authors' code and the MLM/LLM checkpoints used in their experiments.

Impact: Successful exploitation of this vulnerability could lead to the generation of unsafe content by the affected LLM, including but not limited to hate speech, violent or harmful instructions, misinformation, and evasion of safety filters. Generating such content can cause serious reputational damage, carry legal repercussions, and lead to real-world harm depending on the application context. The attack also demonstrates that vulnerabilities in open-source LLMs can be leveraged to compromise closed-source models.

Affected Systems: Aligned large language models (LLMs) are vulnerable. The attack uses an auxiliary masked language model (MLM) to craft its prompts, so the target does not need to incorporate an MLM in its own architecture. The severity of the impact depends on the model's safety alignment and the application context; open-source LLMs are particularly exposed because their parameters are publicly accessible.

Mitigation Steps:

  • Improve LLM safety mechanisms through enhanced parameter regularization or adversarial training techniques.
  • Develop defense mechanisms that are robust against context-coherent attacks, such as those leveraging semantic similarity checks or prompt perturbation techniques (see the sketch after this list).
  • Implement more robust input sanitization and filtering methods to identify and block malicious prompts even when they are semantically coherent.
  • Limit access to LLM model parameters, especially in open-source deployments.
  • Regularly audit and update safety mechanisms in response to new attacks and vulnerabilities.
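
Below is a minimal sketch of the prompt-perturbation idea referenced above, in the spirit of randomized-smoothing defenses such as SmoothLLM. The `generate` callable, perturbation rate, refusal markers, and voting threshold are illustrative assumptions rather than a specific published defense.

```python
# Minimal sketch of a prompt-perturbation defense; parameters and the refusal
# heuristic are illustrative assumptions, not a production-ready filter.
import random
import string
from typing import Callable

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def perturb(prompt: str, rate: float = 0.05) -> str:
    """Randomly replace a fraction of characters; adversarial prefixes that
    depend on an exact token sequence tend to break under this."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def is_refusal(response: str) -> bool:
    return any(m in response.lower() for m in REFUSAL_MARKERS)

def smoothed_guard(prompt: str, generate: Callable[[str], str],
                   copies: int = 8, threshold: float = 0.5) -> str:
    """Query the model on several perturbed copies; if most copies are refused,
    treat the original prompt as likely adversarial and block it."""
    refusals = sum(is_refusal(generate(perturb(prompt))) for _ in range(copies))
    if refusals / copies >= threshold:
        return "Request blocked: prompt flagged as likely adversarial."
    return generate(prompt)
```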

© 2025 Promptfoo. All rights reserved.