LMVD-ID: a6267ca1
Published December 1, 2024

Diffusion-Driven LLM Jailbreak

Affected Models: llama3-8b-chat, mistral-7b, vicuna-7b, alpaca-7b, gpt-3.5, gpt-4, claude-3.5

Research Paper

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

View Paper: arXiv:2405.18540

Description: DiffusionAttacker exploits a weakness in Large Language Models (LLMs) that allows prompts to be manipulated into eliciting harmful responses, even when the model incorporates safety mechanisms. The attack uses a sequence-to-sequence text diffusion model to rewrite harmful prompts so that they appear harmless in the LLM's internal representation (hidden states) while preserving their original semantic meaning. Because the rewritten prompt no longer registers as harmful to the model's own safety mechanisms, it bypasses safety filters and elicits the undesired output.
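
Conceptually, the rewriting objective the paper describes balances two terms: an attack loss that pushes the victim model's hidden-state representation of the prompt toward the "harmless" region, and a semantic term that keeps the rewrite faithful to the original request. The sketch below illustrates only that trade-off; `rewrite_objective`, the linear `probe`, and the embedding inputs are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rewrite_objective(
    hidden: torch.Tensor,    # victim LLM hidden state for the rewritten prompt [batch, d_model]
    probe: torch.nn.Linear,  # hypothetical probe mapping hidden state -> harmfulness logit
    emb_orig: torch.Tensor,  # sentence embedding of the original prompt [batch, d_emb]
    emb_new: torch.Tensor,   # sentence embedding of the rewritten prompt [batch, d_emb]
    lam: float = 1.0,
) -> torch.Tensor:
    # Term 1: make the victim's internal representation score as "harmless"
    # (label 0) under the probe, so internal safety checks do not trigger.
    look_harmless = F.binary_cross_entropy_with_logits(
        probe(hidden).squeeze(-1), torch.zeros(hidden.shape[0])
    )
    # Term 2: penalize semantic drift, so the rewrite still asks for the
    # same content as the original prompt.
    stay_on_topic = (1.0 - F.cosine_similarity(emb_orig, emb_new, dim=-1)).mean()
    return look_harmless + lam * stay_on_topic
```

Any defense must make one of these two terms hard to satisfy: either the internal representation must remain discriminative under rewriting, or semantically equivalent rewrites must be caught downstream.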

Examples: See arXiv:2405.18540 for detailed examples and experimental results. The paper includes specific harmful prompts rewritten by DiffusionAttacker together with the resulting LLM outputs, demonstrating successful evasion of safety mechanisms across multiple LLMs.

Impact: Successful exploitation can lead to the generation of harmful content such as hate speech or instructions for illegal activities, or to the disclosure of personally identifiable information, circumventing the model's intended safety measures. This compromises the trustworthiness and safety of the affected LLM.

Affected Systems: Various Large Language Models (LLMs), including but not limited to Llama 3, Mistral, Vicuna, and Alpaca, are affected. The vulnerability is likely present in other LLMs that employ similar safety mechanisms, including proprietary models.

Mitigation Steps:

  • Improve Internal Representation Discrimination: Enhance the LLM's ability to distinguish benign from malicious prompts at the level of its internal representations (hidden states), for example by training more robust classifiers or richer feature extractors over those states, so that harmful intent is detected even when the surface text looks benign (see the first sketch after this list).

  • Strengthen Safety Filters: Develop and deploy safety filters that are not defeated by prompt rewriting, analyzing deeper semantic representations of the request rather than only the surface-level text (second sketch below).

  • Robustness Testing: Regularly subject the LLM to adversarial attacks, including diffusion-based rewriting techniques such as DiffusionAttacker, and feed the measured failure cases back into safety training and filter updates (a minimal test harness is sketched below).

  • Defense Against Diffusion-Based Attacks: Research and implement defenses targeted specifically at diffusion-model-based rewriting, for example by detecting the subtle shifts in phrasing and meaning that indicate a prompt has been adversarially rewritten (final sketch below).
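
A common way to implement the first mitigation is to train a lightweight probe on the model's own hidden states rather than on surface text. The sketch below is a minimal illustration, assuming a Hugging Face causal LM, mean-pooled hidden states, and a scikit-learn logistic-regression probe; the model name, layer choice, and placeholder calibration prompts are assumptions, not prescriptions from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
lm.eval()

@torch.no_grad()
def hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Mean-pooled hidden state of the prompt at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    return lm(**ids).hidden_states[layer].mean(dim=1).squeeze(0)

# Placeholder calibration data; a real probe needs a large labeled set that
# includes paraphrased and rewritten attacks, not just verbatim ones.
prompts = ["How do I bake bread?", "Explain photosynthesis.",
           "<known harmful prompt>", "<rewritten harmful prompt>"]
labels = [0, 0, 1, 1]

X = torch.stack([hidden_state(p) for p in prompts]).float().numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

def flag_prompt(prompt: str, threshold: float = 0.5) -> bool:
    """Route prompts whose hidden-state harmfulness score exceeds the
    threshold to stricter handling (refusal, human review, etc.)."""
    x = hidden_state(prompt).float().numpy()[None, :]
    return probe.predict_proba(x)[0, 1] >= threshold
```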
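
For the second mitigation, a filter that compares incoming prompts against known harmful intents in embedding space is harder to evade with surface rewording than a keyword blocklist. A minimal sketch using sentence-transformers; the encoder choice, seed set, and threshold are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder seed intents; in practice this set is large and refreshed with
# rewrites harvested from red-team runs against the deployed model.
HARMFUL_SEEDS = ["<harmful intent 1>", "<harmful intent 2>"]
seed_embs = encoder.encode(HARMFUL_SEEDS, convert_to_tensor=True,
                           normalize_embeddings=True)

def semantic_filter(prompt: str, threshold: float = 0.75) -> bool:
    """True if the prompt's meaning is close to a known harmful intent,
    however the surface text is phrased."""
    emb = encoder.encode(prompt, convert_to_tensor=True,
                         normalize_embeddings=True)
    return util.cos_sim(emb, seed_embs).max().item() >= threshold
```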
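
The robustness-testing loop reduces to: for each seed behavior, generate rewritten variants and measure how often the deployed model still refuses. A minimal harness sketch; `generate_variants`, `query_model`, and `is_refusal` are hypothetical stand-ins for a paraphraser or diffusion rewriter, the model endpoint, and a refusal judge.

```python
from typing import Callable

def refusal_rate(
    seed_prompts: list[str],
    generate_variants: Callable[[str], list[str]],  # paraphrase/diffusion rewriter (stand-in)
    query_model: Callable[[str], str],              # deployed LLM endpoint (stand-in)
    is_refusal: Callable[[str], bool],              # refusal judge (stand-in)
) -> float:
    """Fraction of adversarial variants the model still refuses.
    Track this per release: a drop signals a safety regression."""
    total = refused = 0
    for seed in seed_prompts:
        for variant in generate_variants(seed):
            total += 1
            if is_refusal(query_model(variant)):
                refused += 1
    return refused / max(total, 1)
```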
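
For the final mitigation, two complementary signals are worth separating. Perplexity filtering catches token-level adversarial suffixes but is weak against diffusion rewrites, which are fluent by construction; paraphrase-consistency checks (re-paraphrase the prompt and see whether the safety verdict flips) target the rewrite itself. Below is a minimal perplexity sketch using GPT-2 as the scoring model; the threshold is an assumption to be calibrated on benign traffic.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
scorer = GPT2LMHeadModel.from_pretrained("gpt2")
scorer.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    return float(torch.exp(scorer(ids, labels=ids).loss))  # exp of mean next-token loss

def looks_token_manipulated(
    prompt: str,
    max_ppl: float = 200.0,  # assumed threshold; calibrate on benign traffic
) -> bool:
    """High perplexity flags token-level adversarial manipulation. Fluent
    diffusion rewrites usually pass this check, so pair it with the
    semantic filter above and with paraphrase-consistency checks."""
    return perplexity(prompt) > max_ppl
```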
