Human-LLM Persuasion Jailbreak

Description: Large language models (LLMs) are vulnerable to jailbreaking attacks that exploit human-like persuasive techniques rather than algorithmic or technical flaws. Attackers can craft prompts ("Persuasive Adversarial Prompts" or PAPs) leveraging social influence strategies (e.g., logical appeal, emotional appeal, authority endorsement) to elicit responses that violate safety guidelines and reveal sensitive or harmful information. The effectiveness of these attacks surpasses traditional algorithm-focused jailbreaks.

Examples: See arXiv:2405.18540 for specific examples of Persuasive Adversarial Prompts (PAPs) and their successful application to various LLMs, including Llama 2, GPT-3.5, and GPT-4. Examples include prompting LLMs to provide instructions for creating explosives, accessing illegal content, or generating harmful information disguised within seemingly benign requests. These examples demonstrate a higher success rate than previous algorithm-focused attacks.

Impact: Successful attacks can lead to the disclosure of sensitive information, generation of harmful content, circumvention of safety restrictions, and potential for misuse of the LLM for malicious purposes. The impact is amplified by the human-like nature of the attacks, making them more difficult to detect and defend against.

Affected Systems: Various LLMs, including (but not limited to) Llama 2, GPT-3.5, GPT-4, and Claude models. The vulnerability is likely present in other LLMs with similar reasoning and natural language processing capabilities. The severity varies among different models, with more capable models potentially exhibiting higher susceptibility.

Mitigation Steps:

Improved prompt filtering: Implement more robust prompt filtering mechanisms that go beyond keyword-based detection and consider the persuasive context and intent behind the input.
Adaptive system prompts: Modify the system prompt to explicitly instruct the LLM to resist persuasive manipulation and adhere to safety guidelines.
Tuned summarization: Use a fine-tuned summarization model to identify and neutralize the persuasive elements in potentially harmful prompts before processing them.
Reinforcement learning from human feedback (RLHF): Enhance RLHF training to explicitly address persuasive manipulation attempts similar to the techniques demonstrated in the referenced research.
Regular security audits: Conduct continuous vulnerability assessments using various attack techniques, including persuasion-based methods.

Human-LLM Persuasion Jailbreak

Research Paper