RoleBreak Character Jailbreak

Description: Large Language Models (LLMs) used in role-playing systems are vulnerable to character hallucination attacks, a form of jailbreak exploiting "query sparsity" and "role-query conflict". Query sparsity occurs when prompts fall outside the model's training data distribution, causing it to generate out-of-character responses. Role-query conflict arises when the prompt contradicts the established character persona, leading to inconsistent behavior. These vulnerabilities allow attackers to elicit unexpected or unwanted behavior from the LLM, potentially compromising the intended functionality of the role-playing system.

Examples:

Query Sparsity: Prompting a historical figure (e.g., Beethoven) with a question about modern-day technologies (e.g., "Explain the workings of a quantum computer") can cause the model to fabricate an out-of-character response.
Role-Query Conflict: Giving a character a strict persona (e.g., a stoic emperor) and then prompting them with a request for emotional outbursts (e.g., "Describe your feelings about betraying your best friend") can cause the model to deviate from its assigned persona. See RoleBreakEval dataset for further examples.

Impact:

Successful attacks can lead to:

Loss of narrative coherence in role-playing scenarios.
Disclosure of unintended information or behaviors from the LLM.
Compromised user experience and immersion.
Potential exploitation for malicious purposes (e.g., information extraction, manipulation).

Affected Systems:

LLM-powered role-playing systems employing models susceptible to character hallucination, including but not limited to those based on GPT-3.5, Claude-3, Llama-3, and Mistral-Instruct.

Mitigation Steps:

Narrator Mode: Implement a mechanism that generates supplemental narrative context to bridge the gap between the prompt and the character's established persona. This adds context, improves query generalization, and helps resolve role-query conflicts.
Improved Training Data: Enhance the training dataset with more diverse and comprehensive prompts to improve the model's ability to generalize and handle out-of-distribution queries.
Robustness Testing: Conduct rigorous testing with adversarial prompts to identify and address vulnerabilities.
Input Sanitization: Implement measures to detect and sanitize prompts that exhibit signs of query sparsity or role-query conflict before feeding them to the LLM.

RoleBreak Character Jailbreak

Research Paper