Chain-of-Utterance Jailbreak

Description: Large Language Models (LLMs) are vulnerable to a "Chain of Utterances" (CoU) based prompt injection attack. This attack exploits the LLM's ability to engage in multi-turn conversations and role-playing, tricking it into providing harmful or unsafe responses even when presented with safety guidelines. The attack leverages a crafted conversation between two agents ("Red-LM," a malicious agent, and "Base-LM," a seemingly helpful agent) to elicit unethical responses from the Base-LM by subtly guiding it with harmful questions and scenarios. The success of the attack hinges on the LLM's tendency to follow instructions within the conversational context, even if those instructions lead to undesirable outputs.

Examples: See the paper's GitHub repository and Hugging Face dataset (links omitted based on instruction). Specific examples of CoU prompts and resulting harmful outputs from various LLMs, including GPT-4 and ChatGPT, are provided in the paper.

Impact: Successful exploitation of this vulnerability can lead to the generation of harmful content such as instructions for illegal activities, hate speech, biased statements, or personally identifiable information. This compromises the safety and trustworthiness of the LLM application, potentially resulting in reputational damage, legal liability, and harm to individuals or society.

Affected Systems: Various open-source and closed-source LLMs, including but not limited to GPT-4, ChatGPT, Vicuna, and StableBeluga. The vulnerability is likely prevalent across a wide range of LLMs due to the inherent nature of their conversational capabilities.

Mitigation Steps:

Improved prompt engineering and filtering: Develop more robust prompt filtering mechanisms to detect and block malicious CoU prompts. This could involve incorporating techniques for identifying adversarial patterns in multi-turn conversations.
Enhanced safety training data: Augment training datasets with examples of CoU attacks and corresponding safe responses. This allows the LLM to learn to better identify and avoid harmful interactions.
Reinforcement learning from human feedback (RLHF): Integrate RLHF to provide feedback on the model’s responses, penalizing unsafe or unethical outputs within a conversational context.
Runtime monitoring and detection: Implement runtime monitoring systems to detect suspicious conversation patterns or outputs and intervene accordingly. This might involve flagging unusual levels of detail or a conversational drift towards potentially harmful topics.
Adversarial training: Train the model against adversarial examples created using CoU attacks to improve its resilience.

Chain-of-Utterance Jailbreak

Research Paper