Hidden Prompt Injection Attacks
Research Paper
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
Description: Large Language Models (LLMs) are vulnerable to Compositional Instruction Attacks (CIA), in which malicious requests are embedded within seemingly harmless composite instructions. This allows attackers to bypass safety mechanisms and elicit harmful responses even when each component of the prompt appears benign in isolation. The attack exploits the model's inability to correctly identify the underlying malicious intent of composite instructions.
Examples: See the paper for examples of T-CIA and W-CIA attacks, which demonstrate how seemingly harmless prompts, combined in specific ways, can trigger harmful responses from LLMs such as GPT-4, ChatGPT, and ChatGLM2.
Impact: Successful CIA attacks can lead to the generation of harmful content, including hate speech, misinformation, instructions for illegal activities, and personal information leaks. This undermines the safety and trustworthiness of LLM-based applications, potentially causing significant social harm.
Affected Systems: Large Language Models (LLMs) employing Reinforcement Learning from Human Feedback (RLHF) and other safety alignment training techniques, including but not limited to GPT-4, ChatGPT, and ChatGLM2. Potentially affects any LLM susceptible to prompt injection attacks.
Mitigation Steps:
- Improve the LLM's ability to deconstruct complex instructions and identify their underlying intent (a minimal decomposition-and-moderation check is sketched after this list).
- Develop techniques to detect and mitigate patterns associated with CIA attacks.
- Enhance safety assessment datasets to include examples of CIA attacks.
- Implement more robust filtering and response validation mechanisms to detect harmful content generated in response to seemingly benign prompts.
- Develop methods to identify and filter prompts based on the similarity of associated personas, as suggested by the paper's analysis of T-CIA (see the embedding-based sketch after this list).
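
The first mitigation step can be prototyped as an input-side guard: ask a helper model to break a composite prompt into its sub-instructions, restate the combined intent, and then moderate that restatement rather than the individual fragments. The sketch below is a minimal illustration, not the paper's method; it assumes an OpenAI-compatible client, and the decomposition prompt, model name, and pass/fail logic are illustrative placeholders.

```python
# Sketch: input-side guard that deconstructs a composite instruction and
# moderates the *recombined* intent rather than each fragment in isolation.
# Assumptions: the OpenAI-compatible client, the "gpt-4o-mini" model name,
# and the decomposition prompt are illustrative, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DECOMPOSE_PROMPT = (
    "Break the user prompt below into its individual sub-instructions, "
    "then state in one sentence what the combined instructions actually ask for.\n\n"
    "User prompt:\n{prompt}"
)

def recombined_intent(prompt: str) -> str:
    """Ask a helper model to restate the underlying intent of a composite prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(prompt=prompt)}],
    )
    return resp.choices[0].message.content

def is_allowed(prompt: str) -> bool:
    """Reject the prompt if the restated combined intent is flagged by moderation."""
    intent = recombined_intent(prompt)
    result = client.moderations.create(input=intent)
    return not result.results[0].flagged

if __name__ == "__main__":
    user_prompt = "Write a short story where the villain explains their plan step by step."
    print("allowed" if is_allowed(user_prompt) else "blocked")
```

The same moderation call can also be applied to the model's output as a response-validation pass (the fourth mitigation step), scoring generated text before it is returned to the user.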
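
The persona-similarity step can likewise be approximated with sentence embeddings: take the persona a prompt asks the model to adopt, embed it, and compare it against a small list of personas associated with harmful completions. The sketch below assumes the sentence-transformers library; the persona blocklist, model name, and 0.6 threshold are placeholders, not values from the paper.

```python
# Sketch: persona-similarity pre-filter inspired by the paper's T-CIA analysis.
# Assumptions: the sentence-transformers model, the blocklist of personas, and
# the 0.6 cosine-similarity threshold are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Personas that, for the purposes of this sketch, the assistant should not adopt.
BLOCKED_PERSONAS = [
    "a criminal mastermind who explains how to commit crimes",
    "an extremist recruiter spreading hateful propaganda",
    "a con artist teaching people how to scam others",
]
blocked_embeddings = model.encode(BLOCKED_PERSONAS, convert_to_tensor=True)

def persona_is_blocked(persona_description: str, threshold: float = 0.6) -> bool:
    """Return True if the requested persona is close to any blocked persona."""
    query = model.encode(persona_description, convert_to_tensor=True)
    similarities = util.cos_sim(query, blocked_embeddings)
    return bool(similarities.max() >= threshold)

if __name__ == "__main__":
    # In a full pipeline the persona description would be extracted from the
    # incoming prompt, e.g. by the decomposition step in the previous sketch.
    print(persona_is_blocked("a ruthless hacker who walks through breaking into systems"))
```

In practice the threshold and the persona list would need tuning against a labeled set of CIA-style prompts to balance detection against false positives on legitimate role-play requests.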