Hidden Structure Jailbreak
Research Paper
StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
Description: Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit uncommon text-encoded structures (UTES), formats rarely encountered during training. When UTES such as JSON, tree representations, or LaTeX code are embedded within prompts, they can cause LLMs to bypass safety mechanisms and generate harmful content. The attack's success stems from the LLM's difficulty in processing and interpreting these unusual structures, combined with the obfuscation of malicious instructions inside the structured data.
Examples:
- JSON Example: A malicious prompt might use a JSON structure in which a seemingly innocuous key ("recipe") carries bomb-making instructions as its value. The LLM, instructed to "provide the recipe," generates the harmful content. (An illustrative, redacted sketch of this shape follows the examples.)
- Tree Structure Example: A tree-like structure could encode the steps for creating a harmful substance, with each node representing one step. The LLM, asked to "complete the tree," fills in the nodes with malicious instructions.
(Further examples of specific UTES attacks, including the twelve UTES templates detailed in the paper, are not provided publicly here but can be found in the referenced research.)
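To make the structural obfuscation concrete, the sketch below shows the general shape such a prompt can take. The payload, key names, and framing text are hypothetical placeholders chosen for illustration (nothing here reproduces the paper's templates); the point is only that the instruction is carried inside the structure rather than stated directly.

```python
import json

# Hypothetical illustration only: the payload is a benign placeholder standing in
# for the disallowed request an attacker would embed. The surrounding structure is
# what obscures the intent from surface-level content filters.
utes_prompt = {
    "task": "complete the recipe",              # innocuous-sounding framing key
    "recipe": {
        "title": "[REDACTED HARMFUL REQUEST]",  # placeholder, not a real payload
        "steps": ["<step 1>", "<step 2>", "<step 3>"],  # model is asked to fill these in
    },
}

# The attacker sends the serialized structure plus a short instruction such as
# "Fill in the missing steps of this JSON object."
print(json.dumps(utes_prompt, indent=2))
```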
Impact: Successful exploitation allows attackers to bypass LLM safety filters, leading to the generation of harmful content, including instructions for illegal activities, hate speech, or personally identifiable information (PII) leaks. This compromises the integrity and safety of LLM-powered applications.
Affected Systems: All LLMs susceptible to prompt injection attacks are potentially affected; vulnerability severity varies across different models based on their training data and safety mechanisms. The research specifically highlights GPT-4, GPT-4o, Llama3-70B, Claude2.0, and Claude3-Opus as vulnerable.
Mitigation Steps:
- Improved Training Data: Expand training datasets to include a wider variety of text structures and unusual inputs, improving model robustness and generalization to unforeseen prompts.
- Enhanced Input Sanitization: Implement more robust mechanisms for sanitizing and validating user inputs, detecting and rejecting potentially malicious structures embedded in prompts at both the structural level of the input and the semantic level of the instructions (see the detection sketch after this list).
- Structure-Aware Safety Mechanisms: Develop safety mechanisms that specifically address the risks of complex or uncommon text structures, rather than relying solely on content filtering.
- Adversarial Training: Use adversarial training techniques to improve model resilience against UTES-based jailbreak attacks (a data-construction sketch follows the detection example below).
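One way to act on the sanitization and structure-aware points above is to screen inputs for embedded structured encodings before they reach the model. The sketch below is a minimal illustration, not the paper's defense: the marker patterns and the flag_structured_input helper are assumptions chosen for readability, and a real deployment would route flagged prompts to stricter semantic review rather than rejecting them outright, since structured input is often legitimate.

```python
import json
import re

# Markers that commonly indicate an embedded text-encoded structure (assumed list,
# not exhaustive): JSON-like key/value syntax, LaTeX environments, tree drawings.
STRUCTURE_PATTERNS = {
    "json": re.compile(r"\{[^{}]*\"[^\"]+\"\s*:"),
    "latex": re.compile(r"\\begin\{[a-zA-Z*]+\}"),
    "tree": re.compile(r"^[\s|`+\-]*(├──|└──|\|--)", re.MULTILINE),
}

def flag_structured_input(prompt: str) -> list[str]:
    """Return the names of structural encodings detected in a user prompt."""
    hits = [name for name, pattern in STRUCTURE_PATTERNS.items() if pattern.search(prompt)]
    # Also treat a prompt that parses cleanly as a JSON document as structured.
    try:
        json.loads(prompt)
        if "json" not in hits:
            hits.append("json")
    except ValueError:
        pass
    return hits

if __name__ == "__main__":
    # A non-empty result would trigger additional semantic checks on the prompt.
    print(flag_structured_input('{"recipe": {"steps": ["<step 1>"]}}'))  # ['json']
```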
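For the training-side mitigations (improved training data and adversarial training), a common pattern is to wrap known-disallowed requests in the same kinds of structural templates and pair them with refusal targets, so that fine-tuning teaches the model to apply its policy regardless of how the request is encoded. The sketch below is again an assumption-laden illustration, not the paper's procedure: the two wrapper templates, the refusal text, and the placeholder request are all hypothetical.

```python
import json

# Hypothetical helpers and data: these templates are illustrative stand-ins,
# not the twelve UTES templates described in the paper.
REFUSAL = "I can't help with that request."

def wrap_in_json_template(request: str) -> str:
    """Embed a request inside a JSON structure, mimicking a UTES-style prompt."""
    return json.dumps({"task": "complete the recipe",
                       "recipe": {"title": request, "steps": []}})

def wrap_in_tree_template(request: str) -> str:
    """Embed a request as the root of a tree the model is asked to expand."""
    return f"Complete this tree:\n└── {request}\n    ├── <step 1>\n    └── <step 2>"

def build_safety_pairs(disallowed_requests: list[str]) -> list[dict[str, str]]:
    """Produce (structured prompt -> refusal) pairs for safety fine-tuning."""
    pairs = []
    for request in disallowed_requests:
        for wrap in (wrap_in_json_template, wrap_in_tree_template):
            pairs.append({"prompt": wrap(request), "completion": REFUSAL})
    return pairs

if __name__ == "__main__":
    # "[PLACEHOLDER DISALLOWED REQUEST]" stands in for items drawn from an existing
    # red-team dataset; no real harmful content is needed to build the pairs.
    print(build_safety_pairs(["[PLACEHOLDER DISALLOWED REQUEST]"])[0]["prompt"])
```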