Query Code Jailbreak
Research Paper
Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models
Description: Large Language Models (LLMs) are vulnerable to QueryAttack, a novel jailbreak technique that leverages structured, non-natural query languages (e.g., SQL, URL formats, or other programming-language constructs) to bypass safety alignment mechanisms. The attack translates a malicious natural-language query into one of these structured formats, exploiting the LLM's ability to understand and process such languages without triggering safety filters designed for natural-language prompts. The LLM then responds in natural language, providing the requested (malicious) information.
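In the paper's description, the natural-language request is decomposed into components (the desired content, a qualifier, and a notional data source) that fill slots in a query-language template. The sketch below illustrates that general idea with a SQL-style template and a deliberately benign payload; the function name, slot names, and exact template are illustrative assumptions, not the paper's verbatim prompt.

```python
# Hypothetical sketch of how a QueryAttack-style prompt might be assembled.
# The SELECT/FROM/WHERE slots mirror SQL syntax; the exact template used in
# the paper may differ, and the payload here is intentionally harmless.

def build_structured_query(content: str, modifier: str, source: str) -> str:
    """Re-express a natural-language request as a SQL-like query."""
    return (
        f"SELECT {content} "                 # what the requester wants
        f"FROM {source} "                    # where the knowledge should come from
        f"WHERE condition = '{modifier}';"   # qualifiers on the request
    )

# Benign illustration: a harmless request rendered in the structured format.
prompt = build_structured_query(
    content="steps to bake sourdough bread",
    modifier="for a beginner at home",
    source="cooking knowledge base",
)
print(prompt)
# SELECT steps to bake sourdough bread FROM cooking knowledge base WHERE condition = 'for a beginner at home';
```

The point of the format is that the harmful intent is carried by the slot contents rather than by natural-language phrasing, which is what natural-language safety filters are trained to flag.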
Examples: See the QueryAttack repository: https://github.com/horizonsinzqs/QueryAttack. The repository contains examples that use several programming languages to craft malicious queries which successfully elicit harmful responses from a range of LLMs. Specific examples are included in Appendix A of the linked paper.
Impact: Successful exploitation of this vulnerability allows attackers to circumvent safety restrictions implemented in LLMs, potentially leading to the generation of harmful content such as instructions for creating weapons, misinformation, or other illegal or unethical information.
Affected Systems: A wide range of LLMs is affected, including, but not limited to, GPT-3.5, GPT-4, GPT-4o, O1, Deepseek, Gemini-flash, Gemini-pro, Llama 3.1, Llama 3.2, and Llama 3.3. The vulnerability is not tied to a specific model architecture or parameter count, as demonstrated by successful attacks across models of varying families and sizes.
Mitigation Steps:
- Improved Safety Filters: Develop safety filters that are effective against structured query languages, not just natural language. This requires extending existing safety filters to recognize malicious intent when expressed in non-natural language formats.
- Input Sanitization: Implement input sanitization techniques specifically designed to detect and block potentially harmful structured queries before they reach the core LLM processing (a minimal detection sketch follows this list).
- Cross-lingual Reasoning: Employ cross-lingual chain-of-thought prompting to encourage the model to translate non-natural language queries into natural language before processing, allowing standard safety filters to function effectively. This adds processing steps that increase the likelihood of detecting malicious intent (see the second sketch after this list).
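As a concrete illustration of the input-sanitization idea, the sketch below flags prompts containing query-language constructs (SQL keywords, URL query strings) so they can be routed to additional screening before reaching the model. The patterns, and the decision to block rather than rewrite, are illustrative assumptions rather than a vetted detector.

```python
import re

# Illustrative patterns for query-language constructs; a production filter
# would need a broader, better-validated rule set (assumption).
STRUCTURED_QUERY_PATTERNS = [
    r"\bSELECT\b.+\bFROM\b",      # SQL-style SELECT ... FROM ...
    r"\bINSERT\s+INTO\b",         # SQL INSERT
    r"[?&][\w%]+=[\w%+-]+",       # URL query-string parameters
    r"\bWHERE\b.+=",              # SQL-style WHERE clause
]

def looks_like_structured_query(prompt: str) -> bool:
    """Return True if the prompt contains query-language constructs."""
    return any(
        re.search(p, prompt, re.IGNORECASE | re.DOTALL)
        for p in STRUCTURED_QUERY_PATTERNS
    )

def sanitize(prompt: str) -> str:
    """Route structured-looking prompts to extra screening before the LLM."""
    if looks_like_structured_query(prompt):
        # In practice this might trigger a secondary safety classifier or a
        # human-review queue rather than an outright rejection (assumption).
        raise ValueError("Prompt contains structured query constructs; needs review.")
    return prompt
```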
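The translate-back defense can be sketched as a two-step pipeline: first ask the model (or a separate model) to restate the structured query in plain natural language, then run the standard safety check on that restatement before answering. The `call_llm` and `is_safe` helpers below are placeholders for whatever completion API and safety classifier a deployment actually uses, not real library functions.

```python
# Sketch of a translate-then-check pipeline. call_llm(prompt) -> str and
# is_safe(text) -> bool are assumed to be supplied by the deployment.

TRANSLATE_TEMPLATE = (
    "Restate the following input as a plain English question, "
    "without answering it:\n\n{query}"
)

def answer_with_translation_guard(user_query: str, call_llm, is_safe) -> str:
    # Step 1: have the model rewrite the (possibly structured) query in
    # natural language so that natural-language safety filters apply.
    restated = call_llm(TRANSLATE_TEMPLATE.format(query=user_query))

    # Step 2: run the standard safety check on the natural-language form.
    if not is_safe(restated):
        return "I can't help with that request."

    # Step 3: only then answer the original query.
    return call_llm(user_query)
```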