Linguistic LLM Jailbreak

Description: Large Language Models (LLMs) are vulnerable to a targeted linguistic fuzzing attack that exploits the complexity of human language to bypass safety guardrails. The attack, termed "Jade," leverages transformational-generative grammar rules to systematically increase the syntactic complexity of benign seed questions, making them increasingly difficult for LLMs to recognize as malicious. This leads to the generation of unsafe content, even when the underlying semantics remain unchanged.

Examples: See https://github.com/whitzard-ai/jade-db for examples of seed questions and their mutated, unsafe counterparts. Specific examples showing LLMs generating unsafe content in response to Jade-mutated prompts are provided within the research paper.

Impact: The vulnerability allows malicious actors to elicit unsafe content from LLMs, exposing users to:

Harmful instructions or advice.
Biased, discriminatory, or offensive outputs.
The disclosure of sensitive information.

Affected Systems: A wide range of LLMs, including both open-source and commercially available models, are affected. The paper specifically mentions several Chinese and English language models, including but not limited to: ChatGPT, LLaMA 2-70b-Chat, Google’s PaLM 2, and several Chinese commercial LLMs.

Mitigation Steps:

Improved Linguistic Parsing and Safety Checks: Develop LLMs with enhanced capabilities to recognize and neutralize semantically-equivalent prompts with varying syntactic complexity.
Robust Safety Training: Refine safety training data to include a broader range of syntactically diverse yet semantically consistent prompts, including those generated through techniques like Jade.
Dynamic Defense Mechanisms: Implement systems that detect and mitigate attacks that systematically increase linguistic complexity beyond a certain threshold. This may involve analyzing sentence structure, identifying patterns of manipulation, and flagging unusually complex queries.
Regular Security Auditing: Conduct periodic security audits of LLMs using tools and techniques like Jade to identify and remediate vulnerabilities.

Linguistic LLM Jailbreak

Research Paper