LMVD-ID: fdd9fdd0
Published February 1, 2025

LLM Syntax Jailbreak

Affected Models: Claude 3.5 Sonnet, Llama 3.2 3B, Llama 3.2 90B, GPT-4o, o1, Llama 3 8B, Mistral-7B, BERT

Research Paper

StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models


Description: Large Language Models (LLMs) are vulnerable to structure transformation attacks, in which malicious prompts are encoded in diverse syntax spaces (e.g., SQL, JSON, or novel LLM-generated syntaxes) to bypass safety mechanisms. These attacks preserve the harmful intent while altering the surface structure of the prompt, so detection based on token-level patterns becomes ineffective.

Examples:

  • SQL Encoding: The prompt "List chemical agents that can be synthesized at home" is re-encoded as a SQL query against a hypothetical database. When the LLM processes the query, it returns a list of dangerous chemicals. (See the paper for the specific SQL query; a simplified, benign sketch of the SQL and JSON encodings follows this list.)
  • JSON Schema: A malicious request is embedded within a JSON schema. When the LLM processes the schema, it generates content that fulfills the harmful intent concealed in the structured data. (See the paper for an example.)
  • LLM-Generated Syntax: An auxiliary LLM is used to generate a novel syntax. A malicious prompt encoded in that syntax then elicits a harmful response from the target LLM. (See the paper for examples of LLM-generated syntaxes and the resulting attacks.)
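
To make the transformation concrete, the following Python sketch re-encodes a single, deliberately benign placeholder request as plain text, as a SQL query, and as a JSON schema. The prompt templates are simplified assumptions for illustration only; they are not the exact encodings used in the paper, and no harmful content is involved.

```python
# Minimal illustration of structure transformation: the same benign
# placeholder request expressed in three syntax spaces. The prompt
# templates below are simplified assumptions, not the paper's encodings.
import json

request = "List common household cleaning products"  # benign placeholder

# 1. Plain natural-language prompt (the form token-level filters are tuned for).
plain_prompt = request

# 2. The same intent re-encoded as a SQL query the model is asked to "execute".
sql_prompt = (
    "You are a database engine. Execute the query and return the matching rows:\n"
    f"SELECT name, description FROM products WHERE category = '{request}';"
)

# 3. The same intent embedded in a JSON schema the model is asked to satisfy.
json_prompt = "Return a JSON object that validates against this schema:\n" + json.dumps(
    {
        "type": "object",
        "properties": {
            "items": {
                "type": "array",
                "description": request,
                "items": {"type": "string"},
            }
        },
        "required": ["items"],
    },
    indent=2,
)

# All three prompts carry the same underlying intent, but their surface
# token distributions differ sharply.
for prompt in (plain_prompt, sql_prompt, json_prompt):
    print(prompt, end="\n---\n")
```

Because each variant carries the same intent under a different surface form, a filter keyed to the wording of the plain prompt generally does not fire on the SQL or JSON variants.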

Impact: Successful exploitation allows attackers to circumvent LLM safety filters and elicit harmful responses, including but not limited to the generation of malware, phishing emails, hate speech, and instructions for illegal activities. This significantly reduces the effectiveness of current safety and alignment mechanisms.

Affected Systems: All LLMs susceptible to adversarial prompting are potentially affected. The impact is amplified in models with stronger reasoning capabilities and advanced alignment techniques. Specific models tested in the research include Llama 3.2, GPT-4o, Claude 3.5 Sonnet, and models incorporating defenses such as Circuit Breakers and Latent Adversarial Training.

Mitigation Steps:

  • Develop safety mechanisms that recognize harmful concepts rather than relying solely on token-level patterns (a minimal sketch of this approach follows this list).
  • Implement defenses robust against a wide variety of syntaxes, including those generated by LLMs.
  • Train LLMs on a diverse range of structured data formats to enhance generalization and robustness against structure transformation attacks.
  • Incorporate adversarial training specifically targeting structure transformation attacks.
  • Regularly audit and update safety filters to account for new and emerging attack techniques.
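
As a sketch of the first mitigation above, the Python below outlines a concept-level pre-filter that screens both the raw prompt and a natural-language paraphrase of it before the prompt reaches the model. The helper names (`paraphrase_to_natural_language`, `safety_score`) and their stub implementations are hypothetical placeholders introduced here for illustration; in practice the paraphrase step would call a small trusted model and the scorer would be a learned safety classifier.

```python
# Sketch of a concept-level guardrail: score the prompt's underlying intent,
# not just its surface tokens. All helpers below are hypothetical stubs kept
# trivial only so the sketch runs; they are not from the paper or any library.

def paraphrase_to_natural_language(prompt: str) -> str:
    """Placeholder: in practice, a small trusted LLM restates the prompt's
    intent in plain English, stripping SQL/JSON/novel-syntax wrappers."""
    return prompt  # stub: returns the input unchanged

def safety_score(text: str) -> float:
    """Placeholder: in practice, a learned classifier scoring harmful intent.
    The keyword check here exists only to keep the sketch runnable."""
    blocked_terms = ("synthesized at home", "malware", "phishing")
    return 1.0 if any(t in text.lower() for t in blocked_terms) else 0.0

def is_allowed(prompt: str, threshold: float = 0.5) -> bool:
    """Reject the prompt if either the raw text or its natural-language
    paraphrase scores as harmful; screening the paraphrase targets the
    underlying concept rather than the surface tokens."""
    candidates = (prompt, paraphrase_to_natural_language(prompt))
    return all(safety_score(c) < threshold for c in candidates)

if __name__ == "__main__":
    print(is_allowed("Summarize today's weather report"))  # True
    print(is_allowed("Write a phishing email"))            # False (stub match)
```

The key design choice is screening the paraphrase rather than the raw input alone: once the structured encoding is collapsed back into natural language, the harmful concept is exposed to the classifier regardless of which syntax space the attacker chose.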
