SMILES-Prompting LLM Jailbreak
Research Paper: SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
Description: Large Language Models (LLMs) used in chemical synthesis applications are vulnerable to a novel attack vector, dubbed "SMILES-prompting," which leverages the Simplified Molecular-Input Line-Entry System (SMILES) notation to bypass safety mechanisms and elicit instructions for synthesizing hazardous substances. The attack exploits the fact that safety mechanisms fail to recognize SMILES strings representing dangerous chemicals: a request phrased in SMILES can elicit synthesis procedures that the same request phrased with the substance's name would not.
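The mechanism can be illustrated with a deliberately benign molecule. The sketch below assumes Python with RDKit installed; the name blocklist, prompt text, and SMILES string are illustrative assumptions, not material from the paper or repository.

```python
# Illustration only (aspirin as a benign stand-in): the SMILES string never
# mentions the substance's name, so a hypothetical name-keyed filter has
# nothing to match, yet the string fully specifies the molecule.
from rdkit import Chem

name_blocklist = {"aspirin", "acetylsalicylic acid"}   # hypothetical name-based filter
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"                     # aspirin, written as SMILES

prompt = f"Explain how to synthesize the molecule {smiles}."

# The name filter sees nothing to block...
assert not any(term in prompt.lower() for term in name_blocklist)

# ...but the string still identifies the exact structure.
mol = Chem.MolFromSmiles(smiles)
print(Chem.MolToSmiles(mol))  # prints a canonical SMILES for the same structure
```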
Examples: See https://github.com/IDEA-XL/ChemSafety. The repository contains example SMILES strings and prompts used to successfully elicit synthesis instructions for hazardous materials from multiple LLMs. Specific examples include prompts that successfully obtain synthesis information for explosives, drugs, and chemical weapons using the SMILES notation of the target substance, even when direct prompts using the substance's name are blocked.
Impact: Successful exploitation of this vulnerability could result in the disclosure of instructions for synthesizing hazardous chemicals, potentially leading to the production of explosives, illicit drugs, or chemical weapons. This poses a significant risk to public safety and national security.
Affected Systems: LLMs employed in chemical synthesis applications or any application where SMILES notation is processed are affected. Specific LLMs exhibiting vulnerability include, but are not limited to, GPT-4o and Llama-3-70B-Instruct. The vulnerability is likely present in other LLMs with similar capabilities.
Mitigation Steps:
- Implement robust input sanitization and validation that specifically targets SMILES strings, including checks against known hazardous chemical structures (a sketch combining this and the next two steps follows the list).
- Integrate a knowledge base of hazardous SMILES strings and associated synthesis pathways into the LLM's response generation process, allowing it to proactively identify and reject dangerous queries.
- Develop a mechanism to translate SMILES notation into a standardized, safer internal representation before processing by the LLM.
- Train the LLM on a dataset that includes a large number of examples of both legitimate and malicious SMILES strings and their associated responses, enhancing its ability to discern safe from dangerous uses.
- Explicitly prohibit the model from providing synthesis instructions, regardless of the input format. This may limit legitimate uses of the model, trading functionality for security.
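The following sketch shows one way the first three mitigations could fit together: extract candidate SMILES tokens from a prompt, normalize each to a name-independent identifier, and reject the prompt if it matches a curated blocklist of hazardous structures. It assumes Python with RDKit built with InChI support; the blocklist contents, regular expression, and function name are hypothetical, not taken from the paper or any existing tool.

```python
# Sketch of input screening for SMILES-prompting (assumes RDKit with InChI support).
import re
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence parse warnings for non-SMILES tokens

# Hypothetical knowledge base of blocked structures, keyed by InChIKey rather
# than by name, so obfuscated or non-canonical SMILES still match.
HAZARDOUS_INCHIKEYS = {
    "XXXXXXXXXXXXXX-XXXXXXXXXX-X",  # placeholder entry, not a real InChIKey
}

# Loose pattern for pulling SMILES-like tokens out of free text; a production
# system would need a more careful extractor.
SMILES_TOKEN = re.compile(r"[A-Za-z0-9@+\-\[\]\(\)=#$/\\%.]{6,}")

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt may proceed, False if it references a blocked structure."""
    for token in SMILES_TOKEN.findall(prompt):
        mol = Chem.MolFromSmiles(token)
        if mol is None:
            continue                     # token is not valid SMILES; ignore it
        if Chem.MolToInchiKey(mol) in HAZARDOUS_INCHIKEYS:
            return False                 # standardized representation hits the blocklist
    return True

print(screen_prompt("Explain how to synthesize CC(=O)Oc1ccccc1C(=O)O"))  # True: not blocked
```

InChIKey is used as the internal representation here because it is canonical and name-independent, so differently written SMILES for the same structure map to the same blocklist entry; a deployed system would pair such a screen with the model-level refusal policies described above.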