Symbolic Math Jailbreak
Research Paper
Jailbreaking Large Language Models with Symbolic Mathematics
Description: Large Language Models (LLMs) are vulnerable to a jailbreaking attack, termed "MathPrompt," which exploits their ability to process symbolic mathematics to bypass built-in safety mechanisms. The attack encodes harmful natural language prompts as mathematically formulated problems, causing the LLM to generate unsafe output while ostensibly solving a math problem.
Examples: See the paper "Jailbreaking Large Language Models with Symbolic Mathematics," in particular Appendix B and Appendix C, for detailed examples of MathPrompt attacks and their corresponding unsafe outputs.
Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLMs' safety features, leading to the generation of unsafe content such as instructions for illegal activities, hate speech, or the creation of fraudulent documents. The average attack success rate across 13 state-of-the-art LLMs was 73.6%.
Affected Systems: The vulnerability affects a wide range of LLMs, including but not limited to those from OpenAI (GPT-4o, GPT-4o mini, GPT-4 Turbo, GPT-4-0613), Anthropic (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku), Google (Gemini 1.5 Pro, Gemini 1.5 Flash), and Meta AI (Llama 3.1 70B). The vulnerability's impact may vary depending on the specific LLM and its safety mechanisms.
Mitigation Steps:
- Implement input sanitization and validation that goes beyond basic natural language filtering and accounts for mathematically encoded prompts. This may involve identifying and analyzing mathematical expressions within the input to detect malicious intent (see the first sketch after this list).
- Develop and deploy robust detection mechanisms specifically designed to identify and flag mathematically encoded prompts that exhibit characteristics consistent with harmful intent. This could involve specialized classifiers trained to recognize the semantically shifted embeddings produced by the encoding, as observed in this research (see the second sketch after this list).
- Enhance LLM safety training so that it generalizes to inputs encoded in mathematical representations. This requires developing new training methodologies and datasets focused on such adversarial inputs.
- Regularly red-team LLMs using diverse and sophisticated jailbreaking techniques, including mathematical encoding, to identify and address vulnerabilities proactively. This process should combine automated checks with human-in-the-loop verification (see the final sketch after this list).
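As a starting point for the first two mitigations, below is a minimal sketch of a pre-filter that scores an incoming prompt for dense mathematical notation (set-builder syntax, logical quantifiers, LaTeX commands) combined with "solve it and translate back into real-world steps" framing. The patterns, weights, and threshold are illustrative assumptions, not values taken from the paper.

```python
import re

# Illustrative indicators of a mathematically encoded prompt. These patterns
# and weights are assumptions for demonstration, not values from the paper.
MATH_PATTERNS = [
    r"\\(?:forall|exists|in|subseteq|cup|cap|land|lor|neg)",       # LaTeX logic/set commands
    r"[∀∃∈⊆∪∩∧∨¬]",                                                # Unicode logic/set symbols
    r"\blet\s+[A-Z]\s*=\s*\{",                                      # set definitions, e.g. "Let S = { ... }"
    r"\b(?:group|ring|field|bijection|homomorphism|injective)\b",   # abstract-algebra vocabulary
]
REALIZATION_PATTERNS = [
    r"\bsolve\b.*\breal[- ]world\b",
    r"\bprovide\b.*\bexample\b",
    r"\btranslate\b.*\bback\b",
]


def math_encoding_score(prompt: str) -> float:
    """Return a rough 0-1 score for how strongly a prompt looks math-encoded."""
    math_hits = sum(bool(re.search(p, prompt, re.IGNORECASE)) for p in MATH_PATTERNS)
    realize_hits = sum(bool(re.search(p, prompt, re.IGNORECASE)) for p in REALIZATION_PATTERNS)
    # Require both mathematical structure and "carry it out in practice" framing.
    return min(1.0, 0.2 * math_hits + 0.3 * realize_hits)


def needs_review(prompt: str, threshold: float = 0.5) -> bool:
    """Route prompts above the threshold to a stricter safety check."""
    return math_encoding_score(prompt) >= threshold
```

Such a filter should only route suspicious prompts to heavier analysis (a safety classifier or human review) rather than block them outright, since legitimate mathematics questions match the same surface patterns.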
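The research observes that mathematical encoding shifts prompt embeddings away from their natural-language originals. One hedged way to use that observation defensively is to compare incoming prompts against a curated set of known math-encoded jailbreak exemplars; the embedding model and threshold below are assumptions, and the exemplar corpus is a placeholder you would populate from your own red-team findings.

```python
from sentence_transformers import SentenceTransformer, util

# Any general-purpose sentence-embedding model works; this choice is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus of math-encoded jailbreak attempts gathered during red-teaming.
KNOWN_ENCODED_EXEMPLARS = [
    "<math-encoded jailbreak exemplar 1>",
    "<math-encoded jailbreak exemplar 2>",
]
exemplar_embeddings = model.encode(KNOWN_ENCODED_EXEMPLARS, convert_to_tensor=True)


def similarity_to_known_attacks(prompt: str) -> float:
    """Maximum cosine similarity between the prompt and known encoded exemplars."""
    prompt_embedding = model.encode(prompt, convert_to_tensor=True)
    return float(util.cos_sim(prompt_embedding, exemplar_embeddings).max())


def flag_prompt(prompt: str, threshold: float = 0.75) -> bool:
    """Threshold is illustrative; tune it on held-out benign and red-team data."""
    return similarity_to_known_attacks(prompt) >= threshold
```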
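Finally, a minimal red-teaming loop in the spirit of the last mitigation, sketched against the OpenAI Python client as an example target. The model name, placeholder test cases, and refusal markers are assumptions; any response that is not clearly a refusal should go to human review.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder math-encoded red-team cases; draw these from your own corpus.
ENCODED_TEST_CASES = [
    "<math-encoded red-team case 1>",
    "<math-encoded red-team case 2>",
]

# Crude refusal heuristics; a production harness would pair string matching
# with a safety classifier and human review.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def run_red_team(model: str = "gpt-4o-mini") -> list[dict]:
    results = []
    for case in ENCODED_TEST_CASES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case}],
        )
        text = response.choices[0].message.content or ""
        refused = any(marker in text.lower() for marker in REFUSAL_MARKERS)
        results.append({"case": case, "refused": refused, "response": text})
    return results


if __name__ == "__main__":
    for result in run_red_team():
        status = "refused" if result["refused"] else "NEEDS HUMAN REVIEW"
        print(f"{status}: {result['case']}")
```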