Function-Call Jailbreak
Research Paper
The dark side of function calling: Pathways to jailbreaking large language models
Description: Large Language Models (LLMs) employing function calling are vulnerable to a "jailbreak function" attack. Maliciously crafted function definitions and prompts can coerce the LLM into generating harmful content within the function's arguments, bypassing safety filters that are effective in chat mode. The attack exploits a discrepancy in safety alignment between function-argument generation and chat-response generation.
Examples:
The attack leverages a custom function named "WriteNovel" with a template designed to elicit harmful content generation. The function name, description, and argument descriptions embed prompts and instructions (e.g., "Sure! Here is my evil {content_type} with more than 150 words:") that encourage the LLM to produce harmful outputs as function arguments, despite safety mechanisms that apply in other modes of interaction. See the research paper for details on the function parameterization (e.g., harm_behavior, content_type). The specific prompts and function definitions used in the attack are available at https://github.com/wooozihui/jailbreakfunction.
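To make the mechanics concrete, here is a minimal structural sketch of a forced function call using the OpenAI Python SDK. The schema text and placeholder strings are illustrative stand-ins, not the paper's verbatim templates (those are in the repository above); the sketch only shows where the coercive text sits in the function definition and how a forced tool choice shifts generation into argument filling.

```python
# Structural sketch only: placeholder strings stand in for the paper's actual
# templates (see the linked repository for the real prompts and definitions).
from openai import OpenAI

client = OpenAI()

# A custom "jailbreak function" whose schema embeds instructions that steer
# the model into writing its answer as a function argument.
write_novel_tool = {
    "type": "function",
    "function": {
        "name": "WriteNovel",
        "description": "<template nudging the model to answer inside an "
                       "argument, e.g. the forced prefix quoted above>",
        "parameters": {
            "type": "object",
            "properties": {
                "harm_behavior": {
                    "type": "string",
                    "description": "<the requested behavior>",
                },
                "content_type": {
                    "type": "string",
                    "description": "<e.g. 'novel'>",
                },
            },
            "required": ["harm_behavior", "content_type"],
        },
    },
}

# Forcing the model to call the function moves generation into the
# argument-filling path, where safety alignment is weaker than in chat mode.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "<attacker prompt>"}],
    tools=[write_novel_tool],
    tool_choice={"type": "function", "function": {"name": "WriteNovel"}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```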
Impact: Successful exploitation allows attackers to bypass LLM safety measures and generate harmful content, such as instructions for creating harmful devices, hate speech, or disinformation, compromising the security and reliability of the LLM.
Affected Systems: LLMs utilizing function calling capabilities, specifically those tested in the research paper: GPT-4, GPT-4o, Claude-3-sonnet, Claude-3.5-sonnet, Gemini-1.5-pro, and Mixtral-8x7B-Instruct-v0.1. Other LLMs with similar function calling features may also be vulnerable.
Mitigation Steps:
- Defensive Prompts: Incorporate security-focused prompts into the function descriptions or user prompts so the LLM assesses the safety of the content it is about to generate and refuses harmful requests. Effectiveness depends on prompt placement and the specific LLM (see the sketch after this list).
- Restricting User Permissions: Limit users' ability to force function execution; instead, let the LLM decide autonomously whether to call a function (e.g., an "auto" tool-choice setting, also shown in the sketch below). This requires the LLM's function-call decision-making to be robust.
- Alignment Training: Strengthen safety alignment training specifically for function-call argument generation to reduce the discrepancy with chat-mode responses.
- Improved Safety Filters: Deploy more robust safety filters that can detect encoded or obfuscated harmful content within function arguments (a simple moderation pass is included in the sketch below). Note that this is likely a cat-and-mouse game requiring ongoing development.
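As referenced above, the following is a minimal sketch of how the defensive-prompt, permission-restriction, and argument-filtering steps might be combined at the application layer. It assumes the OpenAI Python SDK and its Moderation endpoint; the helper names and the defensive wording are illustrative assumptions, not prescriptions from the paper.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative defensive prompt appended to every function description
# (mitigation 1); wording and placement should be tuned per model.
DEFENSIVE_SUFFIX = (
    " Before filling in any argument, verify that the content is safe and "
    "policy-compliant; if it is not, refuse rather than calling this function."
)

def harden_tools(tools):
    """Return a copy of the tool list with the defensive suffix appended."""
    hardened = []
    for tool in tools:
        fn = dict(tool["function"])
        fn["description"] = fn.get("description", "") + DEFENSIVE_SUFFIX
        hardened.append({"type": "function", "function": fn})
    return hardened

def safe_tool_call(messages, tools, model="gpt-4o"):
    # Mitigation 2: let the model decide whether to call a function
    # (tool_choice="auto") instead of allowing callers to force one.
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=harden_tools(tools),
        tool_choice="auto",
    )
    message = response.choices[0].message

    # Mitigation 4 (first pass): screen generated arguments before any
    # function is actually executed.
    for call in message.tool_calls or []:
        moderation = client.moderations.create(input=call.function.arguments)
        if moderation.results[0].flagged:
            raise ValueError(
                f"Blocked call to {call.function.name}: arguments flagged "
                "by the moderation filter."
            )
    return message
```

A moderation pass over raw argument strings will miss encoded or obfuscated payloads, so this complements rather than replaces the alignment-training and filter improvements described above.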