AutoBreach: Wordplay-Guided Jailbreak

Description: AutoBreach exploits the vulnerability of Large Language Models (LLMs) to wordplay-based adversarial prompts. By leveraging an LLM to generate diverse wordplay mapping rules and employing a two-stage optimization strategy, AutoBreach crafts prompts that bypass LLM safety mechanisms and elicit harmful or unintended responses, even without modifying system prompts. The vulnerability lies in the LLM's susceptibility to semantic manipulation through cleverly disguised inputs.

Examples: See the paper for examples of successful jailbreaks against various LLMs (Claude-3, GPT-3.5, GPT-4 Turbo, Bing Chat, GPT-4 Web) using AutoBreach's generated adversarial prompts. Specific examples are provided in Figures 3 and 4 of the paper.

Impact: Successful exploitation of this vulnerability allows an attacker to bypass LLM safety filters and elicit responses that violate ethical, legal, or safety guidelines. This can lead to the generation of harmful content, including instructions for illegal activities, biased or discriminatory statements, and the dissemination of misinformation.

Affected Systems: Various LLMs, including but not limited to Claude-3, GPT-3.5, GPT-4 Turbo, and LLMs accessible through web interfaces like Bing Chat and GPT-4 Web. The vulnerability is likely present in other LLMs with similar underlying architectures and safety mechanisms.

Mitigation Steps:

Implement more robust safety mechanisms that are less susceptible to semantic manipulation through wordplay.
Enhance prompt filtering techniques to detect and block adversarial prompts created using wordplay techniques.
Develop and deploy more sophisticated detection models capable of identifying malicious intent within prompts, even when disguised through wordplay or other obfuscation techniques.
Regularly audit and update safety mechanisms to address emerging jailbreaking techniques. Consider incorporating techniques similar to those used in AutoBreach to proactively test and identify vulnerabilities.

AutoBreach: Wordplay-Guided Jailbreak

Research Paper