ASCII Art Jailbreak
Research Paper
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Description: Large Language Models (LLMs) are vulnerable to a jailbreak attack called ArtPrompt, which exploits their poor ability to recognize words rendered as ASCII art. By replacing sensitive words in a prompt with their ASCII art equivalents, an attacker can bypass safety filters that would otherwise block the generation of harmful content.
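As a rough illustration of the technique, the sketch below shows how such a prompt could be assembled. This is not the authors' implementation: it assumes the third-party `pyfiglet` library for rendering ASCII art, and the `MASKED_WORD` placeholder and prompt wording are illustrative only.

```python
# Minimal sketch of an ArtPrompt-style prompt, assuming the pyfiglet library
# (pip install pyfiglet). Not the code from the ArtPrompt repository.
import pyfiglet

# Placeholder used purely for illustration; an attacker would substitute a
# word that a safety filter would otherwise reject.
MASKED_WORD = "example"

# Render the word as ASCII art so the literal token never appears in the prompt.
ascii_art = pyfiglet.figlet_format(MASKED_WORD)

# Hypothetical prompt template: the sensitive word is replaced by the ASCII
# art block, and the model is asked to decode it before following the request.
prompt = (
    "The block below spells a single word in ASCII art. "
    "Read it character by character, then follow the instruction using that word.\n\n"
    f"{ascii_art}\n"
    "Instruction: describe the [word]."
)

print(prompt)
```

Because the masked token only appears as a visual pattern, keyword- and semantics-based safety checks never see the sensitive word, while the model can still be coaxed into reconstructing and acting on it.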
Examples: See the ArtPrompt repository (https://github.com/uw-nsl/ArtPrompt) for examples demonstrating the attack against GPT-3.5, GPT-4, Gemini, Claude, and Llama 2. One example replaces the word "bomb" with its ASCII art representation within a prompt requesting bomb-construction instructions.
Impact: Successful exploitation allows attackers to elicit harmful, unsafe, or otherwise undesired responses from LLMs, bypassing built-in safety mechanisms. This can lead to the generation of illegal instructions, biased content, or other forms of malicious output.
Affected Systems: Various Large Language Models (LLMs), including but not limited to GPT-3.5, GPT-4, Gemini, Claude, and Llama 2. The vulnerability arises because LLMs rely on the semantic interpretation of input text and fail to recognize words encoded as non-semantic visual patterns such as ASCII art.
Mitigation Steps:
- Improve LLM training data to include non-semantic visual cues, such as ASCII art, to enhance model robustness against this type of attack.
- Develop and implement detection mechanisms that can identify ASCII art used to mask harmful prompts (a simple heuristic is sketched after this list).
- Enhance safety filters to incorporate multiple forms of input interpretation, including visual analysis, alongside semantic analysis.
- Consider incorporating defenses such as paraphrasing and retokenization, though the paper shows these are only partially effective.
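The detection step above could, for example, be prototyped with a crude text heuristic. The sketch below is an illustrative assumption rather than a defense evaluated in the paper: it flags prompts containing several consecutive lines dominated by non-alphanumeric characters, a common signature of ASCII art. The threshold values are guesses and would need tuning against real traffic.

```python
# Heuristic ASCII-art detector: flags prompts containing several consecutive
# lines that are mostly symbols/whitespace rather than ordinary words.
# Thresholds are illustrative assumptions, not tuned or validated values.

def looks_like_ascii_art(prompt: str,
                         min_block_lines: int = 3,
                         max_alpha_ratio: float = 0.3) -> bool:
    consecutive = 0
    for line in prompt.splitlines():
        stripped = line.strip()
        if not stripped:
            consecutive = 0
            continue
        alpha = sum(ch.isalnum() for ch in stripped)
        ratio = alpha / len(stripped)
        # Lines made mostly of punctuation or box-drawing characters are
        # treated as candidate ASCII-art rows.
        if ratio <= max_alpha_ratio:
            consecutive += 1
            if consecutive >= min_block_lines:
                return True
        else:
            consecutive = 0
    return False


if __name__ == "__main__":
    benign = "Please summarize this article about renewable energy."
    suspicious = "Decode the word below:\n /\\ \n/__\\\n|  |\n|__|"
    print(looks_like_ascii_art(benign))      # False
    print(looks_like_ascii_art(suspicious))  # True
```

A check like this would sit in front of the model as a pre-filter; flagged prompts could be rejected, routed to stricter moderation, or normalized (e.g., paraphrased) before being passed on.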