ASCII Art Jailbreak
Research Paper
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Description: Large Language Models (LLMs) are vulnerable to a jailbreak attack called ArtPrompt, which exploits their poor ability to recognize words rendered as ASCII art. By replacing sensitive words in a prompt with their ASCII art equivalents, an attacker can bypass safety filters that would otherwise block the generation of harmful content.
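As a rough illustration of the technique, the sketch below shows how such a prompt could be assembled. This is not the authors' implementation: it assumes the third-party `pyfiglet` library for rendering ASCII art, and the `MASKED_WORD` placeholder and prompt wording are illustrative only.

```python
# Minimal sketch of an ArtPrompt-style prompt, assuming the pyfiglet library
# (pip install pyfiglet). Not the code from the ArtPrompt repository.
import pyfiglet

# Placeholder used purely for illustration; an attacker would substitute a
# word that a safety filter would otherwise reject.
MASKED_WORD = "example"

# Render the word as ASCII art so the literal token never appears in the prompt.
ascii_art = pyfiglet.figlet_format(MASKED_WORD)

# Hypothetical prompt template: the sensitive word is replaced by the ASCII
# art block, and the model is asked to decode it before following the request.
prompt = (
    "The block below spells a single word in ASCII art. "
    "Read it character by character, then follow the instruction using that word.\n\n"
    f"{ascii_art}\n"
    "Instruction: describe the [word]."
)

print(prompt)
```

Because the masked token only appears as a visual pattern, keyword- and semantics-based safety checks never see the sensitive word, while the model can still be coaxed into reconstructing and acting on it.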
Examples: See the ArtPrompt repository (https://github.com/uw-nsl/ArtPrompt) for examples demonstrating the attack against GPT-3.5, GPT-4, Gemini, Claude, and Llama 2. One example replaces the word "bomb" with its ASCII art representation within a prompt requesting bomb-construction instructions.
Impact: Successful exploitation allows attackers to elicit harmful, unsafe, or otherwise undesired responses from LLMs, bypassing built-in safety mechanisms. This can lead to the generation of illegal instructions, biased content, or other forms of malicious output.
Affected Systems: Various Large Language Models (LLMs), including but not limited to GPT-3.5, GPT-4, Gemini, Claude, and Llama 2. The vulnerability arises because LLMs rely on the semantic interpretation of input text and fail to recognize words encoded as non-semantic visual patterns such as ASCII art.
Mitigation Steps:
- Improve LLM training data to include non-semantic visual cues, such as ASCII art, to enhance model robustness against this type of attack.
- Develop and implement detection mechanisms that can identify ASCII art used to mask harmful prompts (a simple heuristic is sketched after this list).
- Enhance safety filters to incorporate multiple forms of input interpretation, including visual analysis, alongside semantic analysis.
- Consider incorporating defenses such as paraphrasing and retokenization, though the paper shows these are only partially effective.
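The detection step above could, for example, be prototyped with a crude text heuristic. The sketch below is an illustrative assumption rather than a defense evaluated in the paper: it flags prompts containing several consecutive lines dominated by non-alphanumeric characters, a common signature of ASCII art. The threshold values are guesses and would need tuning against real traffic.

```python
# Heuristic ASCII-art detector: flags prompts containing several consecutive
# lines that are mostly symbols/whitespace rather than ordinary words.
# Thresholds are illustrative assumptions, not tuned or validated values.

def looks_like_ascii_art(prompt: str,
                         min_block_lines: int = 3,
                         max_alpha_ratio: float = 0.3) -> bool:
    consecutive = 0
    for line in prompt.splitlines():
        stripped = line.strip()
        if not stripped:
            consecutive = 0
            continue
        alpha = sum(ch.isalnum() for ch in stripped)
        ratio = alpha / len(stripped)
        # Lines made mostly of punctuation or box-drawing characters are
        # treated as candidate ASCII-art rows.
        if ratio <= max_alpha_ratio:
            consecutive += 1
            if consecutive >= min_block_lines:
                return True
        else:
            consecutive = 0
    return False


if __name__ == "__main__":
    benign = "Please summarize this article about renewable energy."
    suspicious = "Decode the word below:\n /\\ \n/__\\\n|  |\n|__|"
    print(looks_like_ascii_art(benign))      # False
    print(looks_like_ascii_art(suspicious))  # True
```

A check like this would sit in front of the model as a pre-filter; flagged prompts could be rejected, routed to stricter moderation, or normalized (e.g., paraphrased) before being passed on.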