LLM Distraction Jailbreak
Research Paper
Tastle: Distract large language models for automatic jailbreak attack
Description: Large Language Models (LLMs) are vulnerable to a novel black-box jailbreak attack, termed "Distraction-based Adversarial Prompts" (DAP). DAP leverages the distractibility and over-confidence of LLMs by concealing malicious queries within complex, unrelated prompts. A memory-reframing mechanism further redirects the LLM's attention away from the distracting context and toward the malicious query, causing the model to bypass safety mechanisms and generate harmful or unintended outputs.
Examples: Specific examples of DAP prompts and resulting harmful outputs are withheld due to responsible disclosure practices and the potential for misuse. See arXiv:2405.18540 for details.
Impact: Successful exploitation of this vulnerability allows attackers to circumvent LLM safety features and elicit the generation of harmful content, including but not limited to: instructions for illegal activities, personally identifiable information leaks, biased or offensive statements, and potentially malicious code. The attack is effective against both open-source and proprietary LLMs, demonstrating transferability and scalability.
Affected Systems: A wide range of LLMs is susceptible, including but not limited to ChatGPT (GPT-3.5 and GPT-4), Bard, Claude, LLaMA 2, and Vicuna. The vulnerability arises from inherent characteristics of LLM attention mechanisms and is not tied to a specific model architecture or training dataset.
Mitigation Steps:
- Develop and implement defense mechanisms that are robust to distraction-based attacks. This could include improved attention mechanisms, stronger context awareness during prompt processing, better identification of malicious intent, and more sophisticated filtering of output content.
- Enhance the training of LLMs to improve resistance to adversarial prompts and reduce their susceptibility to distraction.
- Conduct continuous red-teaming and adversarial testing to identify and address potential vulnerabilities. Employ diverse and sophisticated attack techniques, including those that exploit LLMs' susceptibility to distraction.
- Develop advanced detection mechanisms capable of identifying and classifying attacks that strategically incorporate distraction techniques into their prompts (a minimal detection sketch follows below).
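The sketch below is one possible heuristic for the last mitigation, not the method from the paper or an official Promptfoo feature: it splits an incoming prompt into sentences, embeds them, and flags prompts in which any sentence is semantically distant from the rest, a crude proxy for an unrelated request buried in distracting context. The sentence-transformers library, the "all-MiniLM-L6-v2" model, the function name, and the 0.25 threshold are all illustrative assumptions.

```python
# Illustrative distraction-detection heuristic (assumption-laden sketch, not the paper's method).
# Assumes the sentence-transformers package and the "all-MiniLM-L6-v2" model are available;
# the 0.25 threshold is an arbitrary placeholder, not an empirically tuned value.
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def flag_possible_distraction(prompt: str, threshold: float = 0.25) -> bool:
    """Flag prompts where one sentence is semantically distant from the others,
    a rough indicator of an unrelated request embedded in distracting context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", prompt) if s.strip()]
    if len(sentences) < 3:
        return False  # too short to judge topical coherence

    # Unit-normalized embeddings so dot products are cosine similarities.
    embeddings = model.encode(sentences, normalize_embeddings=True)

    for i, emb in enumerate(embeddings):
        # Compare each sentence to the centroid of all the other sentences.
        others = np.delete(embeddings, i, axis=0)
        centroid = others.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        if float(np.dot(emb, centroid)) < threshold:
            return True  # outlier sentence: possible embedded off-topic request
    return False


if __name__ == "__main__":
    benign = "Summarize the plot of a novel. Focus on the main character. Keep it short."
    print(flag_possible_distraction(benign))
```

Coherence heuristics like this will false-positive on legitimately multi-topic prompts, so they should complement, not replace, content-based moderation and output filtering.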