LMVD-ID: 8597cc8c
Published November 1, 2024

Stochastic Monkey Jailbreak

Affected Models: llama 2 7b chat, llama 2 13b chat, llama 3 8b instruct, llama 3.1 8b instruct, mistral 7b instruct v0.2, phi 3 mini 4k instruct, phi 3 small 8k instruct, phi 3 medium 4k instruct, qwen 2 0.5b, qwen 2 1.5b, qwen 2 7b, vicuna 7b v1.5, vicuna 13b v1.5, zephyr 7b beta, gpt-4o

Research Paper

Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Description: Large Language Models (LLMs) that employ safety alignment mechanisms are vulnerable to a bypass attack using simple random augmentations of input prompts. The attack exploits the brittleness of safety alignment to minor, randomly introduced modifications of the input, causing the LLM to generate unsafe outputs despite its safety training. Character-level augmentations (e.g., random insertions and deletions) prove significantly more effective than random string insertions.
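
As a rough illustration of the core technique, the sketch below applies a fixed budget of random character-level edits (insertions and deletions) to a prompt. This is a minimal Python sketch only; the helper name, edit budget, and 50/50 edit mix are illustrative assumptions, not details of the paper's implementation (see the linked repository for that).

    import random
    import string

    def augment_prompt(prompt: str, num_edits: int = 5, seed: int | None = None) -> str:
        """Apply a fixed budget of random character-level edits to a prompt.

        Minimal sketch of the augmentation idea; parameter choices here
        are illustrative, not taken from the paper's code.
        """
        rng = random.Random(seed)
        chars = list(prompt)
        for _ in range(num_edits):
            if rng.random() < 0.5 and len(chars) > 1:
                # Randomly delete one character.
                del chars[rng.randrange(len(chars))]
            else:
                # Insert a random alphanumeric character at a random position.
                pos = rng.randrange(len(chars) + 1)
                chars.insert(pos, rng.choice(string.ascii_letters + string.digits + " "))
        return "".join(chars)

Because each variant is sampled independently and requires no gradient access or prompt optimization, generating many variants costs only forward queries to the target model.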

Examples: See https://github.com/uiuc-focal-lab/stochastic-monkeys/. The repository contains examples demonstrating successful bypasses across multiple LLMs using character-level deletions and insertions with various parameters. Specific examples are provided in Appendix D.5 of the linked paper.
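
Building on the augment_prompt sketch above, a hypothetical attack loop samples a small budget of augmented variants and keeps the first one that elicits a non-refusal. Here query_model and is_refusal are placeholder names for a model API call and a refusal check; they are not functions from the paper's repository.

    def stochastic_attack(prompt: str, num_samples: int = 25) -> str | None:
        """Sample augmented prompt variants until one bypasses alignment.

        Sketch only: `query_model` and `is_refusal` are hypothetical
        stand-ins, and the sample budget echoes the paper's low-cost
        setting but is otherwise an assumption here.
        """
        for i in range(num_samples):
            variant = augment_prompt(prompt, num_edits=5, seed=i)
            response = query_model(variant)   # placeholder: call the target LLM
            if not is_refusal(response):      # placeholder: detect a safety refusal
                return response               # a variant slipped past alignment
        return None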

Impact: Successful exploitation allows malicious actors to bypass LLM safety filters and elicit unsafe outputs (hate speech, instructions for illegal activities, etc.), even with minimal computational resources and technical expertise. This significantly reduces the effectiveness of current safety alignment techniques.

Affected Systems: Multiple LLMs, including but not limited to Llama 2, Llama 3, Llama 3.1, Mistral, Phi 3, Qwen 2, Vicuna, and Zephyr (various sizes and quantization levels). The vulnerability is observed across different safety alignment techniques and decoding strategies. Closed-source models may also be vulnerable if they allow greedy decoding or modification of system prompts.

Mitigation Steps:

  • Restrict or carefully control the use of greedy decoding in LLMs.
  • Implement robust input sanitization and pre-processing, including typo correction and paraphrase detection; a minimal pre-filter sketch follows this list.
  • Explore and implement defensive techniques such as semantic smoothing or adversarial training that generalizes to a wider range of perturbations, including random augmentations. Consider regularization techniques that discourage over-eager responses to unintelligible suffixes. (The effectiveness of these defenses against random augmentations still needs to be evaluated.)
  • Carefully design and implement safety-encouraging system prompts. Further research is needed to determine optimal prompting strategies for different models.
  • Regularly audit and update safety measures to adapt to new attack techniques.
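
As one possible form of the sanitization step above, the heuristic below flags prompts whose tokens look character-noised before they reach the model. The token pattern, the 0.2 threshold, and the correct_typos placeholder are illustrative assumptions, not a vetted defense.

    import re

    # Tokens made only of letters, optionally ending in punctuation, count as "clean".
    _CLEAN_TOKEN = re.compile(r"^[A-Za-z]+[.,!?;:]?$")

    def looks_augmented(prompt: str, max_noise_ratio: float = 0.2) -> bool:
        """Flag prompts with an unusually high share of non-word-like tokens.

        Sketch only: a production filter would also need to handle digits,
        code, and non-English text without excessive false positives.
        """
        tokens = prompt.split()
        if not tokens:
            return False
        noisy = sum(1 for tok in tokens if not _CLEAN_TOKEN.match(tok))
        return noisy / len(tokens) > max_noise_ratio

    # Hypothetical gating before inference:
    # if looks_augmented(user_prompt):
    #     user_prompt = correct_typos(user_prompt)  # placeholder typo-correction pass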

© 2025 Promptfoo. All rights reserved.