LMVD-ID: 69b78897
Published December 1, 2024

Best-of-N Prompt Augmentation

Affected Models: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 8B, GPT-4o-mini, Gemini-1.5-Flash-001, Llama-3-8B-Instruct-RR, Cygnet, Claude 3 Opus, Gemini-1.5-Pro-001, Llama 3.1 8B, DiVA, GPT-4o Realtime API

Research Paper

Best-of-N Jailbreaking


Description: Large Language Models (LLMs) across multiple modalities (text, vision, audio) are vulnerable to a "Best-of-N" (BoN) jailbreaking attack. This attack repeatedly submits slightly modified versions of a harmful prompt (e.g., text with altered capitalization, images with modified text style, audio with altered pitch or speed) until a safety mechanism is bypassed and a harmful response is elicited. The effectiveness of the attack scales with the number of attempts (N). While individual modifications may be innocuous, the cumulative effect of many variations increases the likelihood of bypassing safety filters.
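As a rough illustration, the core attack is a simple sampling loop. The sketch below assumes hypothetical `augment`, `query_model`, and `is_harmful` helpers that stand in for an augmentation function, the target model's API, and a harm classifier; none of these names come from the paper.

```python
def best_of_n_attack(prompt, augment, query_model, is_harmful, n=1000):
    """Try up to N randomly augmented versions of a prompt against a target model.

    `augment`, `query_model`, and `is_harmful` are hypothetical placeholders for
    an augmentation function, the target model's API, and a harm classifier.
    """
    for attempt in range(1, n + 1):
        candidate = augment(prompt)         # apply a fresh random modification
        response = query_model(candidate)   # query the target model
        if is_harmful(response):            # safety mechanism bypassed
            return {"prompt": candidate, "response": response, "attempts": attempt}
    return None                             # no bypass within N attempts
```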

Examples: The paper provides numerous examples across the text, vision, and audio modalities (see arXiv:2405.18540). The specific examples are too numerous to list here, but they involve simple modifications such as:

  • Text: Random capitalization, character scrambling, and character noise (illustrated in the sketch after this list).
  • Vision: Variations in the color, font, size, and position of text rendered in an image.
  • Audio: Changes to playback speed and pitch, and the addition of background noise.
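For the text modality, these augmentations can be sketched roughly as follows. The composition and the per-character probabilities shown here are illustrative assumptions, not the exact parameters used in the paper.

```python
import random
import string

def random_capitalize(text: str, p: float = 0.6) -> str:
    # Upper-case each alphabetic character with probability p.
    return "".join(c.upper() if c.isalpha() and random.random() < p else c for c in text)

def scramble_words(text: str, p: float = 0.2) -> str:
    # Shuffle the interior characters of some words, keeping first/last letters fixed.
    words = []
    for w in text.split():
        if len(w) > 3 and random.random() < p:
            middle = list(w[1:-1])
            random.shuffle(middle)
            w = w[0] + "".join(middle) + w[-1]
        words.append(w)
    return " ".join(words)

def character_noise(text: str, p: float = 0.06) -> str:
    # Replace a small fraction of characters with random ASCII letters.
    return "".join(random.choice(string.ascii_letters) if random.random() < p else c for c in text)

def augment(prompt: str) -> str:
    # Compose the three perturbations to produce one Best-of-N candidate.
    return character_noise(scramble_words(random_capitalize(prompt)))
```

An `augment` function of this kind could serve as the augmentation step in the sampling loop sketched above.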

Impact: Successful exploitation allows attackers to circumvent LLM safety mechanisms and obtain harmful outputs, including but not limited to: instructions for creating harmful substances, malicious code generation, disclosure of personal information, and dissemination of misinformation. The impact is amplified by the multi-modal nature of the vulnerability.

Affected Systems: A wide range of LLMs from various providers, including (but not limited to) those evaluated in the paper, such as GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3. The vulnerability affects both closed-source and open-source models that already include safety mechanisms.

Mitigation Steps: The paper does not offer specific mitigation strategies beyond suggesting that improved safety mechanisms are needed to counter the inherent vulnerability to input variations and the stochastic nature of LLM outputs. Potential mitigation approaches could include:

  • Developing more robust safety filters that are less sensitive to small variations in input (one illustrative approach is sketched after this list).
  • Improving the accuracy of harm detection classifiers so that fewer harmful outputs are missed (false negatives), without over-refusing benign requests.
  • Adversarial training to increase model resilience against this type of attack.
  • Implementing techniques that reduce the stochasticity of the model's responses.
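As one example of the first point above, a minimal sketch of input canonicalization ahead of a safety check is shown below. This is an illustrative defense idea, not a mitigation evaluated in the paper, and `safety_classifier` and `generate` are hypothetical placeholders.

```python
import re
import unicodedata

def canonicalize(prompt: str) -> str:
    """Normalize superficial variation before safety screening."""
    text = unicodedata.normalize("NFKC", prompt)   # unify Unicode look-alike characters
    text = text.lower()                            # remove capitalization tricks
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text

def guarded_generate(prompt: str, safety_classifier, generate):
    # Screen the canonical form of the prompt, not the raw (possibly obfuscated) one.
    if safety_classifier(canonicalize(prompt)):
        return "Request refused by safety policy."
    return generate(prompt)
```

Note that simple canonicalization only addresses surface-level perturbations such as capitalization and spacing; scrambled or noised text would require more aggressive normalization or adversarial training.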
