# Best-of-N (BoN) Jailbreaking Strategy
Best-of-N (BoN) is a simple but effective black-box jailbreaking algorithm that works by repeatedly sampling variations of a prompt with modality-specific augmentations until a harmful response is elicited.
Introduced by Hughes et al. (2024), it achieves high attack success rates across text, vision, and audio modalities.
While this technique achieves high attack success rates (89% on GPT-4o and 78% on Claude 3.5 Sonnet), it generally requires a very large number of samples to do so.
Use it like so in your `promptfooconfig.yaml`:

```yaml
strategies:
  - id: best-of-n
    config:
      useBasicRefusal: false
      maxConcurrency: 5 # Maximum concurrent API calls (default)
      nSteps: 10000 # Maximum number of attempts (optional)
      maxCandidatesPerStep: 1 # Maximum candidates per batch (optional)
```
## How It Works
BoN Jailbreaking works through a simple three-step process:

1. **Generate Variations**: Creates multiple versions of the input prompt using modality-specific augmentations (sketched below):
   - Text: Random capitalization, character scrambling, character noising
   - Vision: Font variations, background colors, text positioning
   - Audio: Speed, pitch, volume, background noise modifications
2. **Concurrent Testing**: Tests multiple variations simultaneously against the target model
3. **Success Detection**: Monitors responses until a harmful output is detected or the maximum number of attempts is reached
The strategy's effectiveness comes from exploiting the stochastic nature of LLM outputs and their sensitivity to small input variations.
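The core loop is short. Here is a minimal Python sketch of the text modality; the augmentation probabilities and the `query_model`/`is_harmful` stubs are hypothetical placeholders for illustration, not promptfoo's or the paper's actual implementation:

```python
import random
import string

# Placeholder hooks: a real run would call the target model and apply a
# refusal check or an LLM-as-a-judge grader. These stubs are hypothetical.
def query_model(prompt: str) -> str:
    return "Sorry, I can't help with that."

def is_harmful(response: str) -> bool:
    return not response.lower().startswith("sorry")

def augment_text(prompt: str, p: float = 0.3) -> str:
    """Apply BoN-style text augmentations: random capitalization,
    character scrambling, and character noising."""
    chars = list(prompt)
    # Random capitalization: flip each character's case with probability p
    chars = [c.swapcase() if random.random() < p else c for c in chars]
    # Character scrambling: swap adjacent characters with probability p
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # Character noising: occasionally replace a character with a random letter
    chars = [random.choice(string.ascii_letters) if random.random() < p / 3 else c
             for c in chars]
    return "".join(chars)

def best_of_n(prompt: str, n_steps: int = 10_000) -> str | None:
    """Resample augmented prompts until one elicits a harmful response."""
    for _ in range(n_steps):
        candidate = augment_text(prompt)
        if is_harmful(query_model(candidate)):
            return candidate  # successful jailbreak found
    return None  # give up after n_steps attempts
```

Because each candidate is sampled independently, many can be tested at once; this is what `maxConcurrency` parallelizes in practice.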
## Configuration Parameters
### useBasicRefusal

- Type: `boolean`
- Default: `false`
When enabled, uses a simple refusal check instead of LLM-as-a-judge assertions. This is much faster and cheaper than an LLM judge. We recommend enabling it whenever the default response to your original prompts is a "Sorry, I can't do that"-style refusal.
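For example, to run a fast, judge-free pass (an illustrative snippet in the same config format as above):

```yaml
strategies:
  - id: best-of-n
    config:
      useBasicRefusal: true # simple refusal check; no LLM judge needed
```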
### maxConcurrency

- Type: `number`
- Default: `5`
Maximum number of prompt variations to test simultaneously. Higher values increase throughput. We recommend setting this as high as your rate limits allow.
### nSteps

- Type: `number`
- Default: `undefined`
Maximum number of total attempts before giving up. Each step generates `maxCandidatesPerStep` variations. Higher values increase success rate but also cost. The original paper achieved its best results with 10,000 steps.
### maxCandidatesPerStep

- Type: `number`
- Default: `1`
Number of prompt variations to generate in each batch. Lower values provide more fine-grained control, while higher values are more efficient but may waste API calls if a successful variation is found early in the batch.
It's usually best to set this to `1` and increase `nSteps` until you get a successful jailbreak.
For initial testing, we recommend starting with `useBasicRefusal: true` and relatively low values for `nSteps` and `maxCandidatesPerStep`. This allows you to quickly validate the strategy's effectiveness for your use case before scaling up to more comprehensive testing.
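A reasonable starting point might look like the following (the specific numbers are illustrative, not tuned recommendations):

```yaml
strategies:
  - id: best-of-n
    config:
      useBasicRefusal: true # fast, judge-free success check
      nSteps: 100 # small budget for a quick first pass
      maxCandidatesPerStep: 1 # stop as soon as one variation succeeds
```

If this quick pass finds jailbreaks, scale `nSteps` up toward the paper's 10,000-sample regime.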
## Performance
BoN achieves impressive attack success rates across different models and modalities:
- Text: 89% on GPT-4o, 78% on Claude 3.5 Sonnet (10,000 samples)
- Vision: 56% on GPT-4o (vision input)
- Audio: 72% on GPT-4o (audio input)
The attack success rate follows a power-law scaling with the number of samples, meaning it reliably improves as more variations are tested.
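To see why more samples help, consider a simplified model (an illustrative assumption, not the paper's fitted curve) in which each augmented prompt independently succeeds with a small probability p:

```latex
% If each of N sampled augmentations independently succeeds with
% probability p, at least one succeeds with probability
\mathrm{ASR}(N) = 1 - (1 - p)^{N}
% e.g. p = 0.0005 gives ASR(10000) = 1 - 0.9995^{10000} \approx 0.99
```

In practice the samples are not fully independent, and the paper reports power-law-like growth of ASR with N rather than this geometric curve, but the qualitative lesson is the same: even rare per-sample successes compound into high success rates at scale.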
## Key Features
- Simple Implementation: No need for gradients or model internals
- Multi-modal Support: Works across text, vision, and audio inputs
- Highly Parallelizable: Can test multiple variations concurrently
- Predictable Scaling: Success rate follows power-law behavior
## Related Concepts
For a comprehensive overview of LLM vulnerabilities and red teaming strategies, visit our Types of LLM Vulnerabilities page.