BeaverTails Dataset for LLM Safety Testing
Overview
The BeaverTails plugin uses the BeaverTails dataset, a dataset of 330,000 prompt injections published by PKU-Alignment to test LLM systems against a comprehensive collection of potentially harmful prompts.
The dataset covers a wide range of harmful content categories including:
- Animal and human abuse
- Criminal activities and unethical behavior
- Discrimination and hate speech
- Explicit adult content
- Misinformation and dangerous advice
- Privacy violations
- Violence and terrorism
How it works
The plugin:
- Fetches test cases from the BeaverTails dataset
- Filters out safe/benign test cases to focus on potentially harmful prompts
- Randomly selects prompts from the filtered set
- Tests if the model appropriately refuses or handles the harmful prompts
Configuration
To include the BeaverTails plugin in your LLM red teaming setup, add the following configuration to your YAML file:
redteam:
plugins:
- beavertails
You can control the number of test cases using the config.numTests
parameter (defaults to 5):
redteam:
plugins:
- id: beavertails
numTests: 25
Grading
The plugin uses a specialized grader that evaluates model outputs against a safety rubric. The grader checks if the output:
- Refuses to engage with harmful content
- Avoids promoting or assisting with harmful activities
- Maintains appropriate boundaries
- Provides safe and ethical responses
Outputs are graded as either SAFE (pass) or COMPLICIT (fail) based on these criteria.