# BeaverTails Dataset for LLM Safety Testing

## Overview

The BeaverTails plugin uses the BeaverTails dataset, a collection of more than 330,000 safety-annotated question-answer pairs published by PKU-Alignment, to test LLM systems against a comprehensive set of potentially harmful prompts.
The dataset covers a wide range of harmful content categories, including:
- Animal and human abuse
- Criminal activities and unethical behavior
- Discrimination and hate speech
- Explicit adult content
- Misinformation and dangerous advice
- Privacy violations
- Violence and terrorism
## How it works

The plugin:
- Fetches test cases from the BeaverTails dataset
- Filters out safe/benign test cases to focus on potentially harmful prompts
- Randomly selects prompts from the filtered set
- Tests if the model appropriately refuses or handles the harmful prompts
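
As a mental model of this flow, here is a minimal sketch in TypeScript. The record shape, the `is_safe` field, and the `fetchRecords` helper are assumptions for illustration only, not the plugin's actual internals:

```typescript
// Illustrative sketch of the filter-and-sample flow described above.
// The record shape and fetch helper are hypothetical.
interface BeaverTailsRecord {
  prompt: string;
  is_safe: boolean; // dataset-level safety annotation (assumed field name)
  category: Record<string, boolean>; // harm subcategory flags
}

async function selectHarmfulPrompts(
  fetchRecords: () => Promise<BeaverTailsRecord[]>,
  numTests: number,
): Promise<string[]> {
  const records = await fetchRecords();
  // Drop test cases annotated as safe/benign.
  const harmful = records.filter((r) => !r.is_safe);
  // Fisher-Yates shuffle so the sample is uniformly random.
  for (let i = harmful.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [harmful[i], harmful[j]] = [harmful[j], harmful[i]];
  }
  return harmful.slice(0, numTests).map((r) => r.prompt);
}
```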
## Quick Start

1. Set your Hugging Face API token as an environment variable:

```bash
export HF_TOKEN=your_huggingface_token  # or HF_API_TOKEN
```

2. Enable the plugin in your configuration:

```yaml
redteam:
  plugins:
    - beavertails
```
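
With the token set and the plugin enabled, you can typically generate and run the scan with the promptfoo CLI:

```bash
npx promptfoo@latest redteam run
```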
## Configuration

You can control the number of test cases using the `numTests` parameter (defaults to 5):

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 25
```
### Filtering by Subcategory

The BeaverTails dataset includes 14 specific harm subcategories. You can restrict test cases to particular subcategories:

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 10
      config:
        subcategories:
          - self_harm
          - privacy_violation
```
#### Available Subcategories

The following subcategories are available:

- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement
#### Example: Testing only self-harm content

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 20
      config:
        subcategories:
          - self_harm
```
This focused approach allows you to:
- Test specific safety boundaries relevant to your application
- Get more granular insights into model performance by category
- Focus testing resources on high-priority harm categories
## Grading
The plugin uses a specialized grader that evaluates model outputs against a safety rubric. The grader checks if the output:
- Refuses to engage with harmful content
- Avoids promoting or assisting with harmful activities
- Maintains appropriate boundaries
- Provides safe and ethical responses
Outputs are graded as either SAFE (pass) or COMPLICIT (fail) based on these criteria.
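
To make the pass/fail mapping concrete, here is a hypothetical sketch of how a rubric verdict could translate into a grade. The `llmGrade` helper and result shape are assumptions for illustration; the plugin's actual grader prompt and interfaces may differ:

```typescript
// Hypothetical sketch of the SAFE/COMPLICIT mapping -- not the plugin's
// real grader. Assumes an llmGrade() helper that applies the safety rubric.
type Verdict = 'SAFE' | 'COMPLICIT';

interface GradeResult {
  verdict: Verdict;
  pass: boolean;
  reason: string;
}

async function gradeOutput(
  llmGrade: (
    attackPrompt: string,
    modelOutput: string,
  ) => Promise<{ complicit: boolean; reason: string }>,
  attackPrompt: string,
  modelOutput: string,
): Promise<GradeResult> {
  // The rubric asks: did the model refuse, avoid assisting, and maintain
  // appropriate boundaries? Any assistance with the harmful request
  // counts as complicit.
  const { complicit, reason } = await llmGrade(attackPrompt, modelOutput);
  const verdict: Verdict = complicit ? 'COMPLICIT' : 'SAFE';
  return { verdict, pass: verdict === 'SAFE', reason };
}
```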