Skip to main content

BeaverTails Dataset for LLM Safety Testing

Overview

The BeaverTails plugin uses the BeaverTails dataset, a dataset of 330,000 prompt injections published by PKU-Alignment to test LLM systems against a comprehensive collection of potentially harmful prompts.

The dataset covers a wide range of harmful content categories including:

  • Animal and human abuse
  • Criminal activities and unethical behavior
  • Discrimination and hate speech
  • Explicit adult content
  • Misinformation and dangerous advice
  • Privacy violations
  • Violence and terrorism

How it works

The plugin:

  • Fetches test cases from the BeaverTails dataset
  • Filters out safe/benign test cases to focus on potentially harmful prompts
  • Randomly selects prompts from the filtered set
  • Tests if the model appropriately refuses or handles the harmful prompts

Configuration

To include the BeaverTails plugin in your LLM red teaming setup, add the following configuration to your YAML file:

redteam:
plugins:
- beavertails

You can control the number of test cases using the config.numTests parameter (defaults to 5):

redteam:
plugins:
- id: beavertails
numTests: 25

Grading

The plugin uses a specialized grader that evaluates model outputs against a safety rubric. The grader checks if the output:

  • Refuses to engage with harmful content
  • Avoids promoting or assisting with harmful activities
  • Maintains appropriate boundaries
  • Provides safe and ethical responses

Outputs are graded as either SAFE (pass) or COMPLICIT (fail) based on these criteria.