# DoNotAnswer Dataset

## Overview
The DoNotAnswer plugin tests how well LLMs handle harmful queries. The dataset contains questions that responsible AI systems should refuse to answer, across categories like:
- Discrimination and hate speech
- Privacy violations
- Adult content
- Social stereotypes
- Human-chatbot interaction issues
Each test includes metadata about the risk area and type of harm.
## How it works
The plugin:
- Loads harmful queries from the dataset
- Selects and shuffles a specified number of queries, turning each into a test case (see the sketch after this list)
- Tests if the model appropriately refuses each harmful request
- Uses LLM-as-a-judge to evaluate responses based on harm categories
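Conceptually, each selected query becomes one test case whose variables and metadata are handed to the grader. Below is a minimal sketch of what a single generated test case might look like; the field names are illustrative assumptions and may not match promptfoo's internal representation exactly.

```yaml
# Illustrative sketch only: the plugin builds these test cases internally,
# so the exact structure and field names may differ.
- vars:
    query: '<harmful question drawn from the DoNotAnswer dataset>'
  metadata:
    risk_area: 'Information Hazards'
    types_of_harm: 'Privacy Violations'
    specific_harms: '<short description of the specific harm>'
  # The model's response to the query is then judged on whether it refused.
```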
## Configuration
Add to your YAML file:
```yaml
redteam:
  plugins:
    - donotanswer
```
Control the number of tests:
```yaml
redteam:
  plugins:
    - id: donotanswer
      numTests: 25
```
## Example
```yaml
description: DoNotAnswer evaluation

prompts:
  - 'Respond to the following query or question: {{query}}'

redteam:
  plugins:
    - donotanswer

targets:
  - openai:gpt-4o-mini
```
Run with:
```sh
promptfoo redteam run
```
Once the run is complete, you can view the evaluation results. Click the 🔎 icon on a result to see the full input and output, as well as DoNotAnswer metadata such as the tagged risk area, types of harm, and specific harms.
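If the results don't open automatically, you can typically launch the web viewer from the CLI with `promptfoo view`, or (depending on your promptfoo version) open the dedicated red team report with `promptfoo redteam report`.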
## Grading
Responses are evaluated on whether the model refused to answer the harmful query and avoided providing harmful content.
## Dataset

Each entry in the dataset includes the following fields (see the illustrative example after this list):
- Risk area (e.g., "Discrimination," "Information Hazards")
- Types of harm (e.g., "Adult Content," "Privacy Violations")
- Specific harms description
- The query itself
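For reference, here is a sketch of what a single dataset entry might look like. The values are made up for illustration, and the exact column names in the published dataset may differ slightly from what is shown here.

```yaml
# Hypothetical example of one DoNotAnswer entry (values invented for illustration).
risk_area: 'Discrimination'
types_of_harm: 'Social stereotypes'
specific_harms: '<why answering this query would be harmful>'
query: '<the harmful query the model is expected to refuse>'
```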