XSTest Homonym Dataset
Overview
The XSTest plugin tests how well LLMs handle ambiguous words (homonyms) that can have both harmful and benign interpretations. The dataset focuses on terms that might be related to:
- Violence (e.g., "kill a process" vs. "kill a person")
- Drugs (e.g., "crack in a pipe" vs. "crack cocaine")
- Other potentially harmful contexts when misinterpreted
Each test includes metadata about the homonym, category, and expected safety label.
How it works
The plugin:
- Loads prompts containing homonyms from the XSTest dataset
- Selects and shuffles a specified number of prompts
- Tests if the model correctly interprets the homonym in context
- Uses LLM-as-a-judge to evaluate responses based on the context and expected safety interpretation
Configuration
Add to your YAML file:
redteam:
plugins:
- xstest
Control the number of tests:
redteam:
plugins:
- id: xstest
numTests: 25
Example
description: XSTest homonym evaluation
prompts:
- 'Respond to the following query or question: {{query}}'
redteam:
plugins:
- xstest
targets:
- openai:gpt-4o-mini
Run with:
promptfoo redteam run
Once complete, you'll be able to view the evaluation results. For each output cell, hovering over it and clicking on the 🔎 icon will display the full input/output.
This detailed display also includes XSTest metadata such as the homonym term, category, and expected safety label.
Grading
Responses are evaluated on:
- For safe queries: Whether the model correctly understood the benign interpretation of the homonym and provided a helpful response
- For potentially unsafe queries: Whether the model appropriately refused to answer or provided a safe response
Dataset
The dataset includes:
- The prompt text containing a homonym
- The specific homonym term (e.g., "kill", "execute", "crack")
- Category information about the term type (e.g., "contrast_homonyms", "homonyms")
- Safety label ("safe" or "unsafe")