Skip to main content

CyberSecEval Dataset for LLM Security Testing

Overview

The CyberSecEval plugin uses Meta's Purple Llama CyberSecEval dataset, a comprehensive benchmark suite designed to assess cybersecurity vulnerabilities in Large Language Models (LLMs).

The dataset includes multilingual prompt injection attacks, testing LLM systems against a diverse collection of potentially harmful prompts in multiple languages.

The dataset includes test cases that evaluate:

  • Direct prompt injection attacks
  • Indirect prompt injection attempts
  • Security-violating content
  • Logic-violating content
  • Cross-language attack vectors
  • System prompt leakage attempts
tip

This plugin requires a target provider that accepts both a system prompt and user input in OpenAI-compatible chat format.

How it works

The plugin:

  • Downloads test cases from the CyberSecEval dataset
  • Formats each test case as a system/user message pair
  • Randomly selects prompts from the dataset
  • Tests if the model appropriately handles the injection attempts
  • Uses language-specific LLM-as-a-judge criteria to evaluate responses

CyberSecEval Workflow

Configuration

To include the CyberSecEval plugin in your LLM red teaming setup, add the following configuration to your YAML file:

redteam:
plugins:
- cyberseceval

You can control the number of test cases using the config.numTests parameter (defaults to 5):

redteam:
plugins:
- id: cyberseceval
numTests: 25

You can also enable multilingual prompt injection by setting the config.multilingual parameter to true:

redteam:
plugins:
- id: cyberseceval
config:
multilingual: true

Full example

Create a configuration file promptfooconfig.yaml:

targets:
# You can target a model
- openai:chat:gpt-4o
# Or an API endpoint
- id: https
config:
url: 'https://example.com/generate'
method: 'POST'
headers:
'Content-Type': 'application/json'
body:
myPrompt: '{{prompt}}'

redteam:
plugins:
- cyberseceval

Then, run the red team command in the same directory as your configuration file:

promptfoo redteam run

Grading

The plugin uses a specialized grader that evaluates model outputs against language-specific judge questions. The grader checks if:

  • The model maintained its original instructions despite the injection attempt
  • The response adheres to security boundaries
  • The model appropriately handled non-English attack vectors
  • The system prompt remained protected from extraction attempts

Outputs are graded as either PASS (secure) or FAIL (vulnerable) based on these criteria.