Configuration
The YAML configuration format runs each prompt through a series of example inputs (aka "test cases") and checks whether they meet requirements (aka "asserts").
Asserts are optional. Many people get value out of reviewing outputs manually, and the web UI helps facilitate this.
Examples
Let's imagine we're building an app that does language translation. This config runs each prompt through GPT-3.5 and Vicuna, substituting two variables, language and input:
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
tests:
  - vars:
      language: French
      input: Hello world
  - vars:
      language: German
      input: How's it going?
For more information on setting up a prompt file, see input and output files.
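For illustration, prompt1.txt might look something like this (a sketch that assumes the prompt interpolates the language and input variables defined above):
Translate the following text to {{language}}: {{input}}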
Running promptfoo eval over this config will result in a matrix view that you can use to evaluate GPT vs Vicuna.
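For example, assuming promptfoo is installed (otherwise prefix the commands with npx):
promptfoo eval   # run the evaluation matrix
promptfoo view   # open the web UI to review outputs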
Use assertions to validate output
Next, let's add an assertion. This automatically rejects any outputs that don't contain JSON:
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains-json
  - vars:
      language: German
      input: How's it going?
We can create additional tests. Let's add a couple of other assertion types. Use an array of assertions on a single test case to ensure all conditions are met.
In this example, the javascript assertion runs JavaScript against the LLM output, and the similar assertion checks for semantic similarity using embeddings:
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains-json
      - type: javascript
        value: output.toLowerCase().includes('bonjour')
  - vars:
      language: German
      input: How's it going?
    assert:
      - type: similar
        value: was geht
        threshold: 0.6 # cosine similarity
To learn more about assertions, see docs on configuring expected outputs.
Avoiding repetition
Default test cases
Use defaultTest to set properties for all tests.
In this example, we use a model-graded-closedqa assertion to ensure that the LLM does not refer to itself as an AI. This check applies to all test cases:
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
defaultTest:
  assert:
    - type: model-graded-closedqa
      value: does not describe self as an AI, model, or chatbot
tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains-json
      - type: javascript
        value: output.toLowerCase().includes('bonjour')
  - vars:
      language: German
      input: How's it going?
    assert:
      - type: similar
        value: was geht
        threshold: 0.6
You can also use defaultTest to override the model used for each test. This can be useful for model-graded evals:
defaultTest:
  options:
    provider: openai:gpt-3.5-turbo-0613
YAML references
promptfoo configurations support JSON schema references, which define reusable blocks.
Use the $ref key to reuse assertions without having to fully define them more than once. Here's an example:
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - $ref: '#assertionTemplates/startsUpperCase'
  - vars:
      language: German
      input: How's it going?
    assert:
      - $ref: '#assertionTemplates/noAIreference'
      - $ref: '#assertionTemplates/startsUpperCase'
assertionTemplates:
  noAIreference:
    type: model-graded-closedqa
    value: does not describe self as an AI, model, or chatbot
  startsUpperCase:
    type: javascript
    value: output[0] === output[0].toUpperCase()
Import tests from separate files
The tests config attribute takes a list of paths to files or directories. For example:
prompts: prompts.txt
providers: openai:gpt-3.5-turbo
# Loads and runs all test cases matching these filepaths
tests:
  # You can supply an exact filepath
  - tests/tests2.yaml
  # Or a glob (wildcard)
  - tests/*
  # Mix and match with actual test cases
  - vars:
      var1: foo
      var2: bar
A single string is also valid:
tests: tests/*
Or a list of paths:
tests: ['tests/accuracy', 'tests/creativity', 'tests/hallucination']
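For reference, an external test file such as tests/tests2.yaml is typically just a list of test cases (a sketch; the vars and asserts here are placeholders):
- vars:
    language: French
    input: Hello world
  assert:
    - type: contains-json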
Import vars from separate files
The vars attribute can point to a file or directory. For example:
tests:
  - vars: path/to/vars*.yaml
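A vars file of this kind might contain a flat map of variable names to values, for example (a sketch; the keys should match the variables your prompts expect):
language: French
input: Hello world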
You can also load individual variables from a file by using the file:// prefix. For example:
tests:
  - vars:
      var1: some value...
      var2: another value...
      var3: file://path/to/var3.txt
Multiple variables in a single test case
The vars map in the test also supports array values. If values are an array, the test case will run each combination of values.
For example:
prompts: prompts.txt
providers: [openai:gpt-3.5-turbo, openai:gpt-4]
tests:
  - vars:
      language: [French, German, Spanish]
      input: ['Hello world', 'Good morning', 'How are you?']
    assert:
      - type: similar
        value: 'Hello world'
        threshold: 0.8
This evaluates each language x input combination: with 3 languages and 3 inputs, there are 9 test cases per provider.
Using nunjucks templates
In the above examples, vars values are strings. But vars can be any JSON or YAML entity, including nested objects. You can manipulate these objects in your prompts, which are Nunjucks templates.
For example, consider this test case, which lists a handful of user and assistant messages in an OpenAI-compatible format:
tests:
  - vars:
      previous_messages:
        - role: user
          content: hello world
        - role: assistant
          content: how are you?
        - role: user
          content: great, thanks
The corresponding prompt.txt file simply passes through the previous_messages object, using the dump and safe filters to convert the object to a JSON string:
{{ previous_messages | dump | safe }}
Running promptfoo eval -p prompt.txt -c path_to.yaml will call the Chat Completion API with the following prompt:
[
  {
    "role": "user",
    "content": "hello world"
  },
  {
    "role": "assistant",
    "content": "how are you?"
  },
  {
    "role": "user",
    "content": "great, thanks"
  }
]
Use Nunjucks templates to exert additional control over your prompt templates, including loops, conditionals, and more.
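For instance, a prompt template could loop over the messages instead of dumping them verbatim (a sketch using standard Nunjucks loop syntax):
{% for message in previous_messages %}
{{ message.role }}: {{ message.content }}
{% endfor %}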
Other capabilities
Functions
promptfoo supports OpenAI functions and other provider-specific configurations like temperature, number of tokens, and so on.
To use, override the config key of the provider. See example here.
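For example, a provider entry with an overridden config might look something like this (a sketch; the available config keys depend on the provider):
providers:
  - id: openai:gpt-3.5-turbo
    config:
      temperature: 0
      max_tokens: 512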
Postprocessing
The TestCase.options.postprocess field is a JavaScript snippet that modifies the LLM output. Postprocessing occurs before any assertions are run.
Postprocess is a function that takes a string output and a context object:
postprocessFn: (output: string, context: {
  vars: Record<string, any>
})
This is useful if you need to somehow transform or clean LLM output before running an eval.
For example:
# ...
tests:
  - vars:
      language: French
      body: Hello world
    options:
      postprocess: output.toUpperCase()
# ...
Or multiline:
# ...
tests:
  - vars:
      language: French
      body: Hello world
    options:
      postprocess: |
        output = output.replace(context.vars.language, 'foo');
        const words = output.split(' ').filter(x => !!x);
        return JSON.stringify(words);
# ...
Tip: use defaultTest to apply a postprocessing option to every test case in your test suite.
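For example (a sketch combining defaultTest with the options shown above; the snippet itself is just an illustration):
defaultTest:
  options:
    postprocess: output.trim()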
Configuration structure
For detailed information on the config structure, see Configuration Reference.
Loading tests from CSV
YAML is nice, but some organizations maintain their LLM tests in spreadsheets for ease of collaboration. promptfoo supports a special CSV file format.
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
tests: tests.csv
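In its simplest form, each CSV column corresponds to a test variable. A tests.csv for the translation example might look something like this (a sketch; see the docs for the full CSV format):
language,input
French,Hello world
German,How's it going?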
promptfoo also has a built-in ability to pull test cases from a Google Sheet. The sheet must be visible to "anyone with the link". For example:
prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, localai:chat:vicuna]
tests: https://docs.google.com/spreadsheets/d/1eqFnv1vzkPvS7zG-mYsqNDwOzvSaiIAsKB3zKg9H18c/edit?usp=sharing
Here's a full example. See also: import tests from another file.