Getting started
Install promptfoo and set up your first config file by running this command with npx, npm, or brew:
- npx:
  npx promptfoo@latest init --example getting-started
- npm:
  npm install -g promptfoo
  promptfoo init --example getting-started
- brew:
  brew install promptfoo
  promptfoo init --example getting-started
This will create a new directory with a basic example that tests translation prompts across different models. The example includes:
- A configuration file, promptfooconfig.yaml, with sample prompts, providers, and test cases.
- A README.md file explaining how the example works.
If you prefer to start from scratch instead of using the example, run promptfoo init without the --example flag. The command will guide you through an interactive setup process.
Configuration
Next, we can review the example configuration file and make changes to it.
- Set up your prompts: Open promptfooconfig.yaml and add prompts that you want to test. Use double curly braces as placeholders for variables: {{variable_name}}. For example:

  prompts:
    - 'Convert this English to {{language}}: {{input}}'
    - 'Translate to {{language}}: {{input}}'

- Add providers to specify which AI models you want to test. promptfoo supports 50+ providers including OpenAI, Anthropic, Google, and many others:

  providers:
    - openai:o3-mini
    - anthropic:messages:claude-3-5-sonnet-20241022
    - vertex:gemini-pro
    # Or use your own custom provider
    - file://path/to/custom/provider.py

  Each provider is specified using a simple format: provider_name:model_name. For example:
  - openai:o3-mini for OpenAI's o3-mini
  - anthropic:messages:claude-3-5-sonnet-20241022 for Anthropic's Claude
  - bedrock:us.meta.llama3-3-70b-instruct-v1:0 for Meta's Llama 3.3 70B via AWS Bedrock

  Most providers need authentication. For example, with OpenAI:

  export OPENAI_API_KEY=sk-abc123

  You can use:
  - Cloud APIs: OpenAI, Anthropic, Google, Mistral, and many more
  - Local Models: Ollama, llama.cpp, LocalAI
  - Custom Code: Python, JavaScript, or any executable

  » See our full providers documentation for detailed setup instructions for each provider.
- Add test inputs: Add some example inputs for your prompts. Optionally, add assertions to set output requirements that are checked automatically. For example:

  tests:
    - vars:
        language: French
        input: Hello world
    - vars:
        language: Spanish
        input: Where is the library?

  When writing test cases, think of core use cases and potential failures that you want to make sure your prompts handle correctly.
- Run the evaluation: Make sure you're in the directory containing promptfooconfig.yaml, then run:
  - npx:
    npx promptfoo@latest eval
  - npm:
    promptfoo eval
  - brew:
    promptfoo eval

  This tests every prompt, model, and test case.
- After the evaluation is complete, open the web viewer to review the outputs:
  - npx:
    npx promptfoo@latest view
  - npm:
    promptfoo view
  - brew:
    promptfoo view
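
The providers step above mentions pointing a provider at a Python file (file://path/to/custom/provider.py). As a minimal sketch, such a file might look like the following. It assumes promptfoo's Python provider convention of a module-level call_api(prompt, options, context) function that returns a dictionary with an "output" key; check the providers documentation for the exact contract. The echo logic and the "prefix" config key are hypothetical placeholders, not a real model call.

```python
# provider.py: hypothetical custom provider, for illustration only.
# Assumption: promptfoo calls call_api(prompt, options, context) and
# reads the "output" key of the returned dictionary.


def call_api(prompt, options=None, context=None):
    """Return a canned response instead of calling a real model."""
    options = options or {}
    config = options.get("config") or {}
    # "prefix" is a made-up config key used only in this sketch.
    prefix = config.get("prefix", "ECHO")
    return {"output": f"{prefix}: {prompt}"}


if __name__ == "__main__":
    # Quick local check without running promptfoo at all.
    print(call_api("Translate to French: Hello world")["output"])
```

A stub like this can be useful for checking that a configuration is wired up correctly before spending API credits on a hosted model.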
Configuration
The YAML configuration format runs each prompt through a series of example inputs (each one a "test case") and checks whether the outputs meet your requirements (each one an "assert").
Asserts are optional. Many people get value out of reviewing outputs manually, and the web UI helps facilitate this.
See the Configuration docs for a detailed guide.
Example YAML:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - file://prompts.txt
providers:
  - openai:gpt-4o-mini
tests:
  - description: First test case - automatic review
    vars:
      var1: first variable's value
      var2: another value
      var3: some other value
    assert:
      - type: equals
        value: expected LLM output goes here
      - type: javascript
        value: output.includes('some text')
  - description: Second test case - manual review
    # Test cases don't need assertions if you prefer to review the output yourself
    vars:
      var1: new value
      var2: another value
      var3: third value
  - description: Third test case - other types of automatic review
    vars:
      var1: yet another value
      var2: and another
      var3: dear llm, please output your response in json format
    assert:
      - type: contains-json
      - type: similar
        value: ensures that output is semantically similar to this text
      - type: llm-rubric
        value: must contain a reference to X
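
To make the "every prompt, model, and test case" idea concrete, the sketch below expands a small configuration into the individual calls an evaluation would make. It is a simplified stand-in, not promptfoo's actual templating or scheduling code: the render and expand helpers are hypothetical names, and the regex substitution handles only plain {{name}} placeholders.

```python
import re
from itertools import product

# Values borrowed from the getting-started steps, held as plain Python data.
prompts = [
    "Convert this English to {{language}}: {{input}}",
    "Translate to {{language}}: {{input}}",
]
providers = ["openai:o3-mini", "vertex:gemini-pro"]
tests = [
    {"vars": {"language": "French", "input": "Hello world"}},
    {"vars": {"language": "Spanish", "input": "Where is the library?"}},
]


def render(template, variables):
    """Replace each {{name}} placeholder with its value from variables."""
    def lookup(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lookup, template)


def expand(prompts, providers, tests):
    """Yield one (provider, rendered prompt) pair per combination."""
    for prompt, provider, test in product(prompts, providers, tests):
        yield provider, render(prompt, test["vars"])


cases = list(expand(prompts, providers, tests))
print(len(cases))  # 2 prompts * 2 providers * 2 tests = 8
print(cases[0])
```

Because the grid grows multiplicatively, adding one provider to this sketch adds four more calls, which is worth keeping in mind when estimating API cost.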
Examples
Prompt quality
In this example, we evaluate whether adding adjectives to the personality of an assistant bot affects the responses.
Here is the configuration:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
# Load prompts
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
# Set an LLM
providers:
  - openai:gpt-4o-mini
# These test properties are applied to every test
defaultTest:
  assert:
    # Verify that the output doesn't contain "AI language model"
    - type: not-contains
      value: AI language model
    # Verify that the output doesn't apologize
    - type: llm-rubric
      value: must not contain an apology
    # Prefer shorter outputs using a scoring function
    - type: javascript
      value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));
# Set up individual test cases
tests:
  - vars:
      name: Bob
      question: Can you help me find a specific product on your website?
    assert:
      - type: contains
        value: search
  - vars:
      name: Jane
      question: Do you have any promotions or discounts currently available?
    assert:
      - type: starts-with
        value: Yes
  - vars:
      name: Ben
      question: Can you check the availability of a product at a specific store location?
  # ...
A simple npx promptfoo@latest eval will run this example from the command line. This command will evaluate the prompts, substituting variable values, and output the results in your terminal.
Have a look at the setup and full output here.
You can also output a nice spreadsheet, JSON, YAML, or an HTML file:
- npx:
  npx promptfoo@latest eval -o output.html
- npm:
  promptfoo eval -o output.html
- brew:
  promptfoo eval -o output.html
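
The "prefer shorter outputs" assertion in the configuration above is easier to reason about with a few worked values. The sketch below ports the JavaScript expression to Python purely for illustration (length_score is a hypothetical name). Python's len counts code points while JavaScript's length counts UTF-16 units, so the two agree for ASCII text but can differ for some Unicode.

```python
def length_score(output: str) -> float:
    """Python port of: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900))."""
    return max(0.0, min(1.0, 1 - (len(output) - 100) / 900))


# Outputs of 100 characters or fewer get the full score; the score then
# falls linearly and reaches 0 at 1000 characters.
for n in (50, 100, 550, 1000, 2000):
    print(n, length_score("x" * n))
```

The clamp to the range 0 to 1 means very long outputs are not penalized further once the score hits zero, and very short outputs earn no bonus.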
Model quality
In this next example, we evaluate the difference between GPT-4o and GPT-4o mini outputs for a given prompt:
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
# Set the LLMs we want to test
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
A simple npx promptfoo@latest eval will run the example. Also note that you can override parameters directly from the command line. For example, this command:
- npx:
  npx promptfoo@latest eval -p prompts.txt -r openai:gpt-4o-mini openai:gpt-4o -o output.html
- npm:
  promptfoo eval -p prompts.txt -r openai:gpt-4o-mini openai:gpt-4o -o output.html
- brew:
  promptfoo eval -p prompts.txt -r openai:gpt-4o-mini openai:gpt-4o -o output.html
produces an HTML table of the results in output.html. Full setup and output here.
A similar approach can be used to run other model comparisons. For example, you can:
- Compare the same model at different temperatures (see GPT temperature comparison)
- Compare Llama vs. GPT (see Llama vs GPT benchmark)
- Compare Retrieval-Augmented Generation (RAG) with LangChain vs. regular GPT-4 (see LangChain example)
Other examples
There are many examples available in the examples/ directory of our GitHub repository.
Automatically assess outputs
The above examples create a table of outputs that can be manually reviewed. By setting up assertions, you can automatically grade outputs on a pass/fail basis.
For more information on automatically assessing outputs, see Assertions & Metrics.
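
As a rough illustration of pass/fail grading, the sketch below checks one output against a list of assertions shaped like the YAML examples in this guide. It is not promptfoo's implementation: CHECKS and grade are hypothetical names, only four simple assertion types are covered, and plain case-sensitive string semantics are assumed for each.

```python
# Each check takes (output, value) and returns True when the assertion passes.
# Assumption: simple, case-sensitive string comparisons.
CHECKS = {
    "equals": lambda output, value: output == value,
    "contains": lambda output, value: value in output,
    "not-contains": lambda output, value: value not in output,
    "starts-with": lambda output, value: output.startswith(value),
}


def grade(output, assertions):
    """Return (passed, failures), where failures lists the failed assertions."""
    failures = [
        a for a in assertions
        if not CHECKS[a["type"]](output, a["value"])
    ]
    return (not failures, failures)


assertions = [
    {"type": "starts-with", "value": "Yes"},
    {"type": "not-contains", "value": "AI language model"},
]

print(grade("Yes, we have a sale today.", assertions))
print(grade("As an AI language model, I cannot check.", assertions))
```

An output passes only when every assertion passes, so the list of failures doubles as a short explanation of why a given cell in the results table failed.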