Skip to main content

Getting started

After installing promptfoo, you can set up your first config file in a few ways:

Running an example

Set up your first config file with a pre-built example by running this command with npx, npm, or brew:

npx promptfoo@latest init --example getting-started

This will create a new directory with a basic example that tests translation prompts across different models. The example includes:

  • A configuration file promptfooconfig.yaml with sample prompts, providers, and test cases.
  • A README.md file explaining how the example works.

Most providers need authentication. For OpenAI:

export OPENAI_API_KEY=sk-abc123

Then navigate to the example directory, run the eval, and view results:

cd getting-started
npx promptfoo@latest eval
npx promptfoo@latest view

Set up via the CLI

To start from scratch, run promptfoo init to create a config through an interactive CLI walkthrough:

npx promptfoo@latest init

Set up via the Web UI

If you prefer a visual interface, run promptfoo eval setup to configure your first eval through the web UI:

npx promptfoo@latest eval setup

This opens a browser-based setup flow that walks you through creating prompts, choosing providers, and adding test cases.

Promptfoo eval setup Web UI

Configuration

To configure your evaluation:

  1. Set up your prompts: Open promptfooconfig.yaml and add prompts that you want to test. Use double curly braces for variable placeholders: {{variable_name}}. For example:

    prompts:
    - 'Convert the following English text to {{language}}: {{input}}'

    » More information on setting up prompts

  2. Add providers to specify AI models you want to test. Promptfoo supports 60+ providers including OpenAI, Anthropic, Google, and many others:

    providers:
    - openai:gpt-5.2
    - openai:gpt-5-mini
    - anthropic:messages:claude-opus-4-6
    - google:gemini-3-pro-preview
    # Or use your own custom provider
    - file://path/to/custom/provider.py

    This includes cloud APIs, local models like Ollama, and custom Python or JavaScript code.

    » See all providers

  3. Add test inputs: Add some example inputs for your prompts. Optionally, add assertions to set output requirements that are checked automatically.

    For example:

    tests:
    - vars:
    language: French
    input: Hello world
    assert:
    - type: contains
    value: 'Bonjour le monde'
    - vars:
    language: Spanish
    input: Where is the library?
    assert:
    - type: icontains
    value: 'Dónde está la biblioteca'

    When writing test cases, think of core use cases and potential failures that you want to make sure your prompts handle correctly.

    » More information on setting up tests

  4. Run the evaluation: Make sure you're in the directory containing promptfooconfig.yaml, then run:

    npx promptfoo@latest eval

    This tests every prompt, model, and test case.

  5. After the evaluation is complete, open the web viewer to review the outputs:

    npx promptfoo@latest view

Promptfoo Web UI showing evaluation results

Asserts

The YAML configuration format runs each prompt through a series of test cases and checks if they meet the specified asserts.

Asserts are optional. Many people get value out of reviewing outputs manually, and the web UI helps facilitate this.

tip

See the Configuration docs for a detailed guide.

Examples

Prompt quality

In this example, we evaluate whether adding adjectives to the personality of an assistant bot affects the responses.

You can quickly set up this example by running:

npx promptfoo@latest init --example eval-self-grading
Show YAML file for this example
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Automatic response evaluation using LLM rubric scoring

# Load prompts
prompts:
- file://prompts.txt
providers:
- openai:gpt-5.2
defaultTest:
assert:
- type: llm-rubric
value: Do not mention that you are an AI or chat assistant
- type: javascript
# Shorter is better
value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));
tests:
- vars:
name: Bob
question: Can you help me find a specific product on your website?
- vars:
name: Jane
question: Do you have any promotions or discounts currently available?
- vars:
name: Ben
question: Can you check the availability of a product at a specific store location?
- vars:
name: Dave
question: What are your shipping and return policies?
- vars:
name: Jim
question: Can you provide more information about the product specifications or features?
- vars:
name: Alice
question: Can you recommend products that are similar to what I've been looking at?
- vars:
name: Sophie
question: Do you have any recommendations for products that are currently popular or trending?
- vars:
name: Jessie
question: How can I track my order after it has been shipped?
- vars:
name: Kim
question: What payment methods do you accept?
- vars:
name: Emily
question: Can you help me with a problem I'm having with my account or order?

From the newly created directory, run npx promptfoo@latest eval to execute this example:

promptfoo command line

This command will evaluate the prompts, substituting variable values, and output the results in your terminal.

You can also output a spreadsheet, JSON, YAML or HTML.

Model quality

In this next example, we evaluate the difference between GPT-5 and GPT-5.2 outputs for a given prompt:

You can quickly set up this example by running:

npx promptfoo@latest init --example compare-openai-models
Show YAML file for this example
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Comparing OpenAI flagship and mini models performance on riddles

prompts:
- 'Solve this riddle: {{riddle}}'

providers:
- openai:gpt-5
- openai:gpt-5-mini

defaultTest:
assert:
# Inference should always cost less than this (USD)
- type: cost
threshold: 0.002
# Inference should always be faster than this (milliseconds)
- type: latency
threshold: 3000

tests:
- vars:
riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
assert:
# Make sure the LLM output contains this word
- type: contains
value: echo
# Use model-graded assertions to enforce free-form instructions
- type: llm-rubric
value: Do not apologize
- vars:
riddle: "You see a boat filled with people. It has not sunk, but when you look again you don't see a single person on the boat. Why?"
assert:
- type: llm-rubric
value: explains that there are no single people (they are all married)
- vars:
riddle: 'The more of this there is, the less you see. What is it?'
assert:
- type: contains
value: darkness
- vars:
riddle: >-
I have keys but no locks. I have space but no room. You can enter, but
can't go outside. What am I?
- vars:
riddle: >-
I am not alive, but I grow; I don't have lungs, but I need air; I don't
have a mouth, but water kills me. What am I?
- vars:
riddle: What can travel around the world while staying in a corner?
- vars:
riddle: Forward I am heavy, but backward I am not. What am I?
- vars:
riddle: >-
The person who makes it, sells it. The person who buys it, never uses
it. The person who uses it, doesn't know they're using it. What is it?
- vars:
riddle: I can be cracked, made, told, and played. What am I?
- vars:
riddle: What has keys but can't open locks?
- vars:
riddle: >-
I'm light as a feather, yet the strongest person can't hold me for much
more than a minute. What am I?
- vars:
riddle: >-
I can fly without wings, I can cry without eyes. Whenever I go, darkness
follows me. What am I?
- vars:
riddle: >-
I am taken from a mine, and shut up in a wooden case, from which I am
never released, and yet I am used by almost every person. What am I?
- vars:
riddle: >-
David's father has three sons: Snap, Crackle, and _____? What is the
name of the third son?
- vars:
riddle: >-
I am light as a feather, but even the world's strongest man couldn't
hold me for much longer than a minute. What am I?

Navigate to the newly created directory and run npx promptfoo@latest eval or promptfoo eval. Also note that you can override parameters directly from the command line.

For example, if you run this command:

npx promptfoo@latest eval -r google:gemini-3-pro-preview google:gemini-2.5-pro

It produces the following table, with Gemini models replacing the GPT models in the config:

Side-by-side eval of LLM model quality, gemini-3.0-pro vs gemini-2.5

A similar approach can be used to run other model comparisons. For example, you can:

Next steps

Now that you've run your first eval, here are some ways to go deeper:

Customize your setup:

Explore use cases:

Learn from examples: